By Brendan Schierloh

From Classroom to Campaign: How I used regex & multithreading to collate opposition research

Conceptually in Class: Finding relevant applications of academia in industry

The gap between what's practical in the university setting and what's practical in 'the industry' is a hotly debated topic among computer science students like myself. Students looking to distract themselves from the demanding workload speculate on how much of the material will apply to 'the industry.' The industry is in quotes because, for many of us, it exists as terra incognita. Yes, the goal is to get there. Yes, we've met and talked with those in it. But we are in university, not there. We are not in the industry, nor do we have experience of it. Rumors fly, and the coursework seems excessive and contradictory to the stories: “Google engineers only write one line of code a day” or “Microsoft workers get paid 550k to work 6 hours a month.” We have no idea, and we are skeptical that every algorithm, hashmap, and discrete math proof will help our future. Getting into the Bluebonnet cohort, I was unsure how much my CS coursework would come into play.


I had transferred from a community college to a four-year university and assumed I had left the 'weeder' classes behind. It would be all fun and optional from this point on. I was getting ready for the industry; Google search and StackOverflow were going to write my code now, not the concepts I learned in class. The weeder classes did disappear, but something else took their place: what I coined 'groan' classes. These classes were entirely passable but included concepts students regarded as “not necessary” in the industry. A collective groan would rise from the hall whenever these concepts appeared in a lecture. Groan course number one, CSE 102, topic: regular expressions (regex for short). Groan course number two, CSE 130, topic: concurrency and multithreading.



Applying Academia: Converting concepts into usable solutions


As a way to further prepare for the industry, I decided to gain some real-world experience through Bluebonnet Data’s Data Fellowship. This opportunity brought me to my team’s first client, Focus Action Network, an organization that provides progressives with project management, text & call banking, relational organizing, tech development, legal support, and communications services.


Focus Action Network wanted opposition research to determine whether a candidate was moderate or had extremist/regressive elements in their voting patterns. The voting records were on a public website, but spread across so many pages that at the outset we didn't even know how many votes there were. After creating the final product, I discovered each year had about 2,500 web pages, across ten years, meaning nearly 25,000 web pages would have to be looked at. For the state senate, this meant 500k voting records across the years; for the state house, nearly 3.5 million. A total of 4 million voting records, ouch. There was zero, zero percent chance that any of this could be accomplished by hand; it would take tricks. Of course, as I’d come to find out, the tricks were CS principles I had gained during my time at university.
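Put in code, the back-of-envelope math (using exactly the numbers above) works out like this:

```python
# Rough scale of the scraping problem, using the numbers above.
pages_per_year = 2_500
years = 10
total_pages = pages_per_year * years            # ~25,000 web pages

senate_records = 500_000                        # state senate votes
house_records = 3_500_000                       # state house votes
total_records = senate_records + house_records  # ~4,000,000 votes

print(f"{total_pages:,} pages holding {total_records:,} voting records")
```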






I began to iterate through each part of the problem, breaking it down into smaller and smaller pieces until the entire thing was manageable. I made a piece to tally the votes on each webpage. I made a piece to get all the voting measures from a single year. I felt that Google search and StackOverflow were powering me through. I had successfully created a client solution while allowing myself to stay oblivious to any deeper CS concepts I had learned.
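As a rough sketch of what the per-page piece might have looked like (the URL, the 'Yea'/'Nay' markers, and the substring counting are hypothetical stand-ins, not the actual site's layout):

```python
import requests
from collections import Counter

def tally_votes(measure_url: str) -> Counter:
    """Fetch one voting-record page and tally the votes on it."""
    html = requests.get(measure_url, timeout=10).text
    tally = Counter()
    # Crude stand-in for real parsing: count the vote markers in the
    # page. The real piece matched the site's actual HTML structure.
    for marker in ("Yea", "Nay", "Absent"):
        tally[marker] = html.count(marker)
    return tally
```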


Profoundly wanting to prove the groan classes wrong, I powered on. Two more pieces had to be created. First, a bit needed to be made to get all the voting measures from a single voting session's web page. As much as I tried to avoid it, my solution to this part relied on regex, the subject of my first groan class. Regex is a pattern-matching tool applied to words or paragraphs. Looking to match all words that start with 'aab'? Use a pattern like \baab\w*. Pattern matching in regex allowed us to extract the voting-record web pages hidden within higher-level web pages. I had used something from an academic setting to create a client solution. It felt like a breakthrough in my CS career. I was ecstatic that the perseverance in my regex classes had paid off tangibly. So thrilled that I even Snapchatted my CS buddies in the excitement of regex's use.
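A minimal sketch of that extraction step, with a made-up link pattern standing in for the real site's URLs:

```python
import re

# Hypothetical pattern: suppose each session page links to its vote
# records with URLs like /rollcall/2021/hb1234/vote. The real pattern
# had to be reverse-engineered from the site's HTML.
VOTE_LINK = re.compile(r'href="(/rollcall/\d{4}/[a-z]{2}\d+/vote)"')

def extract_vote_links(session_html: str) -> list[str]:
    """Pull every voting-record URL out of a session's page."""
    return VOTE_LINK.findall(session_html)
```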




I completed all the parts of the program. It was time to give it a test run on a single year of web pages. I clicked run. Waited, waited, and watched an episode of The Office. Finally, the program had completed the year of web pages. Nearly 40k voting records had been transcribed into memory. Why did it take almost 25 minutes, though? Python, on average, can examine 500k rows of data in one second. At that rate, it should have taken only a tenth of a second to read and transcribe the 40k records. What was the problem? Well, retrieving data from local memory or disk is far faster than traversing the web. Each trip across the net to retrieve a voting record could take up to a second. At this point, my code would find a webpage, make a request, wait, wait, and then receive the record to write. If only somebody in the CS world had thought about this problem before. Maybe even a whole course could have been created about it?
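Concretely, the first version looped over the links one at a time, so the whole run was serialized behind the network:

```python
# Sequential version: each tally_votes() call blocks on its own web
# request, so the CPU idles for most of the ~25-minute run.
# Assumes tally_votes() and a vote_links list from the sketches above.
year_results = {}
for url in vote_links:          # roughly 40,000 records in one year
    year_results[url] = tally_votes(url)
```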



Computers complete billions of actions a second; waiting a second for a web request, from the computer's perspective, would be like mailing out a letter, waiting a million years, and then receiving a response. Thankfully, this could all be solved by a concept from the second 'groan' class, CSE 130: multithreading. The CPU can complete a million other things in that million-year wait. So why not send a million requests at once? To the human user, a million requests would look no different than a single request. The computer would be happy and efficient.
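Here is a minimal threaded version of the same loop using Python's standard library; the worker count is a guess, since the right number depends on how many simultaneous requests the site tolerates:

```python
from concurrent.futures import ThreadPoolExecutor

# Threaded version: while one thread waits on the network, the others
# fire off their own requests instead of idling.
# Assumes tally_votes() and vote_links from the sketches above.
with ThreadPoolExecutor(max_workers=32) as pool:
    year_results = dict(zip(vote_links, pool.map(tally_votes, vote_links)))
```

Since pool.map returns results in the same order as the input links, pairing them back up with zip stays simple.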





Multithreading was implemented, and a year of records was completed in one or two minutes. I felt like I had been struck by lightning twice. Regex and multithreading in the same project? Providing client value? Not on a toy data set? Again, in my excitement, I snapped this rarity to my CS buddies.



As a result, we’ve been able to give the client a central source for their research. Before, they had volunteers combing through physical copies of these records. I’m glad we got them up to speed with a concise, searchable data solution. They’ve been able to retire the process of physically digging through logs, which is a huge win for everybody involved.



They can now use this opposition research to guide political strategy at a high level, identifying the best messaging with which to reach persuadable voters to flip state legislatures in top swing states. The response from Focus Action Network was that “the people we showed this info to absolutely freaked out at how awesome it was…they were legitimately awed”.


Implementing tools from university was a true blessing. Watching the time to create our records drop from an unknown duration, to weeks, to hours, to a couple of minutes per year of records was satisfying. Maybe I will apologize for the groans I gave during various CS courses. For now, I will stay happy and content with the time saved in our data infrastructure and the progressive causes it supports.



 

About the Author

Brendan Schierloh (he/him) is a Bluebonnet Data Fellow of the 2022 cohort. Most recently, he interned as a Product Analyst for the San Francisco Federal Reserve. Brendan attends the University of California, Santa Cruz, with an expected graduation date of December 2022 in Computer Science. Prior to university, Brendan enlisted for five years in the United States Marine Corps as an avionics tech. He is skilled in AWS/GCP, data engineering, and data analysis. Brendan enjoys reading, playing and watching soccer (Chelsea), and Formula 1 (Team LH). He can discuss Formula 1 for hours and has even built a couple of datasets to back up his ideas! Brendan is seeking data engineering or analyst roles after graduation; feel free to reach him on LinkedIn.



If you like what you’ve read and want to learn more, you can reach out at info@bluebonnetdata.org. Or, if you're interested in doing similar work, apply to be a Data Fellow!

Follow Bluebonnet Data on: Twitter | Instagram | Youtube | Facebook | LinkedIn


