The Problem: Non-existent Voter Turnout Scores
When hitting the campaign trail, the most fundamental question a campaign needs to answer is: who is going to vote in my election? Fortunately, for many elections, the Democratic National Committee (DNC) has provided a turnout score (probability of voting) for each voter in VoteBuilder. The DNC assigns these scores using their own proprietary model. But what if there weren’t any turnout scores for your election in VoteBuilder?
My team faced that exact challenge while volunteering with a 2021 campaign for Boston City Council. Consequently, we had to find another way to help prioritize our campaign’s voter outreach strategy. This was especially important because this was an “off-year” municipal election, and was therefore expected to have very low turnout. Whereas the 2020 General Election saw a turnout rate around 67%, this election was expected to have turnout around 33%.
If our campaign talked to every potential voter to persuade them toward our candidate, 2 in 3 conversations they had would be with people who never ended up voting. While reaching out to people who don’t often vote is essential to promote long-term civic & electoral engagement, in this election our campaign had to target our resources carefully, and trying to increase turnout in local elections like this one typically has minimal returns. Therefore, generating a turnout model for our campaign without the DNC's turnout score was essential.
Though we could have compiled a number of VoteBuilder lists using a variety of complicated filters, we chose to pursue a more comprehensive process of creating our own turnout score model. This allowed us to use voter features like age, voting history, etc. to predict voter turnout. We decided to use logistic regression, which is a probabilistic classification model. Our model gave us the probability a person would vote, instead of just predicting a binary “voter” or “non-voter” label for that person. Knowing voter probability enabled us to widen or narrow our target voter pool by lowering or raising our “threshold value.” For example, we could decide that we only want to reach out to voters with a 75% or greater probability of voting, in order to efficiently use our campaign’s resources.
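The thresholding idea can be sketched with scikit-learn's LogisticRegression. The data and feature values below are synthetic stand-ins, not our campaign's actual dataset; the point is that `predict_proba` returns a probability we can cut at any threshold we choose.

```python
# Minimal sketch of probabilistic classification with a threshold.
# The features and labels here are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features, e.g. age and count of past elections voted in
X = rng.normal(size=(1000, 2))
# Synthetic labels just so the model has something to fit
y = (X[:, 0] + X[:, 1] + rng.normal(size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# predict_proba gives P(vote) rather than a hard "voter"/"non-voter" label
p_vote = model.predict_proba(X)[:, 1]

# Widen or narrow the target pool by moving the threshold
likely_voters_50 = p_vote >= 0.50
likely_voters_75 = p_vote >= 0.75
```

Raising the threshold from 50% to 75% can only shrink the targeted pool, which is exactly the lever a resource-constrained campaign needs.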
With this model, we helped our campaign focus on just those voters with the greatest likelihood of turning out in November, saving valuable time and resources along the way. When we split our data into training and test sets, the model had ~78% overall accuracy. For reference, turnout in the 2021 Boston municipal election was only ~33%. Without targeting, a list of randomly chosen voters would have included actual voters only about 33% of the time; of the people our model flagged as likely voters, about 77% actually voted. In other words, our model helped our campaign reach voters more than twice as effectively.
The Solution: Building Voter Turnout Scores from Scratch
To train our logistic regression model, we used voter characteristics (or “features”) to predict turnout in a municipal election. The features we experimented with included age, gender, ethnicity, and voting history. To expand our training data, we used data from the 2017, 2013, and 2009 municipal elections. This meant we had to adjust some of our voter features (like age and voting history) based on which municipal election turnout we were predicting.
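As a rough sketch of that adjustment step, here is one hypothetical way to recompute age and voting history relative to each past election before stacking the rows into a single training set. The column names and toy records are illustrative assumptions, not our actual VoteBuilder export.

```python
# Hypothetical sketch: recompute features relative to each past election,
# then stack the rows into one training set. Columns are assumptions.
import pandas as pd

voters = pd.DataFrame({
    "voter_id": [1, 2, 3],
    "birth_year": [1950, 1985, 1999],
    # Years in which each voter cast a ballot
    "elections_voted": [[2008, 2009, 2012, 2013, 2016, 2017],
                        [2012, 2016], [2016]],
})

rows = []
for election_year in (2009, 2013, 2017):
    df = voters.copy()
    # Age as of the election being predicted
    df["age"] = election_year - df["birth_year"]
    # Voting history strictly before that election
    df["prior_votes"] = df["elections_voted"].apply(
        lambda yrs: sum(yr < election_year for yr in yrs))
    # Label: did the voter turn out in this municipal election?
    df["voted"] = df["elections_voted"].apply(
        lambda yrs: int(election_year in yrs))
    rows.append(df[["voter_id", "age", "prior_votes", "voted"]])

train = pd.concat(rows, ignore_index=True)
```

Each voter contributes one row per historical election, with features frozen as they stood at that point in time, which is what keeps the model from peeking at the future.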
For our election, we used a logistic regression model to predict voter turnout for voters with 6+ years of voting history. Newer voters didn't have enough voting history to include in the model, so for them we pulled lists in VoteBuilder using our own criteria, such as having voted in the last general election or having registered to vote within the last year. However, the methodology described here is just one way to approach this problem. Depending on the election you are focusing on, you may find that different timeframes and voter characteristics are more helpful to you.
We found that the two main features that allowed us to predict voter turnout were voting history and age. Although our methodology differs, we took inspiration from Bloomberg’s 2020 turnout methodology. The fact that we ended up using voting history is consistent with Bloomberg’s finding that “…historical turnout in each county is the best predictor of future turnout, which aligns with the well-established finding that voting is habitual.” Additionally, Bloomberg included age in their model due to its high correlation with voting. They supplemented these features with additional metrics such as the competitiveness of the election and local voting laws. With these features, Bloomberg built a number of tree-based models to predict turnout, whereas we used logistic regression. For our local election, we found that a logistic regression model using voting history and age was sufficient to achieve the level of accuracy our campaign was looking for.
The full Jupyter Notebook can be found here.
The Results
In our regression results, we found that the most likely voter to turn out is in their 60s, voted in the last two presidential elections and the midterm election in between, and voted in the prior three municipal/off-cycle elections. This made intuitive sense and helped us gut-check our model’s performance. Of course, although these characteristics might be the strongest predictors of voting in our model, most voters do not have all of them; voters who do are pretty rare. Plus, your model might say that a 60-year-old is the most likely to vote, but if your district is near a university town (like ours was), most of your voters might actually be in their 20s. It is therefore critical to set your thresholds properly so you target enough voters to win your election and avoid exclusively focusing on a small group of extremely likely voters.
When we tested our model on our test data, we found that our overall accuracy was 78% when using a 50% threshold value. We correctly identified 3,421 voters who did not turn out and 2,570 voters who did. On the other hand, we mistakenly predicted that 771 voters would vote who did not, and we missed 892 voters who did. Given that only ~30% of voters turn out in Boston’s municipal elections, our campaign was satisfied with a 78% accuracy rate.
Of the 3,341 (2,570 + 771) people in our test set that we identified as voters, 77% were actually voters (2,570 / 3,341).
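These figures can be recomputed directly from the confusion-matrix counts above:

```python
# Recomputing the reported metrics from the counts given in the text.
tn, tp = 3421, 2570   # correct non-voter / voter predictions
fp, fn = 771, 892     # predicted voters who stayed home / voters we missed

accuracy = (tn + tp) / (tn + tp + fp + fn)
precision = tp / (tp + fp)

print(f"accuracy:  {accuracy:.0%}")   # 78%
print(f"precision: {precision:.0%}")  # 77%
```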
Depending on your team’s resources, this might be sufficient. However, many campaigns have limited resources and might have a target number of voters they think it is possible to reach. By raising your threshold value, you can focus on just those voters who are most likely to turn out. The tradeoff is that you will be targeting a smaller group, so of course you will miss more voters. This is a tradeoff that you will need to consider carefully. When we raised our threshold value to 75%, we identified 1,020 people in our test set as voters. Within this group, 91% were actually voters.
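The tradeoff can be illustrated with a quick simulation. Here, synthetic roughly-calibrated probabilities stand in for real model scores; raising the cutoff shrinks the targeted pool while raising the share of actual voters within it.

```python
# Illustrating the threshold tradeoff with synthetic scores.
# p_vote and y_true are stand-ins for real model output and turnout.
import numpy as np

rng = np.random.default_rng(42)
p_vote = rng.uniform(size=5000)
# Synthetic "truth": each person votes with probability p_vote
y_true = rng.uniform(size=5000) < p_vote

for threshold in (0.50, 0.75):
    targeted = p_vote >= threshold
    share_voters = y_true[targeted].mean()
    print(f"threshold {threshold:.0%}: {targeted.sum()} targeted, "
          f"{share_voters:.0%} actually vote")
```

With well-calibrated scores, the higher threshold always yields a smaller but purer target pool, which is exactly the tradeoff described above.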
Once our model was trained and tested, we used it to assign turnout scores to voters for the 2021 municipal election. Then, based on our campaign’s resources, we used various thresholds to group our voters and prioritize outreach accordingly. We could then visualize some key statistics about our likely voters. For instance, we found that likely voters in our election tended to be between 55 and 80 years old, Black, and women, and were predominantly located in the 12th and 9th wards.
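As one hypothetical sketch of that grouping step, pandas’ `cut` makes it easy to bucket scores into outreach tiers. The cutoffs and tier names below are illustrative assumptions, not the campaign’s actual ones.

```python
# Hypothetical sketch: bucket turnout scores into outreach tiers.
# Cutoffs and tier labels are assumptions for illustration.
import pandas as pd

scores = pd.DataFrame({
    "voter_id": [101, 102, 103, 104],
    "p_vote": [0.92, 0.64, 0.38, 0.81],
})

scores["tier"] = pd.cut(
    scores["p_vote"],
    bins=[0.0, 0.5, 0.75, 1.0],
    labels=["low priority", "persuadable", "top priority"],
)
```

The resulting `tier` column can then drive walk lists or phone-bank queues, with each tier worked in priority order as resources allow.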
Applying voter data to predict voter turnout
In conclusion, we created a logistic regression model to predict voter turnout scores for an upcoming election. To expand our training data, we included voter turnout and voter characteristics for three comparable municipal elections (2017, 2013, and 2009). We experimented with various voter characteristics from VoteBuilder, and after some exploratory analysis we landed on using age buckets and voting history as our independent variables. On test data, our model had ~78% accuracy.
Once the model was created, we applied it to our 2021 voter data to predict turnout in the 2021 municipal election. By setting various likelihood thresholds, we could then group our voters and help our campaign prioritize outreach accordingly. Like many of the campaigns Bluebonnet supports, our campaign participated in a down-ballot, local election and therefore did not have the resources of a larger campaign. Without the DNC’s turnout score, our campaign needed an alternative way to make sure they spent their time and money most effectively. While humans are unpredictable, this model did a very decent job of helping our campaign identify the voters they needed to reach.
About the Author
Anne Bode (she/her) was a Data Fellow for Angie Camacho’s campaign for Boston City Council in District 7. She is currently finishing her MSc in Business Analytics at Imperial College London and is a data scientist at Revel in NYC.
If you like what you’ve read and want to learn more, you can reach out at info@bluebonnetdata.org. Or, if you're interested in doing similar work, apply to be a Data Fellow!