Our forecast, taken as a whole, is a multilevel model (MLM). Each district receives its own forecast, with its own uncertainty (and win probability) calculated around that prediction.
For each district in the model, we make an assumption of approximate normality: we assume a symmetric, bell-shaped error distribution around each district's prediction. Specifically, we model the probabilities of the prediction error with the cumulative distribution function of a logistic distribution, which closely resembles the normal distribution but has slightly heavier tails. The logistic distribution curve we employ looks like the one on the left. Later, we explain how we derived the variance of the distribution.
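Translating a district's predicted margin into a win probability under this logistic error assumption can be sketched as follows. This is a minimal illustration, not the model's actual code; the function name and the `scale` parameter are placeholders (the text derives the actual variance later).

```python
import math

def win_probability(predicted_margin, scale):
    """Probability that the favored party wins, given a predicted
    vote margin (in points) and a logistic error distribution with
    the given scale parameter. The logistic CDF evaluated at the
    predicted margin gives P(actual margin > 0)."""
    return 1.0 / (1.0 + math.exp(-predicted_margin / scale))

# e.g. a district forecast of D+3 with a 4-point error scale
p = win_probability(3.0, 4.0)
```

A margin of exactly zero yields a 50% win probability, and the probabilities for margins of +x and -x sum to one, reflecting the symmetry of the logistic distribution.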
We also assume homoscedasticity: the prediction error for each district has the same variance and follows the same probability distribution.
We also assume that the districts' prediction errors, or observations, are not independent of one another, but rather are completely correlated, through the overall nationwide popular vote (which we derive from generic ballot polling). In other words, we do not treat each district's probabilities as independent. We do this to ensure that the model reflects an appropriate level of uncertainty. We could have used a Monte Carlo method that runs thousands of random simulations of independent error across every district, but that approach fails to take into account the heavy impact the national environment has on overall win probabilities. In 2016, models that operated on a strong assumption of independence between state-level polling errors (such as those by Sam Wang and the Huffington Post) performed worse than those that assumed stronger correlation between the errors (such as FiveThirtyEight's). Furthermore, an assumption of independence at the district level would make little sense given how little polling data is available so far for congressional races; for now, we rely instead on generic ballot polling and the structural characteristics of the districts.
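The correlated-error assumption above can be sketched in simulation form: rather than drawing an independent error for each district, every simulation draws a single nationwide error and shifts all districts by it. This is an illustrative sketch, not the model's actual implementation; the function name and `error_scale` parameter are hypothetical.

```python
import math
import random

def simulate_seats(district_leans, dem_national_margin, error_scale,
                   n_sims=10000):
    """Seat-count simulation with a fully correlated national error.
    Each simulation draws ONE nationwide polling error (sampled from
    a logistic distribution, matching the error model above) and
    shifts every district by the same amount, instead of drawing
    independent per-district errors. district_leans are Democratic
    margins relative to the national popular vote, in points."""
    seat_counts = []
    for _ in range(n_sims):
        # inverse-CDF sample from a logistic distribution: one shared
        # draw shifts the entire national environment at once
        u = min(max(random.random(), 1e-12), 1 - 1e-12)
        national_error = error_scale * math.log(u / (1.0 - u))
        margin = dem_national_margin + national_error
        seat_counts.append(sum(1 for lean in district_leans
                               if lean + margin > 0))
    return seat_counts
```

Because the error is shared, a bad national night for one party is bad everywhere at once, which widens the distribution of possible seat totals compared with independent per-district draws.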
To generate each congressional district’s prediction, or estimate, we factor in four variables. One is the binary variable of whether a seat is open or not; an open seat is a district whose incumbent is not running for re-election. The other three are metrics of the district’s partisan lean, or the strength of each party in that district, based on the last two presidential elections and the last congressional election. One measure of partisan lean is how the district voted in the 2012 presidential election relative to the nationwide popular vote. Another is how the district voted in the 2016 presidential election relative to the nationwide popular vote. The last is how the district voted in the 2016 congressional election relative to the nationwide popular vote.
We calculate our prediction for each congressional district from these measures of partisan lean, coupled with our estimate of the national popular vote. Here is how we weight each measure of partisan strength, depending on whether the seat has an incumbent running or not:

Open seats:
25% - 2016 congressional partisan lean
37.5% - 2016 presidential partisan lean
37.5% - 2012 presidential partisan lean

Seats with an incumbent running:
75% - 2016 congressional partisan lean
12.5% - 2016 presidential partisan lean
12.5% - 2012 presidential partisan lean
By weighting seats with incumbents differently than open seats, we account for the incumbent's advantage, or strength as a candidate. With every district factored in, the median district leans 8.9 points more Republican than the overall popular vote. The median district represents the 218th seat, the one needed to capture a majority. Thus, we estimate that Democrats would have to win the popular vote by approximately 8.9 points to win the House.
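The weighted blend above can be sketched as a single function. This is an illustrative sketch with a hypothetical function name; it assumes, per the incumbency discussion, that the weight set led by the 75% congressional weight applies to seats where an incumbent is running, since the prior congressional result reflects that incumbent's personal strength.

```python
def district_prediction(cong_2016, pres_2016, pres_2012,
                        national_margin, open_seat):
    """Blend the three partisan-lean measures (each a Democratic
    margin relative to the national popular vote, in points) using
    the incumbency-dependent weights, then add the estimated
    national popular vote margin to get the district forecast."""
    if open_seat:
        lean = 0.25 * cong_2016 + 0.375 * pres_2016 + 0.375 * pres_2012
    else:
        # incumbent running: the 2016 congressional result, which
        # featured this incumbent, is weighted most heavily
        lean = 0.75 * cong_2016 + 0.125 * pres_2016 + 0.125 * pres_2012
    return lean + national_margin
```

Because the weights in each set sum to one, a district whose three lean measures agree gets exactly that lean regardless of incumbency; the weights only matter when the measures diverge.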
The generic ballot refers to a survey that asks respondents whether they will vote for Democrats or Republicans for Congress. We use it to determine the national environment, and then calculate the forecast for every district relative to that environment, based on partisan lean and incumbency. Democrats have maintained a solid lead in the generic ballot ever since Trump's election. In generic ballot polling, however, there has been a fairly persistent overestimation of Democratic support across recent election cycles. This phenomenon is often overlooked in discussions of generic ballot polling, yet it is one that FiveThirtyEight has previously acknowledged. Consequently, we account for this consistent overestimation of Democratic support when relating November House election returns to the polling average in March and April.
Thus, we attempt to correct our national popular vote estimate for this bias. If we took the median of the polling error since 1998, we would apply a bias correction of 4.8 points. However, there is some evidence that the gap has narrowed recently: the 2012 outcome was actually approximately 4 points more Democratic than the March and April polling averages suggested. We therefore take the median of the last three election cycles (2012, 2014, and 2016), which yields a bias of 1.1 points.
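The bias correction amounts to subtracting the median historical polling error from the current polling average. A minimal sketch, with a hypothetical function name; the per-cycle error values are inputs, and only the 1.1-point median for 2012–2016 comes from the text:

```python
import statistics

def corrected_dem_margin(polling_average, recent_cycle_errors):
    """Adjust the generic-ballot polling average (Democratic margin,
    in points) for the historical tendency of March/April polls to
    overstate Democratic support. recent_cycle_errors holds the
    overstatement observed in each recent cycle (negative when polls
    understated Democrats, as in 2012); the median of the last three
    cycles works out to 1.1 points per the text."""
    bias = statistics.median(recent_cycle_errors)
    return polling_average - bias

# hypothetical example: a D+8 polling average with a 1.1-point
# median bias yields a corrected estimate of D+6.9
```

Using the median rather than the mean keeps a single unusual cycle, such as 2012's 4-point miss in the other direction, from dominating the correction.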
All of these elements come together to form a comprehensive House forecast.