Pandas Time series with Election Data

4 min readOct 28, 2018

Exploratory Data Analysis use cases

For a lecture on Python tools for data scientists, I prepared few data used cases, to demonstrate pandas,numpy and matplotlib functionality.
In this tutorial, US Election data is used. The data is published by the Federal Election Commission and can be downloaded here.
The election data website, contain many data sets, with election information. Data about Candidates, Committees, fundraising transactions, contributors information, etc.

Candidate data

Download Form2Filer_2018.csv, from election data website.

df_cand = pd.read_csv('Form2Filer_2018.csv')
df_cand.head()

Filter Candidate data

Let's focus only on the 2018 election to the Congress. Meaning, we select only the data with the following features.

election_year_2018 = (df_cand['ELECTION_YEAR'] == 2018 )
congress_candidate = ~(df_cand['CANDIDATE_OFFICE'] == 'President')
df_cand = df_cand[election_year_2018 & congress_candidate]

Manage Missing values

Check missing values with info() pandas method.

df_cand.info()
CANDIDATE_ID        3041 non-null object
CANDIDATE_NAME      3041 non-null object
PARTY               3041 non-null object
PARTY_CODE          3036 non-null object
CANDIDATE_OFFICE    3041 non-null object

Number of candidates missing party values is small, so we delete them with drop pandas function.

drop_cand = df_cand.PARTY_CODE.isnull() | (df_cand['PARTY']=='Unknown')
df_cand = df_cand.drop(df_cand[drop_cand].index)

Explore Party data

To visualize the number of candidates per party, we use pandas values_count() method on PARTY_CODE column, and plot it as a bar plot.

df_candidates.PARTY_CODE.value_counts().plot(kind=’bar’)

Lets bin all values of parties with small amount of candidates to one category, Other. To do that, we build a list of the relevant parties names, then use apply pandas function, to map all other party name values to Other.

party_vals = df_cand.PARTY.value_counts()
big_party_names = [x for x in party_vals.index if party_vals[x]>30]
def map_parties(p):
    if p not in big_party_names:
        return 'Other'
    return p
df_cand.PARTY = df_cand.PARTY.apply(map_parties)

To visualize candidate distribution, per party and per office, group by party, and count in each group, number of Senate candidates. Then unstack the result, to flip the table, so party column, becomes feature labels. To plot “transformed” bar plot, we use kind=barh parameter value.

df_cand.groupby('PARTY').CANDIDATE_OFFICE.value_counts().unstack(0).plot(kind='barh')

For simplicity, delete all candidates with PARTY different than Republican or Democrat.

Contribution data

Contribution data contains all transaction related to election fundraising. Data is separated to contribution from committees and contribution from individuals and organizations. We will use the Contributions by individuals data files.

Data files are big, so for this demonstration, I sampled from each file 1% of its content, and concatenate it into one data frame, that can be downloaded from the git.

Raw files here: https://www.fec.gov/data/advanced/?tab=bulk-data
The header for those files is kept in a separate csv file. Feature description here.

We will focus on the following features, Data and Amount of transaction.

sam_dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
df_contr = pd.read_csv('sampled_itcont_2018.csv'),
                      parse_dates=['TRANSACTION_DT'], date_parser=sam_dateparse)

Committee data

Contribution data is organized by committee, so we need to link each committee to the suitable candidate. The file contains a lot of information, we will read only the id columns.

df_committee = pd.read_csv('committee_summary_2018.csv',
                           usecols=['CMTE_ID', 'CAND_ID'])

Merge contributor data with committee data to associate each contribution with candidate and party. Read more.

df_merge = pd.merge(df_contr, df_committee, on=’CMTE_ID’)
df_merge.head()

Merge the result with candidate data, to link the contribution to a party. The candidate id feature, has different column names in both data frames, so we need to use left_on and right_onmerge parameters to state those column names.

df_merge = pd.merge(df_merge, df_cand.set_index('CANDIDATE_ID'), 
                    left_on='CAND_ID', right_index=True)

We can ask many questions on the data, like:

What countries contribute the biggest contribution ?

df_big_cont=df_merge[df_merge.Amount>7000]
ax = df_big_cont.groupby(['PARTY', 'STATE']).Amount.sum().unstack(0).plot.barh(figsize=(15,7))

Is employer correlated with Party ? Plot transaction sum per party of the 10 largest values in Employer column.

df_big_cont=df_merge[df_merge.Amount>7000]
ax = df_big_cont.groupby(['EMPLOYER',    'PARTY']).Amount.sum().unstack(0)[0:10].plot.barh()

Explore Contribution Amount Time series

To plot daily contribution amount time series, Group contribution by date, and sum.

for n in df_merge.PARTY.unique():
    df_merge[df_merge['PARTY']==n].groupby('Date').Amount.sum().plot()

Visualize Trend

To visualize trend, we plot the rolling mean of the transaction amount, summed daily.

df_merge[df_merge.PARTY_CODE=='DEM'].groupby('Date').Amount.sum().rolling(8).mean().plot()
df_merge[df_merge.PARTY_CODE=='REP'].groupby('Date').Amount.sum().rolling(8).mean().plot()

Visualize Seasonality — First difference

To visualize seasonality, we remove the trend using Pandas method to compute and plot the first order difference of the contribution amount.

for n in df_merge.PARTY.unique():
    df_merge[df_merge['PARTY']==n].groupby('Date').Amount.sum().diff().plot()

Next Steps

Many aggregation and exploration can be done on this data.
An interesting project will be to download historical contribution data from previous election cycle, and look for features correlated with winning the elections, to try and predict, the current winners in the upcoming election.

Full source on my Github.