Pandas Time series with Election Data
Exploratory Data Analysis use cases
For a lecture on Python tools for data scientists, I prepared few data used cases, to demonstrate pandas,numpy and matplotlib functionality.
In this tutorial, US Election data is used. The data is published by the Federal Election Commission and can be downloaded here.
The election data website, contain many data sets, with election information. Data about Candidates, Committees, fundraising transactions, contributors information, etc.
Candidate data
Download Form2Filer_2018.csv, from election data website.
df_cand = pd.read_csv('Form2Filer_2018.csv')
df_cand.head()
Filter Candidate data
Let's focus only on the 2018 election to the Congress. Meaning, we select only the data with the following features.
election_year_2018 = (df_cand['ELECTION_YEAR'] == 2018 )
congress_candidate = ~(df_cand['CANDIDATE_OFFICE'] == 'President')
df_cand = df_cand[election_year_2018 & congress_candidate]
Manage Missing values
Check missing values with info() pandas method.
df_cand.info()
CANDIDATE_ID 3041 non-null object
CANDIDATE_NAME 3041 non-null object
PARTY 3041 non-null object
PARTY_CODE 3036 non-null object
CANDIDATE_OFFICE 3041 non-null object
Number of candidates missing party values is small, so we delete them with drop
pandas function.
drop_cand = df_cand.PARTY_CODE.isnull() | (df_cand['PARTY']=='Unknown')
df_cand = df_cand.drop(df_cand[drop_cand].index)
Explore Party data
To visualize the number of candidates per party, we use pandas values_count()
method on PARTY_CODE
column, and plot it as a bar plot.
df_candidates.PARTY_CODE.value_counts().plot(kind=’bar’)
Lets bin all values of parties with small amount of candidates to one category, Other
. To do that, we build a list of the relevant parties names, then use apply
pandas function, to map all other party name values to Other
.
party_vals = df_cand.PARTY.value_counts()
big_party_names = [x for x in party_vals.index if party_vals[x]>30]
def map_parties(p):
if p not in big_party_names:
return 'Other'
return p
df_cand.PARTY = df_cand.PARTY.apply(map_parties)
To visualize candidate distribution, per party and per office, group by party, and count in each group, number of Senate candidates. Then unstack the result, to flip the table, so party column, becomes feature labels. To plot “transformed” bar plot, we use kind=barh
parameter value.
df_cand.groupby('PARTY').CANDIDATE_OFFICE.value_counts().unstack(0).plot(kind='barh')
For simplicity, delete all candidates with PARTY different than Republican or Democrat.
Contribution data
Contribution data contains all transaction related to election fundraising. Data is separated to contribution from committees and contribution from individuals and organizations. We will use the Contributions by individuals data files.
Data files are big, so for this demonstration, I sampled from each file 1% of its content, and concatenate it into one data frame, that can be downloaded from the git.
- Raw files here: https://www.fec.gov/data/advanced/?tab=bulk-data
- The header for those files is kept in a separate csv file. Feature description here.
We will focus on the following features, Data and Amount of transaction.
sam_dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d')
df_contr = pd.read_csv('sampled_itcont_2018.csv'),
parse_dates=['TRANSACTION_DT'], date_parser=sam_dateparse)
Committee data
Contribution data is organized by committee, so we need to link each committee to the suitable candidate. The file contains a lot of information, we will read only the id columns.
df_committee = pd.read_csv('committee_summary_2018.csv',
usecols=['CMTE_ID', 'CAND_ID'])
Merge contributor data with committee data to associate each contribution with candidate and party. Read more.
df_merge = pd.merge(df_contr, df_committee, on=’CMTE_ID’)
df_merge.head()
Merge the result with candidate data, to link the contribution to a party. The candidate id feature, has different column names in both data frames, so we need to use left_on
and right_on
merge parameters to state those column names.
df_merge = pd.merge(df_merge, df_cand.set_index('CANDIDATE_ID'),
left_on='CAND_ID', right_index=True)
We can ask many questions on the data, like:
- What countries contribute the biggest contribution ?
df_big_cont=df_merge[df_merge.Amount>7000]
ax = df_big_cont.groupby(['PARTY', 'STATE']).Amount.sum().unstack(0).plot.barh(figsize=(15,7))
- Is employer correlated with Party ? Plot transaction sum per party of the 10 largest values in Employer column.
df_big_cont=df_merge[df_merge.Amount>7000]
ax = df_big_cont.groupby(['EMPLOYER', 'PARTY']).Amount.sum().unstack(0)[0:10].plot.barh()
Explore Contribution Amount Time series
To plot daily contribution amount time series, Group contribution by date, and sum.
for n in df_merge.PARTY.unique():
df_merge[df_merge['PARTY']==n].groupby('Date').Amount.sum().plot()
Visualize Trend
To visualize trend, we plot the rolling mean of the transaction amount, summed daily.
df_merge[df_merge.PARTY_CODE=='DEM'].groupby('Date').Amount.sum().rolling(8).mean().plot()
df_merge[df_merge.PARTY_CODE=='REP'].groupby('Date').Amount.sum().rolling(8).mean().plot()
Visualize Seasonality — First difference
To visualize seasonality, we remove the trend using Pandas method to compute and plot the first order difference of the contribution amount.
for n in df_merge.PARTY.unique():
df_merge[df_merge['PARTY']==n].groupby('Date').Amount.sum().diff().plot()
Next Steps
Many aggregation and exploration can be done on this data.
An interesting project will be to download historical contribution data from previous election cycle, and look for features correlated with winning the elections, to try and predict, the current winners in the upcoming election.
Full source on my Github.