Pandas GroupBy explained Step by Step

Naomi Fridman
4 min readNov 5, 2018

Group by is very useful pandas dataframe functions. This short article, explains the methodology, the output and various options and twiks.

Pandas GroupBy explained Step by Step

Group By: split-apply-combine

in many situations we want to split the data set into groups and do something with those groups. For example, we want to count number of survivors on the titanic, but we want to count male and female separately. For This is called “group by” process. Formally, by “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria.
  • Applying a function to each group independently.
  • Combining the results into a data structure, usually into DataFrame, or column which is Series.

Out of these, the split step is the most straightforward, usually we split by the different values of some feature.
In the apply step, we might wish to one of the following:

  • Aggregation functions like sum or count, statistical functions like mean or std and custom functions.
  • Transformation: perform some group-specific computations and return a “list” of values — like-indexed object. For example to Filling Nan’s within specific value to each group.
  • Filtration discard some groups. For example, discard groups with only a few members. Filter outliers based on the group sum or mean.
    Lets View examples for the above cases.As demonstration data, we use Titanic data. Download from kaggle, or use sample example from git.

Source:

Source code on my git.

Read CSV file

Read only the following features:

  • Survived: Survival (0 = no; 1 = yes)
  • Pclass: Passenger class (1 = first; 2 = second; 3 = third)
  • Sex: Sex
  • Age: Age
  • Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Group By: Splitting data into groups — groupby

With pandas, we could naturally group by columns values. For that we call:

groupby() function returns a GroupBy object.

print(type(df.groupby('Sex').mean()))
df.groupby('Sex').mean()

To resume sex as a column,use reset_index.

df.groupby('Sex').sum().reset_index().head()

We can plot the returned data frame.

df.groupby('Sex').sum().plot(kind='bar');

The split-apply return data frame

In the returned data frame:

  • the index are the values of the column by which we made the groupby.

We can access specific values in the returned data frame.

print('grouped.mean():', type(grouped.mean()))
print('grouped.mean().columns:',grouped.mean().columns)
print('grouped.mean().columns.values: ', grouped.mean().columns.values)
print('grouped.mean().index: ', grouped.mean().index)
print('grouped.mean()[\'Age\'][\'female\']:',grouped.mean()['Age']['female'])
out:
grouped.mean(): <class 'pandas.core.frame.DataFrame'>
grouped.mean().columns: Index(['Survived', 'Pclass', 'Age', 'SibSp'], dtype='object')
grouped.mean().columns.values: ['Survived' 'Pclass' 'Age' 'SibSp']
grouped.mean().index: Index(['female', 'male'], dtype='object', name='Sex')
grouped.mean()['Age']['female']: 27.915708812260537

impute missing age values with the groupby mean

Checking null values with info()command,show that Age 20% Nan values. To impute missing age values with the mean value specific to each Sex, do the following:

g_mean = df.groupby('Sex').mean()
df.loc[df.Age.isnull() & (df.Sex == 'female'),'Age'] = g_mean['Age']['female']
df.loc[df.Age.isnull() & (df.Sex == 'male'), 'Age'] = g_mean['Age']['male']

Group by more then one column

Lets view surviving rate per Sex. To to that,we groupby on Sex and Survive columns.

df.groupby(['Sex', 'Survived'] )['Survived'].count()Sex     Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64

We will get a data frame with 4 raws, each for any combination of the columns value. The index is a multi index of the combination of the unique values of the grouped by columns. The data frame has one column, with the count of rows, with those values.

df.groupby(['Sex', 'Survived'] )['Survived'].count().plot.bar(figsize=(8, 6));

Flatten after groupby

This is not what we wanted, we want to see the inner distribution of survivors in each sex group. To do that, we will flatten the data frame, using unstack pandas method.

unstack method turns index values into column names. Very roughly we can say that it transpose and aggregate the data frame.

In multi indexing, the index column to unstack, is passed as parameter. Can be given by name, or by position, 0 is the default.

df.groupby([‘Sex’, ‘Survived’] )[‘Survived’].count().unstack(‘Sex’)Sex       female  male
Survived
0 81 468
1 233 109

We can use index for the unstack column

df.groupby(['Sex', 'Survived'] )['Survived'].count().unstack(1)Survived    0    1
Sex
female 81 233
male 468 109

Now we can visualize survivors distribution, by sex group.

df_grouped.unstack(0).plot.bar(stacked=True, figsize=(8, 6));

Plot few other groupby count operations

df.groupby(['Embarked', 'Survived'] )['Survived'].count().unstack(1).plot.bar(figsize=(8, 6));
df.groupby(['Embarked', 'Pclass'] )['Pclass'].count().unstack(1).plot.bar(figsize=(8, 6));

--

--

Naomi Fridman

MSc. Mathematics. Data Scientist. Love Deep learning ,Machine learning , Mathematics and Surfing. https://github.com/naomifridman