Advanced Group Bys

Group Bys are one of the most important tools to transforming data. I would go as far as to say that group by may be the most important function in dplyr. I might say that, but honestly I would be exaggerating. Luckily, no one is going to read this intro so it does not matter.

DescriptionRPython
Simple group bygroup_by("grp")df.groupby(['grp']).sum()
One summary column group_by("grp")df.groupby(['grp'])['col2'].sum()
Two summary columngroup_by("grp")df.groupby(['grp'])[['col', 'col2']].sum()
Two grouping columnsgroup_by("grp", "grp2")df.groupby(['grp', 'grp2'])['col'].sum()
Reset IndexNot Neededgrouped.reset_index()

Basic Group Bys

We are gonna hit this rapid fire since there isn't too much to discuss here. First load that Pokemon dataset from github

df = pd.read_csv("http://www.corydonbaylor.io/pokemon")

First we are going to take the simplest of group bys. If we don't specify what columns we want to summarize, than it will summarize all the numeric columns:

# group everything
df.groupby(["Type 1"]).sum()
              #  ... Legendary
Type 1           ...          
Bug       23080  ...         0
Dark      14302  ...         2
Dragon    15180  ...        12
Electric  15994  ...         4
...

Next we will specify a column to summarize:

# group by one column
df.groupby(["Type 1"])["Attack"].sum()
Type 1
Bug         4897
Dark        2740
Dragon      3588
Electric    3040
...

Two Summary Columns

What if we want to summarize two columns, like attack and defense. We need to put both of those columns in double brackets:

# group by two columns
df.groupby(["Type 1"])[["Attack", "Defense"]].sum()
         Attack  Defense
Type 1                   
Bug         4897     4880
Dark        2740     2177
Dragon      3588     2764
Electric    3040     2917
...

Two Grouping Columns

How about two grouping columns:

# group by two grouping columns
df.groupby(["Type 1", "Type 2"])[["Attack"]].sum()
                 Attack
Type 1 Type 2          
Bug    Electric     124
       Fighting     310
       Fire         145
       Flying       982
       Ghost         90
...                 ...
Water  Ice          250
       Poison       205
       Psychic      365
       Rock         331
       Steel         86
...

Reset the index!

You may have noticed that this is no longer in a format that you can apply more pandas operations on. You have to reset the index for that (this is one of the many reasons that I prefer R).

# reset index so this is useful
df.groupby(["Type 1", "Type 2"])[["Attack"]].sum().reset_index()
    Type 1    Type 2  Attack
0      Bug  Electric     124
1      Bug  Fighting     310
2      Bug      Fire     145
3      Bug    Flying     982
4      Bug     Ghost      90
..     ...       ...     ...
131  Water       Ice     250
...