Advanced Group Bys
Group Bys are one of the most important tools to transforming data. I would go as far as to say that group by may be the most important function in dplyr
. I might say that, but honestly I would be exaggerating. Luckily, no one is going to read this intro so it does not matter.
Description | R | Python |
---|---|---|
Simple group by | group_by("grp") | df.groupby(['grp']).sum() |
One summary column | group_by("grp") | df.groupby(['grp'])['col2'].sum() |
Two summary column | group_by("grp") | df.groupby(['grp'])[['col', 'col2']].sum() |
Two grouping columns | group_by("grp", "grp2") | df.groupby(['grp', 'grp2'])['col'].sum() |
Reset Index | Not Needed | grouped.reset_index() |
Basic Group Bys
We are gonna hit this rapid fire since there isn't too much to discuss here. First load that Pokemon dataset from github
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
First we are going to take the simplest of group bys. If we don't specify what columns we want to summarize, than it will summarize all the numeric columns:
# group everything
df.groupby(["Type 1"]).sum()
# ... Legendary
Type 1 ...
Bug 23080 ... 0
Dark 14302 ... 2
Dragon 15180 ... 12
Electric 15994 ... 4
...
Next we will specify a column to summarize:
# group by one column
df.groupby(["Type 1"])["Attack"].sum()
Type 1
Bug 4897
Dark 2740
Dragon 3588
Electric 3040
...
Two Summary Columns
What if we want to summarize two columns, like attack and defense. We need to put both of those columns in double brackets:
# group by two columns
df.groupby(["Type 1"])[["Attack", "Defense"]].sum()
Attack Defense
Type 1
Bug 4897 4880
Dark 2740 2177
Dragon 3588 2764
Electric 3040 2917
...
Two Grouping Columns
How about two grouping columns:
# group by two grouping columns
df.groupby(["Type 1", "Type 2"])[["Attack"]].sum()
Attack
Type 1 Type 2
Bug Electric 124
Fighting 310
Fire 145
Flying 982
Ghost 90
... ...
Water Ice 250
Poison 205
Psychic 365
Rock 331
Steel 86
...
Reset the index!
You may have noticed that this is no longer in a format that you can apply more pandas
operations on. You have to reset the index for that (this is one of the many reasons that I prefer R).
# reset index so this is useful
df.groupby(["Type 1", "Type 2"])[["Attack"]].sum().reset_index()
Type 1 Type 2 Attack
0 Bug Electric 124
1 Bug Fighting 310
2 Bug Fire 145
3 Bug Flying 982
4 Bug Ghost 90
.. ... ... ...
131 Water Ice 250
...