Oh How I Will Miss the Tidyverse
I think the tidyverse is wonderful. I think it is easy, expressive, and legible. Not to dump on python, but I don't think the same is true for pandas. But regardless some people insist on using it. So if you are an avid R user, this guide is for you.
The dplyr
Equivalents
Description | R | Python |
---|---|---|
Select columns | select(x,y,z) | df[["x", "y", "z"]] |
Filter rows | filter(col == 'xyz') | df[df["col"]=='xyz'] |
New column | mutate(col = x + y + z) | df["col"] = df["x"] + df["y"] + df["z"] |
New conditional column | mutate(col = ifelse(cond, 1, 0)) | df["col"] = np.where(df["col"] == "cond", 1, 0]) |
Drop column | select(-col) | df.drop(columns=["col"]) |
Sort rows | arrange() | df.sort_values() |
Group By | group_by()%>%summarise() | df.groupby(["col"]).sum() |
The tidyr
Equivalents
Description | R | Python |
---|---|---|
Make wide | spread() | pd.pivot(df, index= "row", columns='cols', values='val') |
Make long | melt() | pd.melt(df, id_vars='x', value_vars=['y', 'z'], var_name='foo', value_name='bar') |
# lets start by loading library and data
import pandas as pd
import numpy as np
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
select()
: Selecting a Column
Selecting a column is mostly done through dictionaries in python.
df[["Name", "HP"]][0:3]
Name | HP | |
---|---|---|
0 | Bulbasaur | 45 |
1 | Ivysaur | 60 |
2 | Venusaur | 80 |
However, with great effort, you can get columns via indexing.
# create a list of the columns
cols = list(df.columns.values)
# index that list
# NOTE: if you are selecting a single column you have to surround it with brackets to coerce it into a list
df[cols[0:3] + [cols[-1]] + cols[3:12]]
# | Name | Type 1 | Legendary | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | False | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | False | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | False | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | False | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | False | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 719 | Diancie | Rock | True | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
796 | 719 | DiancieMega Diancie | Rock | True | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
797 | 720 | HoopaHoopa Confined | Psychic | True | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
798 | 720 | HoopaHoopa Unbound | Psychic | True | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 721 | Volcanion | Fire | True | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
800 rows × 13 columns
filter()
: Filtering Data
The first method for filtering data is to use df.loc[df[col]=='condition']
- put string conditions in quotes
- and numeric conditions without quotes
# one for a character filter
df[df["Type 1"] == "Grass"]
# one for a number filter
df[df["HP"] > 100]
# | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | 39 | Jigglypuff | Normal | Fairy | 115 | 45 | 20 | 45 | 25 | 20 | 1 | False |
45 | 40 | Wigglytuff | Normal | Fairy | 140 | 70 | 45 | 85 | 50 | 45 | 1 | False |
96 | 89 | Muk | Poison | NaN | 105 | 105 | 75 | 65 | 100 | 50 | 1 | False |
120 | 112 | Rhydon | Ground | Rock | 105 | 130 | 120 | 45 | 45 | 40 | 1 | False |
121 | 113 | Chansey | Normal | NaN | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
753 | 683 | Aromatisse | Fairy | NaN | 101 | 72 | 72 | 99 | 89 | 29 | 6 | False |
769 | 699 | Aurorus | Rock | Ice | 123 | 77 | 72 | 99 | 92 | 58 | 6 | False |
792 | 716 | Xerneas | Fairy | NaN | 126 | 131 | 95 | 131 | 98 | 99 | 6 | True |
793 | 717 | Yveltal | Dark | Flying | 126 | 131 | 95 | 131 | 98 | 99 | 6 | True |
794 | 718 | Zygarde50% Forme | Dragon | Ground | 108 | 100 | 121 | 81 | 95 | 95 | 6 | True |
67 rows × 12 columns
mutate()
: Creating New Columns
This is the equivalent of mutate()
in R.
The first method is more equivalent to df$new = df$old1 + df$old2
in R
# method 1
df["total"] = df["HP"] + df["Attack"] + df["Defense"] + df["Speed"] + df["Sp. Atk"] + df["Sp. Def"]
df.head(2)
# | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 318 |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 405 |
The iloc[]
Method to Sum Based on Interger Location
The second method using iloc[]
to use the interger locations of the columns to sum them all together
- Note 1: if you want all the rows, put
:
in the row positions iniloc[]
- Note 2: set axis = 1 to sum horizontally (0 for vertically)
- Note 3: remember that you want that the first index = 0 and the last index isnt included so for 5:9 in R you would do 4:10 in python
# note how i put the : in for the rows
df["total_iloc"]= df.iloc[:,4:10].sum(axis=1)
df.head(2)
# | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | total | total_iloc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | 318 | 318 |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | 405 | 405 |
Deleting a Column
This is the equivalent of select(-col)
in R
df = df.drop(columns=["total", "total_iloc"])
ifelse()
: to replace a column
This is the first time where I feel the need to dip into numpy
to save the day. The where()
clause works very similarly to ifelse()
in R. Importantly, you can set your default value to be original column.
df["new"] = np.where(df["Type 1"] == "Fire", "Flamer", df["Type 1"])
df
# | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False | Grass |
1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False | Grass |
2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False | Grass |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False | Grass |
4 | 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False | Flamer |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 719 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True | Rock |
796 | 719 | DiancieMega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True | Rock |
797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True | Psychic |
798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True | Psychic |
799 | 721 | Volcanion | Fire | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True | Flamer |
800 rows × 13 columns
arrange()
: Sorting Data
The equivalent for arrange()
is sort_values()
. You can sort by either ascending or descending by changing a boolean (by default it is true).
A note on booleans: in python, they are written out as True and False. You can also use numbers with True = 1 and False = 0.
# sorting numerics
df.sort_values("HP")
df.sort_values("HP", ascending=False)
# sorting characters
df.sort_values("Name")
# sorting by multiple things
df.sort_values(["Type 1", "Type 2", "HP"], ascending=[1,1,0])
# | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
657 | 596 | Galvantula | Bug | Electric | 70 | 77 | 60 | 97 | 60 | 108 | 5 | False | Bug |
656 | 595 | Joltik | Bug | Electric | 50 | 47 | 50 | 57 | 50 | 65 | 5 | False | Bug |
231 | 214 | Heracross | Bug | Fighting | 80 | 125 | 75 | 40 | 95 | 85 | 2 | False | Bug |
232 | 214 | HeracrossMega Heracross | Bug | Fighting | 80 | 185 | 115 | 40 | 105 | 75 | 2 | False | Bug |
698 | 637 | Volcarona | Bug | Fire | 85 | 60 | 65 | 135 | 105 | 100 | 5 | False | Bug |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
106 | 98 | Krabby | Water | NaN | 30 | 105 | 90 | 25 | 25 | 50 | 1 | False | Water |
125 | 116 | Horsea | Water | NaN | 30 | 40 | 70 | 70 | 25 | 60 | 1 | False | Water |
129 | 120 | Staryu | Water | NaN | 30 | 45 | 55 | 70 | 55 | 85 | 1 | False | Water |
139 | 129 | Magikarp | Water | NaN | 20 | 10 | 55 | 15 | 20 | 80 | 1 | False | Water |
381 | 349 | Feebas | Water | NaN | 20 | 15 | 20 | 10 | 55 | 80 | 3 | False | Water |
800 rows × 13 columns
group_by()%>%summarise()
: Aggregating Data
A common task in data manipulation is to aggregate data based on a particular variable or two. See below for how to do that for some common operations.
# to get the count
df.groupby(['Type 1']).count()
# to get the mean
df.groupby(['Type 1']).mean()
# to get the median
df.groupby(['Type 1']).median()
# | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|
Type 1 | |||||||||
Bug | 291.0 | 60.0 | 65.0 | 60.0 | 50.0 | 60.0 | 60.0 | 3.0 | 0.0 |
Dark | 509.0 | 65.0 | 88.0 | 70.0 | 65.0 | 65.0 | 70.0 | 5.0 | 0.0 |
Dragon | 443.5 | 80.0 | 113.5 | 90.0 | 105.0 | 90.0 | 90.0 | 4.0 | 0.0 |
Electric | 403.5 | 60.0 | 65.0 | 65.0 | 95.0 | 79.5 | 88.0 | 4.0 | 0.0 |
Fairy | 669.0 | 78.0 | 52.0 | 66.0 | 75.0 | 79.0 | 45.0 | 6.0 | 0.0 |
Fighting | 308.0 | 70.0 | 100.0 | 70.0 | 40.0 | 63.0 | 60.0 | 3.0 | 0.0 |
Fire | 289.5 | 70.0 | 84.5 | 64.0 | 85.0 | 67.5 | 78.5 | 3.0 | 0.0 |
Flying | 677.5 | 79.0 | 85.0 | 75.0 | 103.5 | 80.0 | 116.0 | 5.5 | 0.5 |
Ghost | 487.0 | 59.5 | 66.0 | 72.5 | 65.0 | 75.0 | 60.5 | 4.0 | 0.0 |
Grass | 372.0 | 65.5 | 70.0 | 66.0 | 75.0 | 66.0 | 58.5 | 3.5 | 0.0 |
Ground | 363.5 | 75.0 | 85.0 | 84.5 | 47.5 | 62.5 | 65.0 | 3.0 | 0.0 |
Ice | 371.5 | 70.0 | 67.0 | 75.0 | 77.5 | 70.0 | 62.0 | 3.0 | 0.0 |
Normal | 296.5 | 70.0 | 70.5 | 60.0 | 50.0 | 60.5 | 71.0 | 3.0 | 0.0 |
Poison | 139.5 | 67.5 | 74.0 | 67.0 | 60.0 | 60.5 | 62.5 | 1.5 | 0.0 |
Psychic | 386.0 | 68.0 | 57.0 | 65.0 | 95.0 | 90.0 | 80.0 | 3.0 | 0.0 |
Rock | 362.5 | 68.5 | 95.0 | 100.0 | 60.0 | 70.0 | 50.0 | 3.0 | 0.0 |
Steel | 379.0 | 60.0 | 89.0 | 120.0 | 55.0 | 80.0 | 50.0 | 3.0 | 0.0 |
Water | 275.0 | 70.0 | 72.0 | 70.0 | 70.0 | 65.0 | 65.0 | 3.0 | 0.0 |
A quirk that oftentimes has to be dealt with (if you are planning on doing further manipulations) is that whatever variable you grouped on becomes the index instead of another varaible. As such, you need to reset the index after grouping.
Note: you dont need to reassign the df when you reset the index. It does it automatically
# first create the grouped by df
grouped = df.groupby(['Type 1']).sum()
# then reset the index
grouped.reset_index(inplace = True)
# no need to reassign
grouped
Type 1 | # | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Bug | 23080 | 3925 | 4897 | 4880 | 3717 | 4471 | 4256 | 222 | 0 |
1 | Dark | 14302 | 2071 | 2740 | 2177 | 2314 | 2155 | 2361 | 125 | 2 |
2 | Dragon | 15180 | 2666 | 3588 | 2764 | 3099 | 2843 | 2657 | 124 | 12 |
3 | Electric | 15994 | 2631 | 3040 | 2917 | 3961 | 3243 | 3718 | 144 | 4 |
4 | Fairy | 7642 | 1260 | 1046 | 1117 | 1335 | 1440 | 826 | 70 | 1 |
5 | Fighting | 9824 | 1886 | 2613 | 1780 | 1434 | 1747 | 1784 | 91 | 0 |
6 | Fire | 17025 | 3635 | 4408 | 3524 | 4627 | 3755 | 3871 | 167 | 5 |
7 | Flying | 2711 | 283 | 315 | 265 | 377 | 290 | 410 | 22 | 2 |
8 | Ghost | 15568 | 2062 | 2361 | 2598 | 2539 | 2447 | 2059 | 134 | 2 |
9 | Grass | 24141 | 4709 | 5125 | 4956 | 5425 | 4930 | 4335 | 235 | 3 |
10 | Ground | 11401 | 2361 | 3064 | 2715 | 1807 | 2008 | 2045 | 101 | 4 |
11 | Ice | 10165 | 1728 | 1746 | 1714 | 1861 | 1831 | 1523 | 85 | 2 |
12 | Normal | 31279 | 7573 | 7200 | 5865 | 5470 | 6245 | 7012 | 299 | 2 |
13 | Poison | 7050 | 1883 | 2091 | 1927 | 1692 | 1803 | 1780 | 71 | 0 |
14 | Psychic | 21706 | 4026 | 4073 | 3858 | 5609 | 4918 | 4645 | 193 | 14 |
15 | Rock | 17280 | 2876 | 4086 | 4435 | 2787 | 3321 | 2460 | 152 | 4 |
16 | Steel | 11957 | 1761 | 2503 | 3412 | 1823 | 2177 | 1492 | 104 | 4 |
17 | Water | 33946 | 8071 | 8305 | 8170 | 8379 | 7898 | 7388 | 320 | 4 |
# you can also do this for multiple groups
grouped = df.groupby(["Type 1", "Type 2"]).mean()
grouped.reset_index(inplace = True)
grouped
Type 1 | Type 2 | # | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Bug | Electric | 595.500000 | 60.000000 | 62.000000 | 55.000000 | 77.000000 | 55.000000 | 86.500000 | 5.000000 | 0.0 |
1 | Bug | Fighting | 214.000000 | 80.000000 | 155.000000 | 95.000000 | 40.000000 | 100.000000 | 80.000000 | 2.000000 | 0.0 |
2 | Bug | Fire | 636.500000 | 70.000000 | 72.500000 | 60.000000 | 92.500000 | 80.000000 | 80.000000 | 5.000000 | 0.0 |
3 | Bug | Flying | 286.285714 | 63.000000 | 70.142857 | 61.571429 | 72.857143 | 69.071429 | 82.857143 | 2.857143 | 0.0 |
4 | Bug | Ghost | 292.000000 | 1.000000 | 90.000000 | 45.000000 | 30.000000 | 30.000000 | 40.000000 | 3.000000 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
131 | Water | Ice | 103.000000 | 90.000000 | 83.333333 | 113.333333 | 80.000000 | 78.333333 | 66.666667 | 1.000000 | 0.0 |
132 | Water | Poison | 118.666667 | 61.666667 | 68.333333 | 58.333333 | 61.666667 | 91.666667 | 85.000000 | 1.333333 | 0.0 |
133 | Water | Psychic | 111.800000 | 87.000000 | 73.000000 | 104.000000 | 94.000000 | 79.000000 | 44.000000 | 1.200000 | 0.0 |
134 | Water | Rock | 430.000000 | 70.750000 | 82.750000 | 112.750000 | 61.500000 | 65.000000 | 36.000000 | 3.750000 | 0.0 |
135 | Water | Steel | 395.000000 | 84.000000 | 86.000000 | 88.000000 | 111.000000 | 101.000000 | 60.000000 | 4.000000 | 0.0 |
136 rows × 11 columns
spread()
: Reshape from Long to Wide
We are unforunately going to have to step away from our wonderful pokemon dataset for this last task, reshaping data. For this example, we are going to create a mock dataset which shows points per NBA team per year. We want to transform this so that the years are columns.
#create DataFrame in long format
df = pd.DataFrame({'year': ['2011', '2011', '2011', '2011', '2012', '2012', '2012', '2012'],
'team': ['Celtics', 'Knicks', 'Lakers', 'Bulls', 'Celtics', 'Knicks', 'Lakers', 'Bulls'],
'points': [11, 8, 10, 6, 12, 5, 9, 4]})
df
year | team | points | |
---|---|---|---|
0 | 2011 | Celtics | 11 |
1 | 2011 | Knicks | 8 |
2 | 2011 | Lakers | 10 |
3 | 2011 | Bulls | 6 |
4 | 2012 | Celtics | 12 |
5 | 2012 | Knicks | 5 |
6 | 2012 | Lakers | 9 |
7 | 2012 | Bulls | 4 |
First we are going to pivot it. This takes three arguements:
- index: what will form the basis of the rows? In this case the teams
- columns: what column is being transformed into multiple columns (long to wide)
- values: what values will populate the new data frame
table = pd.pivot(df, index= "team", columns='year', values='points')
table
year | 2011 | 2012 |
---|---|---|
team | ||
Bulls | 6 | 4 |
Celtics | 11 | 12 |
Knicks | 8 | 5 |
Lakers | 10 | 9 |
Finally, we have to clean up the indexes. First, we want to reset the index so that our two indexes become variables again. Next, we want to rename the first one so its the actual index again. We do this and we will have our data just how we wanted!
table = table.reset_index().rename_axis(None, axis=1)
table
team | 2011 | 2012 | |
---|---|---|---|
0 | Bulls | 6 | 4 |
1 | Celtics | 11 | 12 |
2 | Knicks | 8 | 5 |
3 | Lakers | 10 | 9 |
melt()
: Reshape from Wide to Long
Switching back from wide to long is actually pretty easy. All that is needed to identify what variable will remain as columns and what others will turn to values.
pd.melt(table, id_vars='team', value_vars=['2011', '2012'], var_name='year', value_name='amount')
team | year | amount | |
---|---|---|---|
0 | Bulls | 2011 | 6 |
1 | Celtics | 2011 | 11 |
2 | Knicks | 2011 | 8 |
3 | Lakers | 2011 | 10 |
4 | Bulls | 2012 | 4 |
5 | Celtics | 2012 | 12 |
6 | Knicks | 2012 | 5 |
7 | Lakers | 2012 | 9 |