Oh How I Will Miss the Tidyverse

I think the tidyverse is wonderful. I think it is easy, expressive, and legible. Not to dump on Python, but I don't think the same is true for pandas. Regardless, some people insist on using it, so if you are an avid R user, this guide is for you.

The dplyr Equivalents

Description | R | Python
Select columns | select(x, y, z) | df[["x", "y", "z"]]
Filter rows | filter(col == 'xyz') | df[df["col"] == 'xyz']
New column | mutate(col = x + y + z) | df["col"] = df["x"] + df["y"] + df["z"]
New conditional column | mutate(col = ifelse(cond, 1, 0)) | df["col"] = np.where(cond, 1, 0)
Drop column | select(-col) | df.drop(columns=["col"])
Sort rows | arrange() | df.sort_values()
Group by | group_by() %>% summarise() | df.groupby(["col"]).sum()

The tidyr Equivalents

Description | R | Python
Make wide | spread() | pd.pivot(df, index='row', columns='cols', values='val')
Make long | melt() | pd.melt(df, id_vars='x', value_vars=['y', 'z'], var_name='foo', value_name='bar')


# let's start by loading libraries and data
import pandas as pd
import numpy as np
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")

select(): Selecting a Column

Selecting columns is mostly done with bracket indexing in Python: pass a list of column names, much like looking up keys in a dictionary.

df[["Name", "HP"]][0:3]
Name HP
0 Bulbasaur 45
1 Ivysaur 60
2 Venusaur 80

However, with great effort, you can get columns via indexing.

# create a list of the columns
cols = list(df.columns.values)

# index that list
# NOTE: if you are selecting a single column you have to surround it with brackets to coerce it into a list
df[cols[0:3] + [cols[-1]] + cols[3:12]]
# Name Type 1 Legendary Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass False Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass False Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass False Poison 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass False Poison 80 100 123 122 120 80 1 False
4 4 Charmander Fire False NaN 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock True Fairy 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock True Fairy 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic True Ghost 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic True Dark 80 160 60 170 130 80 6 True
799 721 Volcanion Fire True Water 80 110 120 130 90 70 6 True

800 rows × 13 columns
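If you would rather stay closer to select()'s one-liner feel, pandas DataFrames also have a filter() method, which (despite the name) selects column labels rather than filtering rows. A small sketch on toy data standing in for a few of the pokemon columns:

```python
import pandas as pd

# toy frame standing in for the pokemon data (assumed columns)
df = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"],
                   "Type 1": ["Grass", "Grass"],
                   "HP": [45, 60],
                   "Sp. Atk": [65, 80],
                   "Sp. Def": [65, 80]})

# filter(items=...) is about the closest one-liner to dplyr's select()
picked = df.filter(items=["Name", "HP"])

# filter(like=...) keeps columns whose names contain a substring,
# loosely similar to select(starts_with("Sp"))
sp_cols = df.filter(like="Sp.")
```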

filter(): Filtering Data

The first method for filtering data is boolean indexing: df[df["col"] == "condition"]. (df.loc[df["col"] == "condition"] does the same thing.)

# one for a character filter
df[df["Type 1"] == "Grass"]
# one for a number filter
df[df["HP"] > 100]
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
44 39 Jigglypuff Normal Fairy 115 45 20 45 25 20 1 False
45 40 Wigglytuff Normal Fairy 140 70 45 85 50 45 1 False
96 89 Muk Poison NaN 105 105 75 65 100 50 1 False
120 112 Rhydon Ground Rock 105 130 120 45 45 40 1 False
121 113 Chansey Normal NaN 250 5 5 35 105 50 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
753 683 Aromatisse Fairy NaN 101 72 72 99 89 29 6 False
769 699 Aurorus Rock Ice 123 77 72 99 92 58 6 False
792 716 Xerneas Fairy NaN 126 131 95 131 98 99 6 True
793 717 Yveltal Dark Flying 126 131 95 131 98 99 6 True
794 718 Zygarde50% Forme Dragon Ground 108 100 121 81 95 95 6 True

67 rows × 12 columns
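dplyr lets you stack conditions inside one filter() call; in pandas you combine boolean masks with & and | (not the words `and`/`or`), and isin() plays the role of R's %in%. A sketch on made-up rows:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
                   "Type 1": ["Grass", "Fire", "Water", "Normal"],
                   "HP": [45, 39, 44, 40]})

# AND / OR need & and | with parentheses around each condition
starters = df[(df["Type 1"] == "Fire") | (df["Type 1"] == "Water")]

# isin() is the equivalent of R's %in%
grass_fire = df[df["Type 1"].isin(["Grass", "Fire"])]
```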

mutate(): Creating New Columns

This is the equivalent of mutate() in R.

The first method is the closest to df$new = df$old1 + df$old2 in R.

# method 1
df["total"] = df["HP"] + df["Attack"] + df["Defense"] + df["Speed"] + df["Sp. Atk"] + df["Sp. Def"]

df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary total
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False 318
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False 405

The iloc[] Method to Sum Based on Integer Location

The second method uses iloc[] to grab the columns by their integer positions and sum them all together.

# note how I put the : in for the rows
df["total_iloc"]= df.iloc[:,4:10].sum(axis=1)

df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary total total_iloc
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False 318 318
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False 405 405
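If you miss mutate()'s pipe-friendly feel, assign() is worth knowing: it returns a new DataFrame instead of modifying in place, so it chains nicely. A sketch on toy data:

```python
import pandas as pd

# toy stats standing in for the pokemon columns
df = pd.DataFrame({"HP": [45, 60], "Attack": [49, 62]})

# assign() returns a new DataFrame, so it chains like mutate() in a pipe;
# later arguments can see columns created by earlier ones via the lambda
out = df.assign(total=lambda d: d["HP"] + d["Attack"],
                strong=lambda d: d["total"] > 100)
```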

Deleting a Column

This is the equivalent of select(-col) in R

df = df.drop(columns=["total", "total_iloc"])

ifelse(): Replacing a Column

This is the first time I feel the need to dip into numpy to save the day. The where() function works very similarly to ifelse() in R. Importantly, you can set your default value to be the original column.

df["new"] = np.where(df["Type 1"] == "Fire", "Flamer", df["Type 1"])
df
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary new
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False Grass
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False Grass
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False Grass
3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False Grass
4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False Flamer
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True Rock
796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True Rock
797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True Psychic
798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True Psychic
799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True Flamer

800 rows × 13 columns
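When you have more than two branches (what case_when() handles in R), np.select() generalizes np.where(). A sketch on made-up types:

```python
import numpy as np
import pandas as pd

# toy column standing in for the pokemon "Type 1" data
df = pd.DataFrame({"Type 1": ["Fire", "Water", "Grass", "Rock"]})

conditions = [df["Type 1"] == "Fire",
              df["Type 1"] == "Water"]
choices = ["Flamer", "Swimmer"]

# like case_when(): the first matching condition wins, and default
# keeps the original value for everything else
df["new"] = np.select(conditions, choices, default=df["Type 1"].to_numpy())
```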

arrange(): Sorting Data

The equivalent of arrange() is sort_values(). You can sort ascending or descending by toggling a boolean (by default, ascending=True).

A note on booleans: in Python, they are written out as True and False. You can also use numbers, with True = 1 and False = 0.

# sorting numerics
df.sort_values("HP")
df.sort_values("HP", ascending=False)

# sorting characters
df.sort_values("Name")

# sorting by multiple things
df.sort_values(["Type 1", "Type 2", "HP"], ascending=[1,1,0])
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary new
657 596 Galvantula Bug Electric 70 77 60 97 60 108 5 False Bug
656 595 Joltik Bug Electric 50 47 50 57 50 65 5 False Bug
231 214 Heracross Bug Fighting 80 125 75 40 95 85 2 False Bug
232 214 HeracrossMega Heracross Bug Fighting 80 185 115 40 105 75 2 False Bug
698 637 Volcarona Bug Fire 85 60 65 135 105 100 5 False Bug
... ... ... ... ... ... ... ... ... ... ... ... ... ...
106 98 Krabby Water NaN 30 105 90 25 25 50 1 False Water
125 116 Horsea Water NaN 30 40 70 70 25 60 1 False Water
129 120 Staryu Water NaN 30 45 55 70 55 85 1 False Water
139 129 Magikarp Water NaN 20 10 55 15 20 80 1 False Water
381 349 Feebas Water NaN 20 15 20 10 55 80 3 False Water

800 rows × 13 columns
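One wrinkle for arrange() users: sort_values() puts NaN values last regardless of sort direction, and the na_position parameter controls that. A toy sketch with missing Type 2 values:

```python
import numpy as np
import pandas as pd

# toy rows standing in for pokemon with missing Type 2 values
df = pd.DataFrame({"Name": ["Muk", "Rhydon", "Chansey"],
                   "Type 2": [np.nan, "Rock", np.nan]})

# NaNs sort last by default, whatever the direction; na_position flips that
first = df.sort_values("Type 2", na_position="first")
last = df.sort_values("Type 2")  # default: na_position="last"
```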

group_by()%>%summarise(): Aggregating Data

A common task in data manipulation is to aggregate data based on a particular variable or two. See below for how to do that for some common operations.

# to get the count
df.groupby(['Type 1']).count()
# to get the mean (newer pandas needs numeric_only=True to skip text columns)
df.groupby(['Type 1']).mean(numeric_only=True)
# to get the median
df.groupby(['Type 1']).median(numeric_only=True)
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
Type 1
Bug 291.0 60.0 65.0 60.0 50.0 60.0 60.0 3.0 0.0
Dark 509.0 65.0 88.0 70.0 65.0 65.0 70.0 5.0 0.0
Dragon 443.5 80.0 113.5 90.0 105.0 90.0 90.0 4.0 0.0
Electric 403.5 60.0 65.0 65.0 95.0 79.5 88.0 4.0 0.0
Fairy 669.0 78.0 52.0 66.0 75.0 79.0 45.0 6.0 0.0
Fighting 308.0 70.0 100.0 70.0 40.0 63.0 60.0 3.0 0.0
Fire 289.5 70.0 84.5 64.0 85.0 67.5 78.5 3.0 0.0
Flying 677.5 79.0 85.0 75.0 103.5 80.0 116.0 5.5 0.5
Ghost 487.0 59.5 66.0 72.5 65.0 75.0 60.5 4.0 0.0
Grass 372.0 65.5 70.0 66.0 75.0 66.0 58.5 3.5 0.0
Ground 363.5 75.0 85.0 84.5 47.5 62.5 65.0 3.0 0.0
Ice 371.5 70.0 67.0 75.0 77.5 70.0 62.0 3.0 0.0
Normal 296.5 70.0 70.5 60.0 50.0 60.5 71.0 3.0 0.0
Poison 139.5 67.5 74.0 67.0 60.0 60.5 62.5 1.5 0.0
Psychic 386.0 68.0 57.0 65.0 95.0 90.0 80.0 3.0 0.0
Rock 362.5 68.5 95.0 100.0 60.0 70.0 50.0 3.0 0.0
Steel 379.0 60.0 89.0 120.0 55.0 80.0 50.0 3.0 0.0
Water 275.0 70.0 72.0 70.0 70.0 65.0 65.0 3.0 0.0
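If you want summarise()'s ability to name each output column and mix different functions, pandas named aggregation via agg() is the closest match. A sketch on toy data:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Type 1": ["Grass", "Grass", "Fire"],
                   "HP": [45, 60, 39],
                   "Attack": [49, 62, 52]})

# named aggregation reads a lot like
# summarise(mean_hp = mean(HP), max_attack = max(Attack), n = n())
out = df.groupby("Type 1").agg(mean_hp=("HP", "mean"),
                               max_attack=("Attack", "max"),
                               n=("HP", "size"))
```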

A quirk that often has to be dealt with (if you are planning on doing further manipulations) is that whatever variable you grouped on becomes the index instead of a regular column. As such, you need to reset the index after grouping.

Note: you don't need to reassign the df when you reset the index with inplace=True; it modifies the DataFrame in place.

# first create the grouped-by df
# (numeric_only=True limits the sum to the numeric columns)
grouped = df.groupby(['Type 1']).sum(numeric_only=True)
# then reset the index 
grouped.reset_index(inplace = True)
# no need to reassign
grouped
Type 1 # HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 Bug 23080 3925 4897 4880 3717 4471 4256 222 0
1 Dark 14302 2071 2740 2177 2314 2155 2361 125 2
2 Dragon 15180 2666 3588 2764 3099 2843 2657 124 12
3 Electric 15994 2631 3040 2917 3961 3243 3718 144 4
4 Fairy 7642 1260 1046 1117 1335 1440 826 70 1
5 Fighting 9824 1886 2613 1780 1434 1747 1784 91 0
6 Fire 17025 3635 4408 3524 4627 3755 3871 167 5
7 Flying 2711 283 315 265 377 290 410 22 2
8 Ghost 15568 2062 2361 2598 2539 2447 2059 134 2
9 Grass 24141 4709 5125 4956 5425 4930 4335 235 3
10 Ground 11401 2361 3064 2715 1807 2008 2045 101 4
11 Ice 10165 1728 1746 1714 1861 1831 1523 85 2
12 Normal 31279 7573 7200 5865 5470 6245 7012 299 2
13 Poison 7050 1883 2091 1927 1692 1803 1780 71 0
14 Psychic 21706 4026 4073 3858 5609 4918 4645 193 14
15 Rock 17280 2876 4086 4435 2787 3321 2460 152 4
16 Steel 11957 1761 2503 3412 1823 2177 1492 104 4
17 Water 33946 8071 8305 8170 8379 7898 7388 320 4
# you can also do this for multiple groups
grouped = df.groupby(["Type 1", "Type 2"]).mean(numeric_only=True)
grouped.reset_index(inplace = True)
grouped
Type 1 Type 2 # HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 Bug Electric 595.500000 60.000000 62.000000 55.000000 77.000000 55.000000 86.500000 5.000000 0.0
1 Bug Fighting 214.000000 80.000000 155.000000 95.000000 40.000000 100.000000 80.000000 2.000000 0.0
2 Bug Fire 636.500000 70.000000 72.500000 60.000000 92.500000 80.000000 80.000000 5.000000 0.0
3 Bug Flying 286.285714 63.000000 70.142857 61.571429 72.857143 69.071429 82.857143 2.857143 0.0
4 Bug Ghost 292.000000 1.000000 90.000000 45.000000 30.000000 30.000000 40.000000 3.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ...
131 Water Ice 103.000000 90.000000 83.333333 113.333333 80.000000 78.333333 66.666667 1.000000 0.0
132 Water Poison 118.666667 61.666667 68.333333 58.333333 61.666667 91.666667 85.000000 1.333333 0.0
133 Water Psychic 111.800000 87.000000 73.000000 104.000000 94.000000 79.000000 44.000000 1.200000 0.0
134 Water Rock 430.000000 70.750000 82.750000 112.750000 61.500000 65.000000 36.000000 3.750000 0.0
135 Water Steel 395.000000 84.000000 86.000000 88.000000 111.000000 101.000000 60.000000 4.000000 0.0

136 rows × 11 columns
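Alternatively, passing as_index=False to groupby() skips the index dance entirely: the grouping variable stays a regular column. A toy sketch:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Type 1": ["Grass", "Grass", "Fire"],
                   "HP": [45, 60, 39]})

# as_index=False keeps the grouping variable as a regular column,
# so no reset_index() is needed afterwards
grouped = df.groupby("Type 1", as_index=False)["HP"].sum()
```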

spread(): Reshape from Long to Wide

We are unfortunately going to have to step away from our wonderful pokemon dataset for this last task: reshaping data. For this example, we are going to create a mock dataset which shows points per NBA team per year. We want to transform this so that the years are columns.

# create a DataFrame in long format
df = pd.DataFrame({'year': ['2011', '2011', '2011', '2011', '2012', '2012', '2012', '2012'],
                   'team': ['Celtics', 'Knicks', 'Lakers', 'Bulls', 'Celtics', 'Knicks', 'Lakers', 'Bulls'],
                   'points': [11, 8, 10, 6, 12, 5, 9, 4]})
df
year team points
0 2011 Celtics 11
1 2011 Knicks 8
2 2011 Lakers 10
3 2011 Bulls 6
4 2012 Celtics 12
5 2012 Knicks 5
6 2012 Lakers 9
7 2012 Bulls 4

First we are going to pivot it. pd.pivot() takes three arguments: index (the column whose values become the row labels), columns (the column whose values become the new column names), and values (the column that fills in the table).

table = pd.pivot(df, index= "team", columns='year', values='points')
table
year 2011 2012
team
Bulls 6 4
Celtics 11 12
Knicks 8 5
Lakers 10 9


Finally, we have to clean up the indexes. First, we reset the index so that team becomes a regular column again. Then rename_axis(None, axis=1) drops the leftover "year" label from the columns axis. Do this and we will have our data just how we wanted!

table = table.reset_index().rename_axis(None, axis=1)
table
team 2011 2012
0 Bulls 6 4
1 Celtics 11 12
2 Knicks 8 5
3 Lakers 10 9
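One caveat: pd.pivot() raises an error if any index/column pair appears more than once. pd.pivot_table() handles duplicates by aggregating them. A sketch with made-up duplicate rows:

```python
import pandas as pd

# made-up duplicate rows: two 2011 entries for the same team
df = pd.DataFrame({"year": ["2011", "2011", "2012"],
                   "team": ["Bulls", "Bulls", "Bulls"],
                   "points": [6, 4, 9]})

# pivot() would raise ValueError here; pivot_table() aggregates the duplicates
table = pd.pivot_table(df, index="team", columns="year",
                       values="points", aggfunc="sum")
```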

melt(): Reshape from Wide to Long

Switching back from wide to long is actually pretty easy. All that is needed is to identify which variables remain as identifier columns (id_vars) and which get melted into values (value_vars).

pd.melt(table, id_vars='team', value_vars=['2011', '2012'], var_name='year', value_name='amount')
team year amount
0 Bulls 2011 6
1 Celtics 2011 11
2 Knicks 2011 8
3 Lakers 2011 10
4 Bulls 2012 4
5 Celtics 2012 12
6 Knicks 2012 5
7 Lakers 2012 9
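As a sanity check, melt() and pivot() are near-inverses: leaving out value_vars melts every non-id column, and pivoting the result (plus the index cleanup shown earlier) recovers the wide table. A toy sketch:

```python
import pandas as pd

# toy wide table (same shape as the NBA example)
wide = pd.DataFrame({"team": ["Bulls", "Celtics"],
                     "2011": [6, 11],
                     "2012": [4, 12]})

# omitting value_vars melts every column except the id_vars
long = pd.melt(wide, id_vars="team", var_name="year", value_name="points")

# pivoting the long frame (and cleaning up the indexes) recovers the wide one
back = (pd.pivot(long, index="team", columns="year", values="points")
          .reset_index()
          .rename_axis(None, axis=1))
```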