Oh How I Will Miss the Tidyverse

I think the tidyverse is wonderful. I think it is easy, expressive, and legible. Not to dump on Python, but I don't think the same is true for pandas. Regardless, some people insist on using it, so if you are an avid R user, this guide is for you.

The dplyr Equivalents

Description | R | Python
Select columns | select(x, y, z) | df[["x", "y", "z"]]
Filter rows | filter(col == 'xyz') | df[df["col"] == 'xyz']
New column | mutate(col = x + y + z) | df["col"] = df["x"] + df["y"] + df["z"]
New conditional column | mutate(col = ifelse(cond, 1, 0)) | df["col"] = np.where(cond, 1, 0)
Drop column | select(-col) | df.drop(columns=["col"])
Sort rows | arrange() | df.sort_values()
Group by | group_by() %>% summarise() | df.groupby(["col"]).sum()

The tidyr Equivalents

Description | R | Python
Make wide | spread() | pd.pivot(df, index='row', columns='cols', values='val')
Make long | melt() | pd.melt(df, id_vars='x', value_vars=['y', 'z'], var_name='foo', value_name='bar')


# let's start by loading libraries and data
import pandas as pd
import numpy as np
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")

select(): Selecting a Column

Selecting columns is mostly done with bracket indexing in Python: pass a list of column names, much like looking up keys in a dictionary.

df[["Name", "HP"]][0:3]
Name HP
0 Bulbasaur 45
1 Ivysaur 60
2 Venusaur 80

However, with great effort, you can get columns via indexing.

# create a list of the columns
cols = list(df.columns.values)

# index that list
# NOTE: if you are selecting a single column you have to surround it with brackets to coerce it into a list
df[cols[0:3] + [cols[-1]] + cols[3:12]]
# Name Type 1 Legendary Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass False Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass False Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass False Poison 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass False Poison 80 100 123 122 120 80 1 False
4 4 Charmander Fire False NaN 39 52 43 60 50 65 1 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock True Fairy 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock True Fairy 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic True Ghost 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic True Dark 80 160 60 170 130 80 6 True
799 721 Volcanion Fire True Water 80 110 120 130 90 70 6 True

800 rows × 13 columns
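If you would rather stay closer to select()'s one-liner feel, pandas DataFrames also have a filter() method, which (despite the name) selects column labels rather than filtering rows. A small sketch on toy data standing in for a few of the pokemon columns:

```python
import pandas as pd

# toy frame standing in for the pokemon data (assumed columns)
df = pd.DataFrame({"Name": ["Bulbasaur", "Ivysaur"],
                   "Type 1": ["Grass", "Grass"],
                   "HP": [45, 60],
                   "Sp. Atk": [65, 80],
                   "Sp. Def": [65, 80]})

# filter(items=...) is about the closest one-liner to dplyr's select()
picked = df.filter(items=["Name", "HP"])

# filter(like=...) keeps columns whose names contain a substring,
# loosely similar to select(starts_with("Sp"))
sp_cols = df.filter(like="Sp.")
```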

filter(): Filtering Data

The first method for filtering data is boolean indexing: df[df["col"] == "condition"]. (df.loc[df["col"] == "condition"] does the same thing.)

# one for a character filter
df[df["Type 1"] == "Grass"]
# one for a number filter
df[df["HP"] > 100]
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
44 39 Jigglypuff Normal Fairy 115 45 20 45 25 20 1 False
45 40 Wigglytuff Normal Fairy 140 70 45 85 50 45 1 False
96 89 Muk Poison NaN 105 105 75 65 100 50 1 False
120 112 Rhydon Ground Rock 105 130 120 45 45 40 1 False
121 113 Chansey Normal NaN 250 5 5 35 105 50 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
753 683 Aromatisse Fairy NaN 101 72 72 99 89 29 6 False
769 699 Aurorus Rock Ice 123 77 72 99 92 58 6 False
792 716 Xerneas Fairy NaN 126 131 95 131 98 99 6 True
793 717 Yveltal Dark Flying 126 131 95 131 98 99 6 True
794 718 Zygarde50% Forme Dragon Ground 108 100 121 81 95 95 6 True

67 rows × 12 columns
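dplyr lets you stack conditions inside one filter() call; in pandas you combine boolean masks with & and | (not the words `and`/`or`), and isin() plays the role of R's %in%. A sketch on made-up rows:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Name": ["Bulbasaur", "Charmander", "Squirtle", "Pidgey"],
                   "Type 1": ["Grass", "Fire", "Water", "Normal"],
                   "HP": [45, 39, 44, 40]})

# AND / OR need & and | with parentheses around each condition
starters = df[(df["Type 1"] == "Fire") | (df["Type 1"] == "Water")]

# isin() is the equivalent of R's %in%
grass_fire = df[df["Type 1"].isin(["Grass", "Fire"])]
```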

mutate(): Creating New Columns

This is the equivalent of mutate() in R.

The first method is the closest to df$new = df$old1 + df$old2 in R.

# method 1
df["total"] = df["HP"] + df["Attack"] + df["Defense"] + df["Speed"] + df["Sp. Atk"] + df["Sp. Def"]

df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary total
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False 318
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False 405

The iloc[] Method to Sum Based on Integer Location

The second method uses iloc[] to grab the columns by their integer positions and sum them all together.

# note how I put the : in for the rows
df["total_iloc"]= df.iloc[:,4:10].sum(axis=1)

df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary total total_iloc
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False 318 318
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False 405 405
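If you miss mutate()'s pipe-friendly feel, assign() is worth knowing: it returns a new DataFrame instead of modifying in place, so it chains nicely. A sketch on toy data:

```python
import pandas as pd

# toy stats standing in for the pokemon columns
df = pd.DataFrame({"HP": [45, 60], "Attack": [49, 62]})

# assign() returns a new DataFrame, so it chains like mutate() in a pipe;
# later arguments can see columns created by earlier ones via the lambda
out = df.assign(total=lambda d: d["HP"] + d["Attack"],
                strong=lambda d: d["total"] > 100)
```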

Deleting a Column

This is the equivalent of select(-col) in R

df = df.drop(columns=["total", "total_iloc"])

ifelse(): Replacing a Column

This is the first time I feel the need to dip into numpy to save the day. The where() function works very similarly to ifelse() in R. Importantly, you can set your default value to be the original column.

df["new"] = np.where(df["Type 1"] == "Fire", "Flamer", df["Type 1"])
df
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary new
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False Grass
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False Grass
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False Grass
3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False Grass
4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False Flamer
... ... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True Rock
796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True Rock
797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True Psychic
798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True Psychic
799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True Flamer

800 rows × 13 columns
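When you have more than two branches (what case_when() handles in R), np.select() generalizes np.where(). A sketch on made-up types:

```python
import numpy as np
import pandas as pd

# toy column standing in for the pokemon "Type 1" data
df = pd.DataFrame({"Type 1": ["Fire", "Water", "Grass", "Rock"]})

conditions = [df["Type 1"] == "Fire",
              df["Type 1"] == "Water"]
choices = ["Flamer", "Swimmer"]

# like case_when(): the first matching condition wins, and default
# keeps the original value for everything else
df["new"] = np.select(conditions, choices, default=df["Type 1"].to_numpy())
```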

arrange(): Sorting Data

The equivalent of arrange() is sort_values(). You can sort ascending or descending by toggling a boolean (by default, ascending=True).

A note on booleans: in Python, they are written out as True and False. You can also use numbers, with True = 1 and False = 0.

# sorting numerics
df.sort_values("HP")
df.sort_values("HP", ascending=False)

# sorting characters
df.sort_values("Name")

# sorting by multiple things
df.sort_values(["Type 1", "Type 2", "HP"], ascending=[1,1,0])
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary new
657 596 Galvantula Bug Electric 70 77 60 97 60 108 5 False Bug
656 595 Joltik Bug Electric 50 47 50 57 50 65 5 False Bug
231 214 Heracross Bug Fighting 80 125 75 40 95 85 2 False Bug
232 214 HeracrossMega Heracross Bug Fighting 80 185 115 40 105 75 2 False Bug
698 637 Volcarona Bug Fire 85 60 65 135 105 100 5 False Bug
... ... ... ... ... ... ... ... ... ... ... ... ... ...
106 98 Krabby Water NaN 30 105 90 25 25 50 1 False Water
125 116 Horsea Water NaN 30 40 70 70 25 60 1 False Water
129 120 Staryu Water NaN 30 45 55 70 55 85 1 False Water
139 129 Magikarp Water NaN 20 10 55 15 20 80 1 False Water
381 349 Feebas Water NaN 20 15 20 10 55 80 3 False Water

800 rows × 13 columns
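One wrinkle for arrange() users: sort_values() puts NaN values last regardless of sort direction, and the na_position parameter controls that. A toy sketch with missing Type 2 values:

```python
import numpy as np
import pandas as pd

# toy rows standing in for pokemon with missing Type 2 values
df = pd.DataFrame({"Name": ["Muk", "Rhydon", "Chansey"],
                   "Type 2": [np.nan, "Rock", np.nan]})

# NaNs sort last by default, whatever the direction; na_position flips that
first = df.sort_values("Type 2", na_position="first")
last = df.sort_values("Type 2")  # default: na_position="last"
```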

group_by()%>%summarise(): Aggregating Data

A common task in data manipulation is to aggregate data based on a particular variable or two. See below for how to do that for some common operations.

# to get the count
df.groupby(['Type 1']).count()
# to get the mean (newer pandas needs numeric_only=True to skip text columns)
df.groupby(['Type 1']).mean(numeric_only=True)
# to get the median
df.groupby(['Type 1']).median(numeric_only=True)
# HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
Type 1
Bug 291.0 60.0 65.0 60.0 50.0 60.0 60.0 3.0 0.0
Dark 509.0 65.0 88.0 70.0 65.0 65.0 70.0 5.0 0.0
Dragon 443.5 80.0 113.5 90.0 105.0 90.0 90.0 4.0 0.0
Electric 403.5 60.0 65.0 65.0 95.0 79.5 88.0 4.0 0.0
Fairy 669.0 78.0 52.0 66.0 75.0 79.0 45.0 6.0 0.0
Fighting 308.0 70.0 100.0 70.0 40.0 63.0 60.0 3.0 0.0
Fire 289.5 70.0 84.5 64.0 85.0 67.5 78.5 3.0 0.0
Flying 677.5 79.0 85.0 75.0 103.5 80.0 116.0 5.5 0.5
Ghost 487.0 59.5 66.0 72.5 65.0 75.0 60.5 4.0 0.0
Grass 372.0 65.5 70.0 66.0 75.0 66.0 58.5 3.5 0.0
Ground 363.5 75.0 85.0 84.5 47.5 62.5 65.0 3.0 0.0
Ice 371.5 70.0 67.0 75.0 77.5 70.0 62.0 3.0 0.0
Normal 296.5 70.0 70.5 60.0 50.0 60.5 71.0 3.0 0.0
Poison 139.5 67.5 74.0 67.0 60.0 60.5 62.5 1.5 0.0
Psychic 386.0 68.0 57.0 65.0 95.0 90.0 80.0 3.0 0.0
Rock 362.5 68.5 95.0 100.0 60.0 70.0 50.0 3.0 0.0
Steel 379.0 60.0 89.0 120.0 55.0 80.0 50.0 3.0 0.0
Water 275.0 70.0 72.0 70.0 70.0 65.0 65.0 3.0 0.0
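If you want summarise()'s ability to name each output column and mix different functions, pandas named aggregation via agg() is the closest match. A sketch on toy data:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Type 1": ["Grass", "Grass", "Fire"],
                   "HP": [45, 60, 39],
                   "Attack": [49, 62, 52]})

# named aggregation reads a lot like
# summarise(mean_hp = mean(HP), max_attack = max(Attack), n = n())
out = df.groupby("Type 1").agg(mean_hp=("HP", "mean"),
                               max_attack=("Attack", "max"),
                               n=("HP", "size"))
```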

A quirk that often has to be dealt with (if you are planning on doing further manipulations) is that whatever variable you grouped on becomes the index instead of a regular column. As such, you need to reset the index after grouping.

Note: you don't need to reassign the df when you reset the index with inplace=True; it modifies the DataFrame in place.

# first create the grouped-by df
# (numeric_only=True limits the sum to the numeric columns)
grouped = df.groupby(['Type 1']).sum(numeric_only=True)
# then reset the index 
grouped.reset_index(inplace = True)
# no need to reassign
grouped
Type 1 # HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 Bug 23080 3925 4897 4880 3717 4471 4256 222 0
1 Dark 14302 2071 2740 2177 2314 2155 2361 125 2
2 Dragon 15180 2666 3588 2764 3099 2843 2657 124 12
3 Electric 15994 2631 3040 2917 3961 3243 3718 144 4
4 Fairy 7642 1260 1046 1117 1335 1440 826 70 1
5 Fighting 9824 1886 2613 1780 1434 1747 1784 91 0
6 Fire 17025 3635 4408 3524 4627 3755 3871 167 5
7 Flying 2711 283 315 265 377 290 410 22 2
8 Ghost 15568 2062 2361 2598 2539 2447 2059 134 2
9 Grass 24141 4709 5125 4956 5425 4930 4335 235 3
10 Ground 11401 2361 3064 2715 1807 2008 2045 101 4
11 Ice 10165 1728 1746 1714 1861 1831 1523 85 2
12 Normal 31279 7573 7200 5865 5470 6245 7012 299 2
13 Poison 7050 1883 2091 1927 1692 1803 1780 71 0
14 Psychic 21706 4026 4073 3858 5609 4918 4645 193 14
15 Rock 17280 2876 4086 4435 2787 3321 2460 152 4
16 Steel 11957 1761 2503 3412 1823 2177 1492 104 4
17 Water 33946 8071 8305 8170 8379 7898 7388 320 4
# you can also do this for multiple groups
grouped = df.groupby(["Type 1", "Type 2"]).mean(numeric_only=True)
grouped.reset_index(inplace = True)
grouped
Type 1 Type 2 # HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 Bug Electric 595.500000 60.000000 62.000000 55.000000 77.000000 55.000000 86.500000 5.000000 0.0
1 Bug Fighting 214.000000 80.000000 155.000000 95.000000 40.000000 100.000000 80.000000 2.000000 0.0
2 Bug Fire 636.500000 70.000000 72.500000 60.000000 92.500000 80.000000 80.000000 5.000000 0.0
3 Bug Flying 286.285714 63.000000 70.142857 61.571429 72.857143 69.071429 82.857143 2.857143 0.0
4 Bug Ghost 292.000000 1.000000 90.000000 45.000000 30.000000 30.000000 40.000000 3.000000 0.0
... ... ... ... ... ... ... ... ... ... ... ...
131 Water Ice 103.000000 90.000000 83.333333 113.333333 80.000000 78.333333 66.666667 1.000000 0.0
132 Water Poison 118.666667 61.666667 68.333333 58.333333 61.666667 91.666667 85.000000 1.333333 0.0
133 Water Psychic 111.800000 87.000000 73.000000 104.000000 94.000000 79.000000 44.000000 1.200000 0.0
134 Water Rock 430.000000 70.750000 82.750000 112.750000 61.500000 65.000000 36.000000 3.750000 0.0
135 Water Steel 395.000000 84.000000 86.000000 88.000000 111.000000 101.000000 60.000000 4.000000 0.0

136 rows × 11 columns
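Alternatively, passing as_index=False to groupby() skips the index dance entirely: the grouping variable stays a regular column. A toy sketch:

```python
import pandas as pd

# toy rows standing in for the pokemon data
df = pd.DataFrame({"Type 1": ["Grass", "Grass", "Fire"],
                   "HP": [45, 60, 39]})

# as_index=False keeps the grouping variable as a regular column,
# so no reset_index() is needed afterwards
grouped = df.groupby("Type 1", as_index=False)["HP"].sum()
```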

spread(): Reshape from Long to Wide

We are unfortunately going to have to step away from our wonderful pokemon dataset for this last task: reshaping data. For this example, we are going to create a mock dataset which shows points per NBA team per year. We want to transform this so that the years are columns.

# create a DataFrame in long format
df = pd.DataFrame({'year': ['2011', '2011', '2011', '2011', '2012', '2012', '2012', '2012'],
                   'team': ['Celtics', 'Knicks', 'Lakers', 'Bulls', 'Celtics', 'Knicks', 'Lakers', 'Bulls'],
                   'points': [11, 8, 10, 6, 12, 5, 9, 4]})
df
year team points
0 2011 Celtics 11
1 2011 Knicks 8
2 2011 Lakers 10
3 2011 Bulls 6
4 2012 Celtics 12
5 2012 Knicks 5
6 2012 Lakers 9
7 2012 Bulls 4

First we are going to pivot it. pd.pivot() takes three arguments: index (the column whose values become the row labels), columns (the column whose values become the new column names), and values (the column that fills in the table).

table = pd.pivot(df, index= "team", columns='year', values='points')
table
year 2011 2012
team
Bulls 6 4
Celtics 11 12
Knicks 8 5
Lakers 10 9


Finally, we have to clean up the indexes. First, we reset the index so that team becomes a regular column again. Then rename_axis(None, axis=1) drops the leftover "year" label from the columns axis. Do this and we will have our data just how we wanted!

table = table.reset_index().rename_axis(None, axis=1)
table
team 2011 2012
0 Bulls 6 4
1 Celtics 11 12
2 Knicks 8 5
3 Lakers 10 9
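One caveat: pd.pivot() raises an error if any index/column pair appears more than once. pd.pivot_table() handles duplicates by aggregating them. A sketch with made-up duplicate rows:

```python
import pandas as pd

# made-up duplicate rows: two 2011 entries for the same team
df = pd.DataFrame({"year": ["2011", "2011", "2012"],
                   "team": ["Bulls", "Bulls", "Bulls"],
                   "points": [6, 4, 9]})

# pivot() would raise ValueError here; pivot_table() aggregates the duplicates
table = pd.pivot_table(df, index="team", columns="year",
                       values="points", aggfunc="sum")
```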

melt(): Reshape from Wide to Long

Switching back from wide to long is actually pretty easy. All that is needed is to identify which variables remain as identifier columns (id_vars) and which get melted into values (value_vars).

pd.melt(table, id_vars='team', value_vars=['2011', '2012'], var_name='year', value_name='amount')
team year amount
0 Bulls 2011 6
1 Celtics 2011 11
2 Knicks 2011 8
3 Lakers 2011 10
4 Bulls 2012 4
5 Celtics 2012 12
6 Knicks 2012 5
7 Lakers 2012 9
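As a sanity check, melt() and pivot() are near-inverses: leaving out value_vars melts every non-id column, and pivoting the result (plus the index cleanup shown earlier) recovers the wide table. A toy sketch:

```python
import pandas as pd

# toy wide table (same shape as the NBA example)
wide = pd.DataFrame({"team": ["Bulls", "Celtics"],
                     "2011": [6, 11],
                     "2012": [4, 12]})

# omitting value_vars melts every column except the id_vars
long = pd.melt(wide, id_vars="team", var_name="year", value_name="points")

# pivoting the long frame (and cleaning up the indexes) recovers the wide one
back = (pd.pivot(long, index="team", columns="year", values="points")
          .reset_index()
          .rename_axis(None, axis=1))
```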