Advanced Filtering

Choosing what belongs in a dataset is a fundamental part of working with data, and that all starts with filtering. We are going to be working with our pokemon dataset again to perform some more complicated filters.

Filter Equivalents

Description              | R                     | Python
A simple filter          | filter()              | df[df['col'] == 'cond']
Multiple filters         | filter()              | df[(df['col'] == 'cond1') & (df['col'] == 'cond2')]
Filter in a list         | filter(x %in% y)      | df[df['col'].isin(list)]
Filter not in a list     | filter(!x %in% y)     | df[~df['col'].isin(list)]
Drop duplicates          | filter(!duplicated()) | df.drop_duplicates()
Drop duplicates by column| filter(!duplicated()) | df.drop_duplicates(subset = ['a', 'c'])
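As a quick sanity check, the pandas side of the table can be exercised on a toy frame (the column names and values here are made up purely for illustration):

```python
import pandas as pd

# toy frame standing in for any dataset (names are illustrative only)
df = pd.DataFrame({"col": ["a", "b", "a", "c"], "val": [1, 2, 1, 3]})

simple = df[df["col"] == "a"]                        # a simple filter
multi = df[(df["col"] == "a") & (df["val"] == 1)]    # multiple filters
in_list = df[df["col"].isin(["a", "b"])]             # filter in a list
not_in = df[~df["col"].isin(["a", "b"])]             # filter not in a list
deduped = df.drop_duplicates(subset=["col", "val"])  # drop duplicates by column
```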


import pandas as pd

df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

The Basics

Just as a review, here is how you do the most basic of filters:

df[df['Type 1'] == "Fire"].head(5)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False
5 5 Charmeleon Fire NaN 58 64 58 80 65 80 1 False
6 6 Charizard Fire Flying 78 84 78 109 85 100 1 False
7 6 CharizardMega Charizard X Fire Dragon 78 130 111 130 85 100 1 False
8 6 CharizardMega Charizard Y Fire Flying 78 104 78 159 115 100 1 False

Multiple Conditions

You can string together multiple conditions. Just make sure you wrap each condition in parentheses.

# don't forget to wrap each condition in parentheses

# OR
df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Flying")]
# AND
df[(df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")]

# AND OR
new_df = df[((df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")) | (df["HP"] > 200)]
new_df
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
121 113 Chansey Normal NaN 250 5 5 35 105 50 1 False
202 187 Hoppip Grass Flying 35 35 40 35 55 50 2 False
203 188 Skiploom Grass Flying 55 45 50 45 65 80 2 False
204 189 Jumpluff Grass Flying 75 55 70 55 95 110 2 False
261 242 Blissey Normal NaN 255 10 10 75 135 55 2 False
390 357 Tropius Grass Flying 99 68 83 72 87 51 3 False
551 492 ShayminSky Forme Grass Flying 100 103 75 120 75 127 4 True
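When conditions start to pile up, pandas also offers df.query(), which can read a bit closer to a dplyr filter. Here is a sketch of the same AND/OR logic on a small made-up frame rather than the full pokemon data; backticks let you reference column names that contain spaces:

```python
import pandas as pd

# small stand-in frame with the same columns as the example above
df = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Grass"],
    "Type 2": ["Flying", None, "Poison"],
    "HP": [75, 250, 45],
})

# backticks handle column names with spaces
result = df.query("(`Type 1` == 'Grass' and `Type 2` == 'Flying') or HP > 200")
```

Whether boolean masks or query() read better is largely a matter of taste; masks are more common in the wild.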

%in%: Filtering within a List

The combo of %in% and !(%in%) is probably the pair of logical operators I use most in R. Luckily, the equivalent in Python is super straightforward: just use .isin().

# x is in y
df[df['Type 1'].isin(['Fire', 'Water'])].head(5)
# x is not in y
df[~df['Type 1'].isin(['Fire', 'Water'])].head(5)

# saving a list as something
ls = ['Bug', 'Water']
df[df['Type 1'].isin(ls)].head(5)

# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
9 7 Squirtle Water NaN 44 48 65 50 64 43 1 False
10 8 Wartortle Water NaN 59 63 80 65 80 58 1 False
11 9 Blastoise Water NaN 79 83 100 85 105 78 1 False
12 9 BlastoiseMega Blastoise Water NaN 79 103 120 135 115 78 1 False
13 10 Caterpie Bug NaN 45 30 35 20 20 45 1 False

!is.na(): Filtering Out NaN

Often, when joining together different datasets, you will need to remove rows that have missing values. The .notnull() method will help with just that:

# You will see that Squirtle and co. are missing now :(
df[df['Type 2'].notnull()]
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False
6 6 Charizard Fire Flying 78 84 78 109 85 100 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True

414 rows × 12 columns
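If the goal is simply to drop rows with missing values, dropna() with a subset argument is a common shorthand for the same boolean-mask filter. A minimal sketch on a tiny stand-in frame (the rows here are made up for illustration):

```python
import pandas as pd

# stand-in frame: one row with a missing Type 2
df = pd.DataFrame({
    "Name": ["Squirtle", "Charizard"],
    "Type 2": [None, "Flying"],
})

# equivalent to df[df['Type 2'].notnull()]
kept = df.dropna(subset=["Type 2"])
```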

Dropping Duplicates

I will admit, I think dropping duplicates is easier in pandas than in R, thanks to the df.drop_duplicates() function.

# first lets create a dataframe that has duplicates to drop
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A

You can drop all entirely duplicated rows with just df.drop_duplicates(), but you can also drop duplicates based on certain columns using the subset argument. By default, it will keep the first occurrence and drop any later duplicates:

df.drop_duplicates(subset=['A', 'C'])
A B C
0 foo 0 A
2 foo 1 B
3 bar 1 A

The keep argument lets you decide whether to keep the first occurrence, keep the last, or drop all duplicated rows:

# keep the first (default)
df.drop_duplicates(subset=['A', 'C'], keep = 'first')
# keep the last
df.drop_duplicates(subset=['A', 'C'], keep = 'last')
# drop them all
df.drop_duplicates(subset=['A', 'C'], keep = False)
A B C
2 foo 1 B
3 bar 1 A
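Before dropping anything, it can be worth inspecting which rows count as duplicates. df.duplicated() returns a boolean mask you can filter on; with keep=False it flags every row involved in a duplication, not just the later copies. Rebuilding the same toy frame from above:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False flags every row that has a duplicate under subset=['A', 'C']
dupes = df[df.duplicated(subset=["A", "C"], keep=False)]
```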