Advanced Filtering

Choosing what belongs in a dataset is a fundamental part of working with data, and that all starts with filtering. We are going to be working with our pokemon dataset again to perform some more complicated filters.

Filter Equivalents

Description              | R                     | Python
A simple filter          | filter()              | df[df['col'] == 'cond']
Multiple filters         | filter()              | df[(df['col'] == 'cond1') & (df['col'] == 'cond2')]
Filter in a list         | filter(x %in% y)      | df[df['col'].isin(list)]
Filter not in a list     | filter(!x %in% y)     | df[~df['col'].isin(list)]
Drop duplicates          | filter(!duplicated()) | df.drop_duplicates()
Drop duplicates by column| filter(!duplicated()) | df.drop_duplicates(subset = ['a', 'c'])
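As a quick sanity check, the pandas side of the table can be exercised on a toy frame (the column names and values here are made up purely for illustration):

```python
import pandas as pd

# toy frame standing in for any dataset (names are illustrative only)
df = pd.DataFrame({"col": ["a", "b", "a", "c"], "val": [1, 2, 1, 3]})

simple = df[df["col"] == "a"]                        # a simple filter
multi = df[(df["col"] == "a") & (df["val"] == 1)]    # multiple filters
in_list = df[df["col"].isin(["a", "b"])]             # filter in a list
not_in = df[~df["col"].isin(["a", "b"])]             # filter not in a list
deduped = df.drop_duplicates(subset=["col", "val"])  # drop duplicates by column
```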


import pandas as pd

df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
df.head(2)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False

The Basics

Just as a review, here is how you do the most basic of filters:

df[df['Type 1'] == "Fire"].head(5)
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
4 4 Charmander Fire NaN 39 52 43 60 50 65 1 False
5 5 Charmeleon Fire NaN 58 64 58 80 65 80 1 False
6 6 Charizard Fire Flying 78 84 78 109 85 100 1 False
7 6 CharizardMega Charizard X Fire Dragon 78 130 111 130 85 100 1 False
8 6 CharizardMega Charizard Y Fire Flying 78 104 78 159 115 100 1 False

Multiple Conditions

You can string together multiple conditions. Just make sure you wrap each condition in parentheses.

# don't forget to wrap each condition in parentheses

# OR
df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Flying")]
# AND
df[(df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")]

# AND OR
new_df = df[((df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")) | (df["HP"] > 200)]
new_df
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
121 113 Chansey Normal NaN 250 5 5 35 105 50 1 False
202 187 Hoppip Grass Flying 35 35 40 35 55 50 2 False
203 188 Skiploom Grass Flying 55 45 50 45 65 80 2 False
204 189 Jumpluff Grass Flying 75 55 70 55 95 110 2 False
261 242 Blissey Normal NaN 255 10 10 75 135 55 2 False
390 357 Tropius Grass Flying 99 68 83 72 87 51 3 False
551 492 ShayminSky Forme Grass Flying 100 103 75 120 75 127 4 True
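When conditions start to pile up, pandas also offers df.query(), which can read a bit closer to a dplyr filter. Here is a sketch of the same AND/OR logic on a small made-up frame rather than the full pokemon data; backticks let you reference column names that contain spaces:

```python
import pandas as pd

# small stand-in frame with the same columns as the example above
df = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Grass"],
    "Type 2": ["Flying", None, "Poison"],
    "HP": [75, 250, 45],
})

# backticks handle column names with spaces
result = df.query("(`Type 1` == 'Grass' and `Type 2` == 'Flying') or HP > 200")
```

Whether boolean masks or query() read better is largely a matter of taste; masks are more common in the wild.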

%in%: Filtering within a List

The combo of %in% and !(%in%) is probably the pair of logical operators I use most in R. Luckily, the equivalent in Python is super straightforward: just use .isin().

# x is in y
df[df['Type 1'].isin(['Fire', 'Water'])].head(5)
# x is not in y
df[~df['Type 1'].isin(['Fire', 'Water'])].head(5)

# saving a list as something
ls = ['Bug', 'Water']
df[df['Type 1'].isin(ls)].head(5)

# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
9 7 Squirtle Water NaN 44 48 65 50 64 43 1 False
10 8 Wartortle Water NaN 59 63 80 65 80 58 1 False
11 9 Blastoise Water NaN 79 83 100 85 105 78 1 False
12 9 BlastoiseMega Blastoise Water NaN 79 103 120 135 115 78 1 False
13 10 Caterpie Bug NaN 45 30 35 20 20 45 1 False

!is.na(): Filtering Out NaN

Often, when joining together different datasets, you will need to remove rows that have missing values. The .notnull() method will help with just that:

# You will see that Squirtle and co. are missing now :(
df[df['Type 2'].notnull()]
# Name Type 1 Type 2 HP Attack Defense Sp. Atk Sp. Def Speed Generation Legendary
0 1 Bulbasaur Grass Poison 45 49 49 65 65 45 1 False
1 2 Ivysaur Grass Poison 60 62 63 80 80 60 1 False
2 3 Venusaur Grass Poison 80 82 83 100 100 80 1 False
3 3 VenusaurMega Venusaur Grass Poison 80 100 123 122 120 80 1 False
6 6 Charizard Fire Flying 78 84 78 109 85 100 1 False
... ... ... ... ... ... ... ... ... ... ... ... ...
795 719 Diancie Rock Fairy 50 100 150 100 150 50 6 True
796 719 DiancieMega Diancie Rock Fairy 50 160 110 160 110 110 6 True
797 720 HoopaHoopa Confined Psychic Ghost 80 110 60 150 130 70 6 True
798 720 HoopaHoopa Unbound Psychic Dark 80 160 60 170 130 80 6 True
799 721 Volcanion Fire Water 80 110 120 130 90 70 6 True

414 rows × 12 columns
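If the goal is simply to drop rows with missing values, dropna() with a subset argument is a common shorthand for the same boolean-mask filter. A minimal sketch on a tiny stand-in frame (the rows here are made up for illustration):

```python
import pandas as pd

# stand-in frame: one row with a missing Type 2
df = pd.DataFrame({
    "Name": ["Squirtle", "Charizard"],
    "Type 2": [None, "Flying"],
})

# equivalent to df[df['Type 2'].notnull()]
kept = df.dropna(subset=["Type 2"])
```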

Dropping Duplicates

I will admit, I think dropping duplicates is easier in pandas than in R, thanks to the df.drop_duplicates() function.

# first lets create a dataframe that has duplicates to drop
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A

You can drop all entirely duplicated rows with just df.drop_duplicates(), but you can also drop duplicates based on certain columns using the subset argument. By default, it will keep the first occurrence and drop any later duplicates:

df.drop_duplicates(subset=['A', 'C'])
A B C
0 foo 0 A
2 foo 1 B
3 bar 1 A

The keep argument lets you decide whether to keep the first occurrence, keep the last, or drop all duplicated rows:

# keep the first (default)
df.drop_duplicates(subset=['A', 'C'], keep = 'first')
# keep the last
df.drop_duplicates(subset=['A', 'C'], keep = 'last')
# drop them all
df.drop_duplicates(subset=['A', 'C'], keep = False)
A B C
2 foo 1 B
3 bar 1 A
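Before dropping anything, it can be worth inspecting which rows count as duplicates. df.duplicated() returns a boolean mask you can filter on; with keep=False it flags every row involved in a duplication, not just the later copies. Rebuilding the same toy frame from above:

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# keep=False flags every row that has a duplicate under subset=['A', 'C']
dupes = df[df.duplicated(subset=["A", "C"], keep=False)]
```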