Advanced Filtering
Choosing what belongs in a data set is a fundamental part of working with data, and it all starts with filtering. We are going to work with our pokemon dataset again to perform some more complicated filters.
Filter Equivalents
Description | R | Python
---|---|---
A simple filter | filter() | df[df['col'] == 'cond']
Multiple filters | filter() | df[(df['col'] == 'cond1') & (df['col'] == 'cond2')]
Filter in a list | filter(x %in% y) | df[df['col'].isin(list)]
Filter not in a list | filter(!x %in% y) | df[~df['col'].isin(list)]
Drop duplicates | filter(!duplicated()) | df.drop_duplicates()
Drop duplicates by columns | filter(!duplicated()) | df.drop_duplicates(subset=['a', 'c'])
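Before we load the real data, here is a quick sketch of those same patterns on a tiny made-up dataframe (the column names and values here are just placeholders):

import pandas as pd

# a throwaway dataframe, only to illustrate the patterns in the table above
toy = pd.DataFrame({"col": ["a", "b", "a", "c"], "val": [1, 2, 2, 3]})

toy[toy["col"] == "a"]                       # a simple filter
toy[(toy["col"] == "a") & (toy["val"] > 1)]  # multiple filters
toy[toy["col"].isin(["a", "b"])]             # filter in a list
toy[~toy["col"].isin(["a", "b"])]            # filter not in a list
toy.drop_duplicates()                        # drop fully duplicated rows
toy.drop_duplicates(subset=["col"])          # drop duplicates by column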
import pandas as pd
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
df.head(2)
|   | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
The Basics
Just as a review, here is how you do the most basic of filters:
df[df['Type 1'] == "Fire"].head(5)
|   | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | 5 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 6 | 6 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 7 | 6 | CharizardMega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 8 | 6 | CharizardMega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
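As an aside, this bracket syntax is just boolean indexing; if you prefer, the same filter can be written with .loc, which also lets you pick columns in the same step (the columns below are chosen arbitrarily, just for illustration):

# the same filter written with .loc
df.loc[df['Type 1'] == "Fire"].head(5)

# .loc can also select columns at the same time
df.loc[df['Type 1'] == "Fire", ['Name', 'Attack']].head(5)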
Multiple Conditions
You can string together multiple conditions. Just make sure you wrap each condition in parentheses.
# don't forget to wrap each condition in parentheses
# OR
df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Flying")]
# AND
df[(df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")]
# AND OR
new_df = df[((df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")) | (df["HP"] > 200)]
new_df
|   | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 121 | 113 | Chansey | Normal | NaN | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
| 202 | 187 | Hoppip | Grass | Flying | 35 | 35 | 40 | 35 | 55 | 50 | 2 | False |
| 203 | 188 | Skiploom | Grass | Flying | 55 | 45 | 50 | 45 | 65 | 80 | 2 | False |
| 204 | 189 | Jumpluff | Grass | Flying | 75 | 55 | 70 | 55 | 95 | 110 | 2 | False |
| 261 | 242 | Blissey | Normal | NaN | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
| 390 | 357 | Tropius | Grass | Flying | 99 | 68 | 83 | 72 | 87 | 51 | 3 | False |
| 551 | 492 | ShayminSky Forme | Grass | Flying | 100 | 103 | 75 | 120 | 75 | 127 | 4 | True |
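The parentheses really do matter: in Python, & and | bind more tightly than comparisons like ==, so without them the expression is grouped the wrong way and pandas raises an error rather than filtering. A quick illustration:

# without parentheses, Python groups "Grass" & df["Type 2"] first,
# which raises a TypeError instead of filtering
# df[df["Type 1"] == "Grass" & df["Type 2"] == "Flying"]

# with parentheses, each comparison is evaluated before the &
df[(df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")]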
%in%: Filtering within a List
The combo of %in% and !(x %in% y) is probably one of the logical operator patterns I use most in R. Luckily, the equivalent in Python is super straightforward: just use .isin():
# x is in y
df[df['Type 1'].isin(['Fire', 'Water'])].head(5)
# x is not in y
df[~df['Type 1'].isin(['Fire', 'Water'])].head(5)
# you can also save the list to a variable first
ls = ['Bug', 'Water']
df[df['Type 1'].isin(ls)].head(5)
|   | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 7 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
| 10 | 8 | Wartortle | Water | NaN | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False |
| 11 | 9 | Blastoise | Water | NaN | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False |
| 12 | 9 | BlastoiseMega Blastoise | Water | NaN | 79 | 103 | 120 | 135 | 115 | 78 | 1 | False |
| 13 | 10 | Caterpie | Bug | NaN | 45 | 30 | 35 | 20 | 20 | 45 | 1 | False |
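Like any other boolean mask, .isin() can be combined with the comparisons from earlier. For example (the 100 HP cutoff below is arbitrary, just to show the combination):

# Bug or Water types with more than 100 HP
df[df['Type 1'].isin(ls) & (df['HP'] > 100)].head(5)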
!is.na(): Filtering Out NaN
As is often the case when joining together different datasets, you will need to remove rows that have missing values. The .notnull() method will help with just that:
# You will see that Squirtle and co. are missing now :(
df[df['Type 2'].notnull()]
|   | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 6 | 6 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
414 rows × 12 columns
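If you find .notnull() awkward to read, there are a couple of closely related options that get at the same thing; this is just a sketch of alternatives, not something the example above requires:

# the opposite filter: keep only rows where Type 2 is missing
df[df['Type 2'].isnull()]

# dropna with subset removes rows missing Type 2, equivalent to the .notnull() filter
df.dropna(subset=['Type 2'])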
Dropping Duplicates
I will admit, I think dropping duplicates is easier in pandas than in R, thanks to the df.drop_duplicates() function.
# first, let's create a dataframe that has duplicates to drop
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 0 | A |
| 1 | foo | 1 | A |
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
You can drop all entirely duplicated rows with just df.drop_duplicates(), but you can also drop duplicates based on certain columns using the subset argument. By default, it keeps the first occurrence of each duplicate and drops the rest:
df.drop_duplicates(subset=['A', 'C'])
|   | A | B | C |
|---|---|---|---|
| 0 | foo | 0 | A |
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
The keep argument lets you decide whether to keep the first or last occurrence of each duplicate, or drop them all:
# keep the first (default)
df.drop_duplicates(subset=['A', 'C'], keep='first')
# keep the last
df.drop_duplicates(subset=['A', 'C'], keep='last')
# drop them all
df.drop_duplicates(subset=['A', 'C'], keep=False)
|   | A | B | C |
|---|---|---|---|
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
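Finally, if you want something that reads more like R's filter(!duplicated()) from the equivalents table, you can build the same result yourself with .duplicated(), which returns a boolean mask that you can negate with ~:

# equivalent to df.drop_duplicates(subset=['A', 'C']) -- keeps the first occurrence
df[~df.duplicated(subset=['A', 'C'])]

# equivalent to keep=False -- drops every row that has a duplicate
df[~df.duplicated(subset=['A', 'C'], keep=False)]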