Advanced Filtering
Choosing what belongs in a dataset is a fundamental part of working with data, and that all starts with filtering. We are going to work with our pokemon dataset again to perform some more complicated filters.
Filter Equivalents
| Description | R | Python |
|---|---|---|
| A simple filter | filter() | df[df['col'] == 'cond'] |
| Multiple filters | filter() | df[(df['col1'] == 'cond1') & (df['col2'] == 'cond2')] |
| Filter in a list | filter(x %in% y) | df[df['col'].isin(list)] |
| Filter not in a list | filter(!x %in% y) | df[~df['col'].isin(list)] |
| Drop duplicates | filter(!duplicated()) | df.drop_duplicates() |
| Drop duplicates by columns | filter(!duplicated()) | df.drop_duplicates(subset = ['a', 'c']) |
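To make that table concrete, here is a minimal sketch that runs each pandas pattern on a tiny made-up dataframe (the column names and values are placeholders of my own, not the pokemon data loaded below):
import pandas as pd
# a throwaway dataframe just to exercise each pattern
toy = pd.DataFrame({"col1": ["a", "b", "a", "c"], "col2": [1, 2, 2, 2]})
# a simple filter
toy[toy["col1"] == "a"]
# multiple filters, with each condition wrapped in parentheses
toy[(toy["col1"] == "a") & (toy["col2"] == 2)]
# filter in / not in a list
keep = ["a", "b"]
toy[toy["col1"].isin(keep)]
toy[~toy["col1"].isin(keep)]
# drop duplicates, optionally only considering certain columns
toy.drop_duplicates()
toy.drop_duplicates(subset=["col1", "col2"])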
import pandas as pd
df = pd.read_csv("http://www.corydonbaylor.io/pokemon")
df.head(2)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
The Basics
Just as a review, here is how you do the most basic of filters:
df[df['Type 1'] == "Fire"].head(5)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | Charmander | Fire | NaN | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
| 5 | 5 | Charmeleon | Fire | NaN | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
| 6 | 6 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| 7 | 6 | CharizardMega Charizard X | Fire | Dragon | 78 | 130 | 111 | 130 | 85 | 100 | 1 | False |
| 8 | 6 | CharizardMega Charizard Y | Fire | Flying | 78 | 104 | 78 | 159 | 115 | 100 | 1 | False |
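As an aside (my own sketch, not part of the original example), you will also see the same filter written with .loc, which comes in handy once you want to pick out columns in the same step:
# same filter, written with .loc
df.loc[df['Type 1'] == "Fire"].head(5)
# .loc also lets you select columns at the same time
df.loc[df['Type 1'] == "Fire", ['Name', 'Attack', 'Defense']].head(5)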
Multiple Conditions
You can string together multiple conditions. Just make sure you wrap each condition in parentheses.
# don't forget to wrap each condition in parentheses
# OR
df[(df["Type 1"] == "Grass") | (df["Type 2"] == "Flying")]
# AND
df[(df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")]
# AND OR
new_df = df[((df["Type 1"] == "Grass") & (df["Type 2"] == "Flying")) | (df["HP"] > 200)]
new_df
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 121 | 113 | Chansey | Normal | NaN | 250 | 5 | 5 | 35 | 105 | 50 | 1 | False |
| 202 | 187 | Hoppip | Grass | Flying | 35 | 35 | 40 | 35 | 55 | 50 | 2 | False |
| 203 | 188 | Skiploom | Grass | Flying | 55 | 45 | 50 | 45 | 65 | 80 | 2 | False |
| 204 | 189 | Jumpluff | Grass | Flying | 75 | 55 | 70 | 55 | 95 | 110 | 2 | False |
| 261 | 242 | Blissey | Normal | NaN | 255 | 10 | 10 | 75 | 135 | 55 | 2 | False |
| 390 | 357 | Tropius | Grass | Flying | 99 | 68 | 83 | 72 | 87 | 51 | 3 | False |
| 551 | 492 | ShayminSky Forme | Grass | Flying | 100 | 103 | 75 | 120 | 75 | 127 | 4 | True |
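One more thing worth knowing here (a hedged sketch of my own): the pandas equivalent of R's ! for an arbitrary condition is ~, and it can wrap a whole parenthesized expression, which is handy for "everything except" filters:
# NOT (Grass AND Flying), with ~ playing the role of ! in R
df[~((df["Type 1"] == "Grass") & (df["Type 2"] == "Flying"))]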
%in%: Filtering within a List
The combination of %in% and !(x %in% y) is probably the pair of logical operators I use most in R. Luckily, the Python equivalent is super straightforward: just use .isin().
# x is in y
df[df['Type 1'].isin(['Fire', 'Water'])].head(5)
# x is not in y
df[~df['Type 1'].isin(['Fire', 'Water'])].head(5)
# or save the list to a variable first
ls = ['Bug', 'Water']
df[df['Type 1'].isin(ls)].head(5)
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 7 | Squirtle | Water | NaN | 44 | 48 | 65 | 50 | 64 | 43 | 1 | False |
| 10 | 8 | Wartortle | Water | NaN | 59 | 63 | 80 | 65 | 80 | 58 | 1 | False |
| 11 | 9 | Blastoise | Water | NaN | 79 | 83 | 100 | 85 | 105 | 78 | 1 | False |
| 12 | 9 | BlastoiseMega Blastoise | Water | NaN | 79 | 103 | 120 | 135 | 115 | 78 | 1 | False |
| 13 | 10 | Caterpie | Bug | NaN | 45 | 30 | 35 | 20 | 20 | 45 | 1 | False |
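.isin() is not limited to hand-written lists either. Just like x %in% other$y in R, you can hand it a column from another dataframe. A minimal sketch using the columns we already have (the legendary_types name is mine):
# build a Series of Type 1 values that belong to legendary pokemon
legendary_types = df[df['Legendary']]['Type 1']
# keep rows whose Type 1 shows up among the legendary types
df[df['Type 1'].isin(legendary_types)].head(5)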
!is.na(): Filtering Out NaN
Often, when joining together different datasets, you will need to remove rows that have missing values. The .notnull() method will help with just that:
# You will see that Squirtle and co. are missing now :(
df[df['Type 2'].notnull()]
| | # | Name | Type 1 | Type 2 | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 6 | 6 | Charizard | Fire | Flying | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 795 | 719 | Diancie | Rock | Fairy | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
| 796 | 719 | DiancieMega Diancie | Rock | Fairy | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
| 797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
| 798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
| 799 | 721 | Volcanion | Fire | Water | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
414 rows × 12 columns
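If you want the opposite, .isnull() keeps only the rows where Type 2 is missing. And if dropping rows with missing values is the end goal, pandas also has dropna(), which, as far as I can tell, gives the same result as the .notnull() filter here:
# keep only the rows where Type 2 IS missing
df[df['Type 2'].isnull()]
# drop rows with a missing Type 2 (same result as the .notnull() filter above)
df.dropna(subset=['Type 2'])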
Dropping Duplicates
I will admit, I think dropping duplicates is easier in pandas than in R, thanks to the df.drop_duplicates() function.
# first let's create a dataframe that has duplicates to drop
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df
| | A | B | C |
|---|---|---|---|
| 0 | foo | 0 | A |
| 1 | foo | 1 | A |
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
You can drop all entirely duplicated rows with just df.drop_duplicates(), but you can also drop duplicates based on certain columns using the subset argument. By default, it keeps the first occurrence of each duplicate and drops the rest:
df.drop_duplicates(subset=['A', 'C'])
| | A | B | C |
|---|---|---|---|
| 0 | foo | 0 | A |
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
The keep argument lets you decide whether to keep the first or last occurrence of each duplicate, or drop them all:
# keep the first (default)
df.drop_duplicates(subset=['A', 'C'], keep = 'first')
# keep the last
df.drop_duplicates(subset=['A', 'C'], keep = 'last')
# drop them all
df.drop_duplicates(subset=['A', 'C'], keep = False)
| | A | B | C |
|---|---|---|---|
| 2 | foo | 1 | B |
| 3 | bar | 1 | A |
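For completeness, pandas also has a duplicated() method that returns a boolean mask, which is the closest thing to the R duplicated() call from the equivalents table at the top. Filtering on its negation gives the same result as drop_duplicates(). A quick sketch:
# boolean mask marking the second and later occurrences as True
df.duplicated(subset=['A', 'C'])
# keep everything that is not a duplicate (same as drop_duplicates above)
df[~df.duplicated(subset=['A', 'C'])]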