Combining Dataframes
Pretty much any complex analysis is going to require you to combine different datasets. Here is how to do that in python.
I am going to create two dataframes based around Pokemon. Unfortunately, I am going to make up the data, so don't use this for official pokemon reporting.
R equivalents:
inner_join()
=pd.merge(df1, df2, on = 'id')
left_join()
=pd.merge(df1, df2, on = 'id', how = 'left')
right_join()
=pd.merge(df1, df2, on = 'id', how = 'right')
outer_join()
=pd.merge(df1, df2, on = 'id', how = 'outer')
rbind()
=pd.concat([df1, df2])
# first we are going to create two datasets.
import pandas as pd
# this is based on the games and
df1 = pd.DataFrame({
"id" : [25, 9, 59, 56, 104],
"pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
"type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Ground'],
"hp" : [35, 79, 90, 40, 60]
})
# this one is missing the pokemon column
# but has two additional rows: alakazam and gengar
# the notes refer to show notes
df2 = pd.DataFrame({
"id": [25, 9, 59, 56, 65, 94],
"episode": [1,15, 38, 12, 104, 16],
"type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Psychic', 'Ghost'],
"notes": ['main character', 'not on ashes team', 'in credits',
'idk', 'magician', 'ghost']
})
# same info as df1 but different column name
# the notes refer to game notes
df3 = pd.DataFrame({
"ids" : [25, 9, 59, 56, 104],
"pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
"hp" : [50, 150, 90, 35, 60],
"notes": ['fast but low hp', 'slow, lotsa hp', 'doggie!', 'monkey!', 'cool hat']
})
inner_join()
: Merge Method
The first method we are going to use is the merge method. If we simply use pd.merge()
it will perform an inner join, which as you can see will drop poor cubone since he isn't in the first dataframe.
pd.merge(df1, df2)
id | pokemon | type | hp | episode | notes | |
---|---|---|---|---|---|---|
0 | 25 | Pikachu | Electric | 35 | 1 | main character |
1 | 9 | Blastoise | Water | 79 | 15 | not on ashes team |
2 | 59 | Arcanine | Fire | 90 | 38 | in credits |
3 | 56 | Mankey | Fighting | 40 | 12 | idk |
It is usually good pratice to specify the column you are joining on (which sometimes will have different names).
# its good practice to specify the name you are merging on
pd.merge(df1, df2, on ='id')
# here is how to merge when the dfs have different column names
# notice how it keeps both id columns
pd.merge(df3, df2, left_on ='ids', right_on='id')
ids | pokemon | hp | notes_x | id | episode | type | notes_y | |
---|---|---|---|---|---|---|---|---|
0 | 25 | Pikachu | 50 | fast but low hp | 25 | 1 | Electric | main character |
1 | 9 | Blastoise | 150 | slow, lotsa hp | 9 | 15 | Water | not on ashes team |
2 | 59 | Arcanine | 90 | doggie! | 59 | 38 | Fire | in credits |
3 | 56 | Mankey | 35 | monkey! | 56 | 12 | Fighting | idk |
If we had tried to join df3 and df2 without specifying the column, than it would have tried to join it on notes as well and not found any matches, but because we specified that we just wanted it joined on the id columns, it appended an _x
and _y
to indicate which df each came from.
You can specify what you want appended as seen below:
pd.merge(df3, df2, left_on ='ids', right_on='id', suffixes = ['_left', '_right'])
ids | pokemon | hp | notes_left | id | episode | type | notes_right | |
---|---|---|---|---|---|---|---|---|
0 | 25 | Pikachu | 50 | fast but low hp | 25 | 1 | Electric | main character |
1 | 9 | Blastoise | 150 | slow, lotsa hp | 9 | 15 | Water | not on ashes team |
2 | 59 | Arcanine | 90 | doggie! | 59 | 38 | Fire | in credits |
3 | 56 | Mankey | 35 | monkey! | 56 | 12 | Fighting | idk |
left_join()
: Merge Method Cont.
The merge method actually just has an argument that allows us to specify what type of join we want to do:
# outer join: keeps everything
pd.merge(df1, df2, on = ['id', 'type'], how = 'outer')
# right join: keeps everything on right, drops extras on left
pd.merge(df1, df2, on = ['id', 'type'], how = 'right')
# left join: keeps everything on left, drops extras on right
pd.merge(df1, df2, on = ['id', 'type'], how = 'left')
id | pokemon | type | hp | episode | notes | |
---|---|---|---|---|---|---|
0 | 25 | Pikachu | Electric | 35 | 1.0 | main character |
1 | 9 | Blastoise | Water | 79 | 15.0 | not on ashes team |
2 | 59 | Arcanine | Fire | 90 | 38.0 | in credits |
3 | 56 | Mankey | Fighting | 40 | 12.0 | idk |
4 | 104 | Cubone | Ground | 60 | NaN | NaN |
rbind()
: Stacking Data On Top of Eachother
Unlike in R, Python doesn't really seem to care if two dataframes have different column names. It will just let you join them and fill in the blanks with NaN
.
See below for two different methods to combine data:
pd.concat([df1,df2])
id | pokemon | type | hp | episode | notes | |
---|---|---|---|---|---|---|
0 | 25 | Pikachu | Electric | 35.0 | NaN | NaN |
1 | 9 | Blastoise | Water | 79.0 | NaN | NaN |
2 | 59 | Arcanine | Fire | 90.0 | NaN | NaN |
3 | 56 | Mankey | Fighting | 40.0 | NaN | NaN |
4 | 104 | Cubone | Ground | 60.0 | NaN | NaN |
0 | 25 | NaN | Electric | NaN | 1.0 | main character |
1 | 9 | NaN | Water | NaN | 15.0 | not on ashes team |
2 | 59 | NaN | Fire | NaN | 38.0 | in credits |
3 | 56 | NaN | Fighting | NaN | 12.0 | idk |
4 | 65 | NaN | Psychic | NaN | 104.0 | magician |
5 | 94 | NaN | Ghost | NaN | 16.0 | ghost |
df1.append(df2)
id | pokemon | type | hp | episode | notes | |
---|---|---|---|---|---|---|
0 | 25 | Pikachu | Electric | 35.0 | NaN | NaN |
1 | 9 | Blastoise | Water | 79.0 | NaN | NaN |
2 | 59 | Arcanine | Fire | 90.0 | NaN | NaN |
3 | 56 | Mankey | Fighting | 40.0 | NaN | NaN |
4 | 104 | Cubone | Ground | 60.0 | NaN | NaN |
0 | 25 | NaN | Electric | NaN | 1.0 | main character |
1 | 9 | NaN | Water | NaN | 15.0 | not on ashes team |
2 | 59 | NaN | Fire | NaN | 38.0 | in credits |
3 | 56 | NaN | Fighting | NaN | 12.0 | idk |
4 | 65 | NaN | Psychic | NaN | 104.0 | magician |
5 | 94 | NaN | Ghost | NaN | 16.0 | ghost |