Combining Dataframes

Pretty much any complex analysis is going to require you to combine different datasets. Here is how to do that in python.

I am going to create two dataframes based around Pokemon. Unfortunately, I am going to make up the data, so don't use this for official pokemon reporting.

R equivalents:

# first we are going to create two datasets.

import pandas as pd

# this is based on the games and 
df1 = pd.DataFrame({
    "id" : [25, 9, 59, 56, 104],
    "pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
    "type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Ground'],
    "hp" : [35, 79, 90, 40, 60]
})

# this one is missing the pokemon column 
# but has two additional rows: alakazam and gengar
# the notes refer to show notes
df2 = pd.DataFrame({
    "id": [25, 9, 59, 56, 65, 94],
    "episode": [1,15, 38, 12, 104, 16],
    "type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Psychic', 'Ghost'],
    "notes": ['main character', 'not on ashes team', 'in credits', 
              'idk', 'magician', 'ghost']

})

# same info as df1 but different column name
# the notes refer to game notes
df3 = pd.DataFrame({
    "ids" : [25, 9, 59, 56, 104],
    "pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
    "hp" : [50, 150, 90, 35, 60],
    "notes": ['fast but low hp', 'slow, lotsa hp', 'doggie!', 'monkey!', 'cool hat']

    
})

inner_join(): Merge Method

The first method we are going to use is the merge method. If we simply use pd.merge() it will perform an inner join, which as you can see will drop poor cubone since he isn't in the first dataframe.

pd.merge(df1, df2)
id pokemon type hp episode notes
0 25 Pikachu Electric 35 1 main character
1 9 Blastoise Water 79 15 not on ashes team
2 59 Arcanine Fire 90 38 in credits
3 56 Mankey Fighting 40 12 idk

It is usually good pratice to specify the column you are joining on (which sometimes will have different names).

# its good practice to specify the name you are merging on
pd.merge(df1, df2, on ='id')

# here is how to merge when the dfs have different column names
# notice how it keeps both id columns
pd.merge(df3, df2, left_on ='ids', right_on='id')
ids pokemon hp notes_x id episode type notes_y
0 25 Pikachu 50 fast but low hp 25 1 Electric main character
1 9 Blastoise 150 slow, lotsa hp 9 15 Water not on ashes team
2 59 Arcanine 90 doggie! 59 38 Fire in credits
3 56 Mankey 35 monkey! 56 12 Fighting idk

If we had tried to join df3 and df2 without specifying the column, than it would have tried to join it on notes as well and not found any matches, but because we specified that we just wanted it joined on the id columns, it appended an _x and _y to indicate which df each came from.

You can specify what you want appended as seen below:

pd.merge(df3, df2, left_on ='ids', right_on='id', suffixes = ['_left', '_right'])
ids pokemon hp notes_left id episode type notes_right
0 25 Pikachu 50 fast but low hp 25 1 Electric main character
1 9 Blastoise 150 slow, lotsa hp 9 15 Water not on ashes team
2 59 Arcanine 90 doggie! 59 38 Fire in credits
3 56 Mankey 35 monkey! 56 12 Fighting idk

left_join(): Merge Method Cont.

The merge method actually just has an argument that allows us to specify what type of join we want to do:

# outer join: keeps everything
pd.merge(df1, df2, on = ['id', 'type'], how = 'outer')

# right join: keeps everything on right, drops extras on left
pd.merge(df1, df2, on = ['id', 'type'], how = 'right')

# left join: keeps everything on left, drops extras on right
pd.merge(df1, df2, on = ['id', 'type'], how = 'left')
id pokemon type hp episode notes
0 25 Pikachu Electric 35 1.0 main character
1 9 Blastoise Water 79 15.0 not on ashes team
2 59 Arcanine Fire 90 38.0 in credits
3 56 Mankey Fighting 40 12.0 idk
4 104 Cubone Ground 60 NaN NaN

rbind(): Stacking Data On Top of Eachother

Unlike in R, Python doesn't really seem to care if two dataframes have different column names. It will just let you join them and fill in the blanks with NaN.

See below for two different methods to combine data:

pd.concat([df1,df2])
id pokemon type hp episode notes
0 25 Pikachu Electric 35.0 NaN NaN
1 9 Blastoise Water 79.0 NaN NaN
2 59 Arcanine Fire 90.0 NaN NaN
3 56 Mankey Fighting 40.0 NaN NaN
4 104 Cubone Ground 60.0 NaN NaN
0 25 NaN Electric NaN 1.0 main character
1 9 NaN Water NaN 15.0 not on ashes team
2 59 NaN Fire NaN 38.0 in credits
3 56 NaN Fighting NaN 12.0 idk
4 65 NaN Psychic NaN 104.0 magician
5 94 NaN Ghost NaN 16.0 ghost
df1.append(df2)
id pokemon type hp episode notes
0 25 Pikachu Electric 35.0 NaN NaN
1 9 Blastoise Water 79.0 NaN NaN
2 59 Arcanine Fire 90.0 NaN NaN
3 56 Mankey Fighting 40.0 NaN NaN
4 104 Cubone Ground 60.0 NaN NaN
0 25 NaN Electric NaN 1.0 main character
1 9 NaN Water NaN 15.0 not on ashes team
2 59 NaN Fire NaN 38.0 in credits
3 56 NaN Fighting NaN 12.0 idk
4 65 NaN Psychic NaN 104.0 magician
5 94 NaN Ghost NaN 16.0 ghost