Combining Dataframes

Pretty much any complex analysis is going to require you to combine different datasets. Here is how to do that in python.

I am going to create two dataframes based around Pokemon. Unfortunately, I am going to make up the data, so don't use this for official pokemon reporting.

R equivalents:

inner_join() = pd.merge(df1, df2, on = 'id')
left_join() = pd.merge(df1, df2, on = 'id', how = 'left')
right_join() = pd.merge(df1, df2, on = 'id', how = 'right')
outer_join() = pd.merge(df1, df2, on = 'id', how = 'outer')
rbind() = pd.concat([df1, df2])

# first we are going to create two datasets.

import pandas as pd

# this is based on the games and 
df1 = pd.DataFrame({
    "id" : [25, 9, 59, 56, 104],
    "pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
    "type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Ground'],
    "hp" : [35, 79, 90, 40, 60]
})

# this one is missing the pokemon column 
# but has two additional rows: alakazam and gengar
# the notes refer to show notes
df2 = pd.DataFrame({
    "id": [25, 9, 59, 56, 65, 94],
    "episode": [1,15, 38, 12, 104, 16],
    "type" : ['Electric', 'Water', 'Fire', 'Fighting', 'Psychic', 'Ghost'],
    "notes": ['main character', 'not on ashes team', 'in credits', 
              'idk', 'magician', 'ghost']

})

# same info as df1 but different column name
# the notes refer to game notes
df3 = pd.DataFrame({
    "ids" : [25, 9, 59, 56, 104],
    "pokemon" : ['Pikachu', 'Blastoise', 'Arcanine', 'Mankey', 'Cubone'],
    "hp" : [50, 150, 90, 35, 60],
    "notes": ['fast but low hp', 'slow, lotsa hp', 'doggie!', 'monkey!', 'cool hat']

    
})

`inner_join()`: Merge Method

The first method we are going to use is the merge method. If we simply use pd.merge() it will perform an inner join, which as you can see will drop poor cubone since he isn't in the first dataframe.

pd.merge(df1, df2)

	id	pokemon	type	hp	episode	notes
0	25	Pikachu	Electric	35	1	main character
1	9	Blastoise	Water	79	15	not on ashes team
2	59	Arcanine	Fire	90	38	in credits
3	56	Mankey	Fighting	40	12	idk

It is usually good pratice to specify the column you are joining on (which sometimes will have different names).

# its good practice to specify the name you are merging on
pd.merge(df1, df2, on ='id')

# here is how to merge when the dfs have different column names
# notice how it keeps both id columns
pd.merge(df3, df2, left_on ='ids', right_on='id')

	ids	pokemon	hp	notes_x	id	episode	type	notes_y
0	25	Pikachu	50	fast but low hp	25	1	Electric	main character
1	9	Blastoise	150	slow, lotsa hp	9	15	Water	not on ashes team
2	59	Arcanine	90	doggie!	59	38	Fire	in credits
3	56	Mankey	35	monkey!	56	12	Fighting	idk

If we had tried to join df3 and df2 without specifying the column, than it would have tried to join it on notes as well and not found any matches, but because we specified that we just wanted it joined on the id columns, it appended an _x and _y to indicate which df each came from.

You can specify what you want appended as seen below:

pd.merge(df3, df2, left_on ='ids', right_on='id', suffixes = ['_left', '_right'])

	ids	pokemon	hp	notes_left	id	episode	type	notes_right
0	25	Pikachu	50	fast but low hp	25	1	Electric	main character
1	9	Blastoise	150	slow, lotsa hp	9	15	Water	not on ashes team
2	59	Arcanine	90	doggie!	59	38	Fire	in credits
3	56	Mankey	35	monkey!	56	12	Fighting	idk

`left_join()`: Merge Method Cont.

The merge method actually just has an argument that allows us to specify what type of join we want to do:

# outer join: keeps everything
pd.merge(df1, df2, on = ['id', 'type'], how = 'outer')

# right join: keeps everything on right, drops extras on left
pd.merge(df1, df2, on = ['id', 'type'], how = 'right')

# left join: keeps everything on left, drops extras on right
pd.merge(df1, df2, on = ['id', 'type'], how = 'left')

	id	pokemon	type	hp	episode	notes
0	25	Pikachu	Electric	35	1.0	main character
1	9	Blastoise	Water	79	15.0	not on ashes team
2	59	Arcanine	Fire	90	38.0	in credits
3	56	Mankey	Fighting	40	12.0	idk
4	104	Cubone	Ground	60	NaN	NaN

`rbind()`: Stacking Data On Top of Eachother

Unlike in R, Python doesn't really seem to care if two dataframes have different column names. It will just let you join them and fill in the blanks with NaN.

See below for two different methods to combine data:

pd.concat([df1,df2])

	id	pokemon	type	hp	episode	notes
0	25	Pikachu	Electric	35.0	NaN	NaN
1	9	Blastoise	Water	79.0	NaN	NaN
2	59	Arcanine	Fire	90.0	NaN	NaN
3	56	Mankey	Fighting	40.0	NaN	NaN
4	104	Cubone	Ground	60.0	NaN	NaN
0	25	NaN	Electric	NaN	1.0	main character
1	9	NaN	Water	NaN	15.0	not on ashes team
2	59	NaN	Fire	NaN	38.0	in credits
3	56	NaN	Fighting	NaN	12.0	idk
4	65	NaN	Psychic	NaN	104.0	magician
5	94	NaN	Ghost	NaN	16.0	ghost

df1.append(df2)

	id	pokemon	type	hp	episode	notes
0	25	Pikachu	Electric	35.0	NaN	NaN
1	9	Blastoise	Water	79.0	NaN	NaN
2	59	Arcanine	Fire	90.0	NaN	NaN
3	56	Mankey	Fighting	40.0	NaN	NaN
4	104	Cubone	Ground	60.0	NaN	NaN
0	25	NaN	Electric	NaN	1.0	main character
1	9	NaN	Water	NaN	15.0	not on ashes team
2	59	NaN	Fire	NaN	38.0	in credits
3	56	NaN	Fighting	NaN	12.0	idk
4	65	NaN	Psychic	NaN	104.0	magician
5	94	NaN	Ghost	NaN	16.0	ghost

Combining Dataframes

inner_join(): Merge Method

left_join(): Merge Method Cont.

rbind(): Stacking Data On Top of Eachother

`inner_join()`: Merge Method

`left_join()`: Merge Method Cont.

`rbind()`: Stacking Data On Top of Eachother