Incomplete data sets, whether they are missing individual values or full rows and columns, are a common problem in data analysis. Luckily for us, Pandas has lots of tools to help us make these data sets easier to handle.
In this article, we will run through 2 of Pandas’ main methods for bundling dataFrames together – concatenating and merging.
Let’s set ourselves up with three one-row dataFrames, with stats from our recent matches.
import numpy as np
import pandas as pd
match1 = pd.DataFrame({'Opponent':['Selche FC'],
'GoalsFor':[1],
'GoalsAgainst':[1],
'Attendance':[53225]})
match2 = pd.DataFrame({'Opponent':['Sudaton FC'],
'GoalsFor':[3],
'GoalsAgainst':[0],
'Attendance':[53256]})
match3 = pd.DataFrame({'Opponent':['Ouestjambon United'],
'GoalsFor':[4],
'GoalsAgainst':[1],
'Attendance':[53225]})
match3
Concatenation
Our simplest method of joining data is to simply stick one on the ned of the other – as the concatenate method allows us to do. ‘pd.concat()’ with a list of our data frames will do the trick:
AllMatches = pd.concat([match1,match2,match3])
AllMatches
Merging
‘pd.merge()’ will allow us to stick data together left-to-right. Let’s first create more details for our matches above that we can then merge then:
match1scorers = pd.DataFrame({'First':['Sally'],
'Last':['Billy'],
'Opponent':['Selche FC']})
match2scorers = pd.DataFrame({'First':['Sally'],
'Last':['Pip'],
'Opponent':['Sudaton FC']})
match3scorers = pd.DataFrame({'First':['Sally'],
'Last':['Sally'],
'Opponent':['Ouestjambon United']})
AllScorers = pd.concat([match1scorers,match2scorers,match3scorers])
AllScorers
pd.merge(AllMatches, AllScorers, how='inner',on='Opponent')
Let’s break down ‘pd.merge()’. Anyone with any SQL experience will have a head start here, as this essentially mimicks merging in SQL.
‘pd.merge()’ takes four arguments. The first two are the dataFrames that we want to merge.
Next, we have the ‘how’ argument, which dictates the type of join that we need. ‘Inner’ is the simplest, and we use that in this example but you should read up on the other types.
Finally is ‘on’, which is the column that we build the dataFrame around. Pandas looks for this value in all merged dataFrames, then adds the other values from the matching rows.
Summary
With merge and concatenate, we have two examples of Pandas tools used to join datasets.
Concatenation allows us to stitch together two datasets with the same columns. Useful if we are adding to an already-existing dataset or glueing together two iterations of data collection. It is much quicker than opening Excel and pasting the data manually!
We have also seen an example of merge, which connects two datasets across a column that is common between the two dataFrames.
These examples are relatively simple and only serve as an introduction to joining data. For more complex cases, check out resources on pandas.join() and more in-depth examples of pandas.merge()