Joining Data

Incomplete data sets, whether they are missing individual values or full rows and columns, are a common problem in data analysis. Luckily for us, Pandas has lots of tools to help us make these data sets easier to handle.

In this article, we will run through 2 of Pandas’ main methods for bundling dataFrames together – concatenating and merging.

Let’s set ourselves up with three one-row dataFrames, with stats from our recent matches.

In [1]:
import numpy as np
import pandas as pd

match1 = pd.DataFrame({'Opponent':['Selche FC'],
                      'GoalsFor':[1],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match2 = pd.DataFrame({'Opponent':['Sudaton FC'],
                      'GoalsFor':[3],
                       'GoalsAgainst':[0],
                       'Attendance':[53256]})

match3 = pd.DataFrame({'Opponent':['Ouestjambon United'],
                      'GoalsFor':[4],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match3
Out[1]:
Attendance GoalsAgainst GoalsFor Opponent
0 53225 1 4 Ouestjambon United

Concatenation

Our simplest method of joining data is to simply stick one on the ned of the other – as the concatenate method allows us to do. ‘pd.concat()’ with a list of our data frames will do the trick:

In [2]:
AllMatches = pd.concat([match1,match2,match3])
AllMatches
Out[2]:
Attendance GoalsAgainst GoalsFor Opponent
0 53225 1 1 Selche FC
0 53256 0 3 Sudaton FC
0 53225 1 4 Ouestjambon United

Merging

‘pd.merge()’ will allow us to stick data together left-to-right. Let’s first create more details for our matches above that we can then merge then:

In [3]:
match1scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Billy'],
                            'Opponent':['Selche FC']})
                      
match2scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Pip'],
                             'Opponent':['Sudaton FC']})

match3scorers = pd.DataFrame({'First':['Sally'],
                             'Last':['Sally'],
                             'Opponent':['Ouestjambon United']})

AllScorers = pd.concat([match1scorers,match2scorers,match3scorers]) 
AllScorers
Out[3]:
First Last Opponent
0 Sally Billy Selche FC
0 Sally Pip Sudaton FC
0 Sally Sally Ouestjambon United
In [4]:
pd.merge(AllMatches, AllScorers, how='inner',on='Opponent')
Out[4]:
Attendance GoalsAgainst GoalsFor Opponent First Last
0 53225 1 1 Selche FC Sally Billy
1 53256 0 3 Sudaton FC Sally Pip
2 53225 1 4 Ouestjambon United Sally Sally

Let’s break down ‘pd.merge()’. Anyone with any SQL experience will have a head start here, as this essentially mimicks merging in SQL.

‘pd.merge()’ takes four arguments. The first two are the dataFrames that we want to merge.

Next, we have the ‘how’ argument, which dictates the type of join that we need. ‘Inner’ is the simplest, and we use that in this example but you should read up on the other types.

Finally is ‘on’, which is the column that we build the dataFrame around. Pandas looks for this value in all merged dataFrames, then adds the other values from the matching rows.

Summary

With merge and concatenate, we have two examples of Pandas tools used to join datasets.

Concatenation allows us to stitch together two datasets with the same columns. Useful if we are adding to an already-existing dataset or glueing together two iterations of data collection. It is much quicker than opening Excel and pasting the data manually!

We have also seen an example of merge, which connects two datasets across a column that is common between the two dataFrames.

These examples are relatively simple and only serve as an introduction to joining data. For more complex cases, check out resources on pandas.join() and more in-depth examples of pandas.merge()

Leave a Reply