Incomplete data sets, whether they are missing individual values or full rows and columns, are a common problem in data analysis. Luckily for us, Pandas has lots of tools to help us make these data sets easier to handle.

In this article, we will run through 2 of Pandas’ main methods for bundling dataFrames together – concatenating and merging.

Let’s set ourselves up with three one-row dataFrames, with stats from our recent matches.

In [1]:

import numpy as np
import pandas as pd

match1 = pd.DataFrame({'Opponent':['Selche FC'],
                      'GoalsFor':[1],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match2 = pd.DataFrame({'Opponent':['Sudaton FC'],
                      'GoalsFor':[3],
                       'GoalsAgainst':[0],
                       'Attendance':[53256]})

match3 = pd.DataFrame({'Opponent':['Ouestjambon United'],
                      'GoalsFor':[4],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match3

Out[1]:

	Attendance	GoalsAgainst	GoalsFor	Opponent
0	53225	1	4	Ouestjambon United

Concatenation

Our simplest method of joining data is to simply stick one on the ned of the other – as the concatenate method allows us to do. ‘pd.concat()’ with a list of our data frames will do the trick:

In [2]:

AllMatches = pd.concat([match1,match2,match3])
AllMatches

Out[2]:

Attendance	GoalsAgainst	GoalsFor	Opponent
53225	1	1	Selche FC
53256	0	3	Sudaton FC
53225	1	4	Ouestjambon United

Merging

‘pd.merge()’ will allow us to stick data together left-to-right. Let’s first create more details for our matches above that we can then merge then:

In [3]:

match1scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Billy'],
                            'Opponent':['Selche FC']})
                      
match2scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Pip'],
                             'Opponent':['Sudaton FC']})

match3scorers = pd.DataFrame({'First':['Sally'],
                             'Last':['Sally'],
                             'Opponent':['Ouestjambon United']})

AllScorers = pd.concat([match1scorers,match2scorers,match3scorers]) 
AllScorers

Out[3]:

First	Last	Opponent
Sally	Billy	Selche FC
Sally	Pip	Sudaton FC
Sally	Sally	Ouestjambon United

In [4]:

pd.merge(AllMatches, AllScorers, how='inner',on='Opponent')

Out[4]:

	Attendance	GoalsAgainst	GoalsFor	Opponent	First	Last
0	53225	1	1	Selche FC	Sally	Billy
1	53256	0	3	Sudaton FC	Sally	Pip
2	53225	1	4	Ouestjambon United	Sally	Sally

Let’s break down ‘pd.merge()’. Anyone with any SQL experience will have a head start here, as this essentially mimicks merging in SQL.

‘pd.merge()’ takes four arguments. The first two are the dataFrames that we want to merge.

Next, we have the ‘how’ argument, which dictates the type of join that we need. ‘Inner’ is the simplest, and we use that in this example but you should read up on the other types.

Finally is ‘on’, which is the column that we build the dataFrame around. Pandas looks for this value in all merged dataFrames, then adds the other values from the matching rows.

Summary

With merge and concatenate, we have two examples of Pandas tools used to join datasets.

Concatenation allows us to stitch together two datasets with the same columns. Useful if we are adding to an already-existing dataset or glueing together two iterations of data collection. It is much quicker than opening Excel and pasting the data manually!

We have also seen an example of merge, which connects two datasets across a column that is common between the two dataFrames.

These examples are relatively simple and only serve as an introduction to joining data. For more complex cases, check out resources on pandas.join() and more in-depth examples of pandas.merge()

Pandas