Grouping Data

With DataFrames giving us the opportunity to store huge grids of data, we will sometimes want to group particular parts of our data set for analysis. Pandas’ ‘groupby’ method gives us an easy way to do so.

Let’s put together a DataFrame of a team’s recent results with a dictionary:

In [1]:
import pandas as pd

#data is in a dictionary, each entry will be a column
#The first part of the entry is the column name, the second the values
data = {'Opponent':
        ["Atletico Jave","Newtown FC", 
         "Bunton Town", "Fentborough Dynamo"],
       'Location':
        ["Home","Away","Away","Home"],
        'GoalsFor':
        [2,4,3,0],
        'GoalsAgainst':
        [4,0,2,2]}

Matches = pd.DataFrame(data)
Matches
Out[1]:
GoalsAgainst GoalsFor Location Opponent
0 4 2 Home Atletico Jave
1 0 4 Away Newtown FC
2 2 3 Away Bunton Town
3 2 0 Home Fentborough Dynamo

An obvious way to group this data is by home and away matches.

Let’s use the ‘.groupby()’ method to do so. We just have to provide the column that we want to group by. In this case, location.

We’ll assign that to a variable, then call ‘.mean()’ to find the average.

In [2]:
HAMatches = Matches.groupby('Location')
HAMatches.mean()
Out[2]:
GoalsAgainst GoalsFor
Location
Away 1.0 3.5
Home 3.0 1.0

Or cut out the variable and chain the ‘.mean()’ onto the end. Or chain another method:

In [3]:
Matches.groupby('Location').mean()
Out[3]:
GoalsAgainst GoalsFor
Location
Away 1.0 3.5
Home 3.0 1.0
In [4]:
#Describes the dataset for each variable within - this is awesome!
Matches.groupby('Location').describe()
Out[4]:
GoalsAgainst GoalsFor
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Location
Away 2.0 1.0 1.414214 0.0 0.5 1.0 1.5 2.0 2.0 3.5 0.707107 3.0 3.25 3.5 3.75 4.0
Home 2.0 3.0 1.414214 2.0 2.5 3.0 3.5 4.0 2.0 1.0 1.414214 0.0 0.50 1.0 1.50 2.0
In [5]:
#Let's step up the chaining...
#'Groupby' location, then describe it to me...
#Then 'transpose' it (flip it onto its side)...
#Finally, just give me 'Away' data
Matches.groupby('Location').describe().transpose()['Away']
Out[5]:
GoalsAgainst  count    2.000000
              mean     1.000000
              std      1.414214
              min      0.000000
              25%      0.500000
              50%      1.000000
              75%      1.500000
              max      2.000000
GoalsFor      count    2.000000
              mean     3.500000
              std      0.707107
              min      3.000000
              25%      3.250000
              50%      3.500000
              75%      3.750000
              max      4.000000
Name: Away, dtype: float64
In [6]:
print("All that work done in just " + 
      str(len("Matches.groupby('Location').describe().transpose()['Away']"))
     + " characters!")
All that work done in just 58 characters!

Summary

It is staggering how easily we can not only group data, but to also use Pandas to get some insight into our data. Really good job following this far.

We learned how to use ‘groupby()’ to group by location – home or away. We then used methods to describe our data, find averages and even to change the shape of our data frames. Really impressive stuff!

Next up, you might want to take a look at how we can join dataFrames together, how to deal with missing values or how to use even more operations.

Leave a Reply