Describing Datasets

Pandas is not only a fantastic module and community around manipulating our datasets, it also gives tools for analysing and describing our data.

This article will take you through just a few of the methods that we have to describe our dataset. Let’s get started by firing up a season-long dataset of referees and their cards given in each game last season.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("../data/Refs.csv")
In [2]:
#Use '.head()' to see the top rows and check out the structure

df.head()
Out[2]:
Date HomeTeam AwayTeam Referee HF AF TF HY AY TY HR AR TR
0 13/08/2016 Burnley Swansea J Moss 10 14 24 3 2 5 0 0 0
1 13/08/2016 Crystal Palace West Brom C Pawson 12 15 27 2 2 4 0 0 0
2 13/08/2016 Everton Tottenham M Atkinson 10 14 24 0 0 0 0 0 0
3 13/08/2016 Hull Leicester M Dean 8 17 25 2 2 4 0 0 0
4 13/08/2016 Man City Sunderland R Madley 11 14 25 1 2 3 0 0 0
In [3]:
#Let's change those shorthand column titles to something more intuitive

df.columns = ['Date','HomeTeam','AwayTeam',
                 'Referee','HomeFouls','AwayFouls',
                 'TotalFouls','HomeYellows','AwayYellows',
                'TotalYellows', 'HomeReds','AwayReds','TotalReds']

df.head(2)
Out[3]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
0 13/08/2016 Burnley Swansea J Moss 10 14 24 3 2 5 0 0 0
1 13/08/2016 Crystal Palace West Brom C Pawson 12 15 27 2 2 4 0 0 0
In [4]:
#And do we have a complete set of 380 matches? len() will tell us.

len(df)
Out[4]:
380

Descriptive statistics

The easiest way to produce an en-masse summary of our dataset is with the ‘.describe()’ method.

This will give us a whole new table of statistics for each numerical column:

  • Count – how many values are there?
  • Mean – what is the mean average? (Sum of values/count of values)
  • STD – what is the standard deviation? This number describes how widely the group differs around the average. If we have a normal distribution, 68% of our values will be within one STD either side of the average.
  • Min – the smallest value in our array
  • 25%/50%/75% – what value accounts for 25%/50%/75% of the data?
  • Max – the highest value in our array
In [5]:
df.describe()
Out[5]:
HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
count 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000
mean 11.063158 11.657895 22.721053 1.744737 1.886842 3.631579 0.055263 0.052632 0.107895
std 3.468853 3.556378 5.176166 1.256371 1.301647 1.969705 0.240050 0.223591 0.319037
min 2.000000 3.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9.000000 9.000000 19.000000 1.000000 1.000000 2.000000 0.000000 0.000000 0.000000
50% 11.000000 12.000000 23.000000 2.000000 2.000000 4.000000 0.000000 0.000000 0.000000
75% 13.000000 14.000000 25.000000 3.000000 3.000000 5.000000 0.000000 0.000000 0.000000
max 24.000000 24.000000 38.000000 6.000000 6.000000 11.000000 2.000000 1.000000 2.000000

So what do we learn? On average, we have between 3 and 4 yellows in a game and that the away team are only slightly more likely to get more cards. Fouls and red cards are also very close between both teams.

In 68% of games, we expect between 1.5 and 5.5 yellow cards.

At least one game had 38 fouls. That’s roughly one every two and a half minutes!

Describing with groups

Our describe table above is great for a broad brushstroke, but it would be helpful to look at our referees individually. Let’s use .groupby() to create a dataset grouped by the ‘Referee’ column

In [6]:
groupedRefs = df.groupby("Referee")

We can now apply some operations to check out our data by referee:

In [7]:
#All averavges

groupedRefs.describe()
Out[7]:
AwayFouls AwayReds TotalReds TotalYellows
count mean std min 25% 50% 75% max count mean 75% max count mean std min 25% 50% 75% max
Referee
A Marriner 28.0 9.857143 2.977873 5.0 8.00 10.0 12.00 15.0 28.0 0.071429 0.00 1.0 28.0 3.464286 2.081348 0.0 1.75 4.0 5.00 7.0
A Taylor 30.0 12.333333 3.066317 8.0 10.00 12.0 14.75 20.0 30.0 0.000000 0.00 2.0 30.0 3.966667 1.973677 0.0 3.00 4.0 5.00 8.0
C Kavanagh 1.0 12.000000 NaN 12.0 12.00 12.0 12.00 12.0 1.0 0.000000 0.00 0.0 1.0 4.000000 NaN 4.0 4.00 4.0 4.00 4.0
C Pawson 24.0 12.833333 4.380457 3.0 9.75 12.0 15.00 21.0 24.0 0.125000 0.00 1.0 24.0 3.916667 2.339020 0.0 2.75 3.5 5.00 11.0
G Scott 8.0 11.750000 3.011881 8.0 8.75 12.5 14.25 15.0 8.0 0.250000 0.25 1.0 8.0 2.625000 1.187735 1.0 2.00 2.5 3.00 5.0
J Moss 30.0 11.733333 3.805018 5.0 9.00 11.0 14.00 23.0 30.0 0.000000 0.00 1.0 30.0 3.966667 2.008316 1.0 2.25 4.0 5.00 9.0
K Friend 20.0 12.750000 4.050666 7.0 10.75 11.5 14.25 24.0 20.0 0.050000 0.00 1.0 20.0 4.700000 2.154555 1.0 3.00 4.5 6.00 9.0
L Mason 19.0 10.842105 3.041622 5.0 8.50 11.0 13.00 16.0 19.0 0.052632 0.00 1.0 19.0 3.526316 2.412201 0.0 2.00 3.0 5.50 8.0
L Probert 4.0 8.000000 3.559026 3.0 6.75 9.0 10.25 11.0 4.0 0.000000 0.00 0.0 4.0 3.000000 2.449490 1.0 1.00 2.5 4.50 6.0
M Atkinson 27.0 11.851852 3.404790 4.0 9.00 13.0 14.00 18.0 27.0 0.111111 0.00 1.0 27.0 3.222222 2.241794 0.0 1.50 3.0 4.50 8.0
M Clattenburg 22.0 9.954545 3.415333 4.0 8.00 9.5 11.75 17.0 22.0 0.045455 0.00 1.0 22.0 3.272727 1.608958 0.0 2.00 3.0 4.00 6.0
M Dean 30.0 12.100000 2.733572 7.0 10.00 12.0 13.00 17.0 30.0 0.100000 0.00 1.0 30.0 3.666667 1.768173 1.0 2.00 3.0 5.00 7.0
M Jones 18.0 12.333333 3.447079 4.0 9.75 13.0 14.75 17.0 18.0 0.000000 0.00 1.0 18.0 3.388889 2.173067 0.0 2.00 4.0 4.75 8.0
M Oliver 33.0 10.787879 3.434727 5.0 8.00 10.0 14.00 19.0 33.0 0.030303 0.00 1.0 33.0 3.333333 1.947220 0.0 2.00 3.0 5.00 7.0
N Swarbrick 24.0 11.708333 3.263489 6.0 9.75 11.0 14.00 18.0 24.0 0.041667 0.00 1.0 24.0 3.708333 1.398109 1.0 3.00 4.0 5.00 6.0
P Tierney 10.0 12.200000 4.211096 3.0 11.00 14.0 14.75 16.0 10.0 0.100000 0.00 1.0 10.0 2.900000 1.286684 1.0 2.00 3.0 3.00 5.0
R East 11.0 12.090909 2.879394 10.0 10.00 11.0 13.00 19.0 11.0 0.090909 0.00 1.0 11.0 3.272727 1.902152 0.0 2.00 3.0 5.00 6.0
R Madley 30.0 13.433333 3.720246 4.0 11.25 14.0 15.75 20.0 30.0 0.000000 0.00 1.0 30.0 4.133333 1.814374 1.0 3.00 4.0 5.75 8.0
S Attwell 10.0 9.600000 3.835507 5.0 6.25 9.5 12.75 16.0 10.0 0.000000 0.00 0.0 10.0 3.000000 1.943651 0.0 2.00 2.5 3.75 7.0
l Mason 1.0 10.000000 NaN 10.0 10.00 10.0 10.00 10.0 1.0 0.000000 0.00 0.0 1.0 5.000000 NaN 5.0 5.00 5.0 5.00 5.0

20 rows × 72 columns

There is plenty going on here, so while you may want to look through everything yourself, you can also select particular columns:

In [8]:
#Let's analyse yellow cards

groupedRefs.describe()['TotalYellows']
Out[8]:
count mean std min 25% 50% 75% max
Referee
A Marriner 28.0 3.464286 2.081348 0.0 1.75 4.0 5.00 7.0
A Taylor 30.0 3.966667 1.973677 0.0 3.00 4.0 5.00 8.0
C Kavanagh 1.0 4.000000 NaN 4.0 4.00 4.0 4.00 4.0
C Pawson 24.0 3.916667 2.339020 0.0 2.75 3.5 5.00 11.0
G Scott 8.0 2.625000 1.187735 1.0 2.00 2.5 3.00 5.0
J Moss 30.0 3.966667 2.008316 1.0 2.25 4.0 5.00 9.0
K Friend 20.0 4.700000 2.154555 1.0 3.00 4.5 6.00 9.0
L Mason 19.0 3.526316 2.412201 0.0 2.00 3.0 5.50 8.0
L Probert 4.0 3.000000 2.449490 1.0 1.00 2.5 4.50 6.0
M Atkinson 27.0 3.222222 2.241794 0.0 1.50 3.0 4.50 8.0
M Clattenburg 22.0 3.272727 1.608958 0.0 2.00 3.0 4.00 6.0
M Dean 30.0 3.666667 1.768173 1.0 2.00 3.0 5.00 7.0
M Jones 18.0 3.388889 2.173067 0.0 2.00 4.0 4.75 8.0
M Oliver 33.0 3.333333 1.947220 0.0 2.00 3.0 5.00 7.0
N Swarbrick 24.0 3.708333 1.398109 1.0 3.00 4.0 5.00 6.0
P Tierney 10.0 2.900000 1.286684 1.0 2.00 3.0 3.00 5.0
R East 11.0 3.272727 1.902152 0.0 2.00 3.0 5.00 6.0
R Madley 30.0 4.133333 1.814374 1.0 3.00 4.0 5.75 8.0
S Attwell 10.0 3.000000 1.943651 0.0 2.00 2.5 3.75 7.0
l Mason 1.0 5.000000 NaN 5.0 5.00 5.0 5.00 5.0

So Mason gives the highest on average, but only officiates one game. Of our most utilised officials, Friend is the most likely to have given a booking, with 4.7 per game. Pawson, however, had the busiest game, with 11 yellows in a single match. If you’re interested in which game it was, check below.

In [9]:
df[df['TotalYellows']==11]
Out[9]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
171 26/12/2016 Burnley Middlesbrough C Pawson 17 18 35 6 5 11 0 0 0

Who would have thought Burnley v Middlesbrough would’ve been a dirty game?! No festive spirit in this Boxing Day scrap, either way.

This game wasn’t even the one with the most fouls…

In [10]:
df[df['TotalFouls']==38]
Out[10]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
129 27/11/2016 Watford Stoke R Madley 24 14 38 3 0 3 1 0 1

38 fouls, 3 yellows and a red between Mazzarri’s Watford and Stoke. Probably not a classic match! (It finished 1-0 to Stoke if you’re curious).

Summary

In this section, we have seen how using the ‘.describe()’ function makes getting summary statistics for a dataset really easy.

We were able to get results about our data in general, but then get more detailed insights by using ‘.groupby()’ to group our data by referee.

You might want to take a look at our visualisation topics to see how we can put data into charts, or see even more Pandas methods in this section.