Pandas is not only a fantastic module and community around manipulating our datasets, it also gives tools for analysing and describing our data.

This article will take you through just a few of the methods that we have to describe our dataset. Let’s get started by firing up a season-long dataset of referees and their cards given in each game last season.

In [1]:

import numpy as np
import pandas as pd

df = pd.read_csv("../data/Refs.csv")

In [2]:

#Use '.head()' to see the top rows and check out the structure

df.head()

Out[2]:

	Date	HomeTeam	AwayTeam	Referee	HF	AF	TF	HY	AY	TY
0	13/08/2016	Burnley	Swansea	J Moss	10	14	24	3	2	5
1	13/08/2016	Crystal Palace	West Brom	C Pawson	12	15	27	2	2	4
2	13/08/2016	Everton	Tottenham	M Atkinson	10	14	24	0	0	0
3	13/08/2016	Hull	Leicester	M Dean	8	17	25	2	2	4
4	13/08/2016	Man City	Sunderland	R Madley	11	14	25	1	2	3

In [3]:

#Let's change those shorthand column titles to something more intuitive

df.columns = ['Date','HomeTeam','AwayTeam',
                 'Referee','HomeFouls','AwayFouls',
                 'TotalFouls','HomeYellows','AwayYellows',
                'TotalYellows', 'HomeReds','AwayReds','TotalReds']

df.head(2)

Out[3]:

	Date	HomeTeam	AwayTeam	Referee	HomeFouls	AwayFouls	TotalFouls	HomeYellows	AwayYellows	TotalYellows	HomeReds	AwayReds	TotalReds
0	13/08/2016	Burnley	Swansea	J Moss	10	14	24	3	2	5	0	0	0
1	13/08/2016	Crystal Palace	West Brom	C Pawson	12	15	27	2	2	4	0	0	0

In [4]:

#And do we have a complete set of 380 matches? len() will tell us.

len(df)

Out[4]:

Descriptive statistics

The easiest way to produce an en-masse summary of our dataset is with the ‘.describe()’ method.

This will give us a whole new table of statistics for each numerical column:

Count – how many values are there?
Mean – what is the mean average? (Sum of values/count of values)
STD – what is the standard deviation? This number describes how widely the group differs around the average. If we have a normal distribution, 68% of our values will be within one STD either side of the average.
Min – the smallest value in our array
25%/50%/75% – what value accounts for 25%/50%/75% of the data?
Max – the highest value in our array

In [5]:

df.describe()

Out[5]:

	HomeFouls	AwayFouls	TotalFouls	HomeYellows	AwayYellows	TotalYellows	HomeReds	AwayReds	TotalReds
count	380.000000	380.000000	380.000000	380.000000	380.000000	380.000000	380.000000	380.000000	380.000000
mean	11.063158	11.657895	22.721053	1.744737	1.886842	3.631579	0.055263	0.052632	0.107895
std	3.468853	3.556378	5.176166	1.256371	1.301647	1.969705	0.240050	0.223591	0.319037
min	2.000000	3.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	9.000000	9.000000	19.000000	1.000000	1.000000	2.000000	0.000000	0.000000	0.000000
50%	11.000000	12.000000	23.000000	2.000000	2.000000	4.000000	0.000000	0.000000	0.000000
75%	13.000000	14.000000	25.000000	3.000000	3.000000	5.000000	0.000000	0.000000	0.000000
max	24.000000	24.000000	38.000000	6.000000	6.000000	11.000000	2.000000	1.000000	2.000000

So what do we learn? On average, we have between 3 and 4 yellows in a game and that the away team are only slightly more likely to get more cards. Fouls and red cards are also very close between both teams.

In 68% of games, we expect between 1.5 and 5.5 yellow cards.

At least one game had 38 fouls. That’s roughly one every two and a half minutes!

Describing with groups

Our describe table above is great for a broad brushstroke, but it would be helpful to look at our referees individually. Let’s use .groupby() to create a dataset grouped by the ‘Referee’ column

In [6]:

groupedRefs = df.groupby("Referee")

We can now apply some operations to check out our data by referee:

In [7]:

#All averavges

groupedRefs.describe()

Out[7]:

	AwayFouls								AwayReds		…	TotalReds		TotalYellows
	count	mean	std	min	25%	50%	75%	max	count	mean	…	75%	max	count	mean	std	min	25%	50%	75%	max
Referee
A Marriner	28.0	9.857143	2.977873	5.0	8.00	10.0	12.00	15.0	28.0	0.071429	…	0.00	1.0	28.0	3.464286	2.081348	0.0	1.75	4.0	5.00	7.0
A Taylor	30.0	12.333333	3.066317	8.0	10.00	12.0	14.75	20.0	30.0	0.000000	…	0.00	2.0	30.0	3.966667	1.973677	0.0	3.00	4.0	5.00	8.0
C Kavanagh	1.0	12.000000	NaN	12.0	12.00	12.0	12.00	12.0	1.0	0.000000	…	0.00	0.0	1.0	4.000000	NaN	4.0	4.00	4.0	4.00	4.0
C Pawson	24.0	12.833333	4.380457	3.0	9.75	12.0	15.00	21.0	24.0	0.125000	…	0.00	1.0	24.0	3.916667	2.339020	0.0	2.75	3.5	5.00	11.0
G Scott	8.0	11.750000	3.011881	8.0	8.75	12.5	14.25	15.0	8.0	0.250000	…	0.25	1.0	8.0	2.625000	1.187735	1.0	2.00	2.5	3.00	5.0
J Moss	30.0	11.733333	3.805018	5.0	9.00	11.0	14.00	23.0	30.0	0.000000	…	0.00	1.0	30.0	3.966667	2.008316	1.0	2.25	4.0	5.00	9.0
K Friend	20.0	12.750000	4.050666	7.0	10.75	11.5	14.25	24.0	20.0	0.050000	…	0.00	1.0	20.0	4.700000	2.154555	1.0	3.00	4.5	6.00	9.0
L Mason	19.0	10.842105	3.041622	5.0	8.50	11.0	13.00	16.0	19.0	0.052632	…	0.00	1.0	19.0	3.526316	2.412201	0.0	2.00	3.0	5.50	8.0
L Probert	4.0	8.000000	3.559026	3.0	6.75	9.0	10.25	11.0	4.0	0.000000	…	0.00	0.0	4.0	3.000000	2.449490	1.0	1.00	2.5	4.50	6.0
M Atkinson	27.0	11.851852	3.404790	4.0	9.00	13.0	14.00	18.0	27.0	0.111111	…	0.00	1.0	27.0	3.222222	2.241794	0.0	1.50	3.0	4.50	8.0
M Clattenburg	22.0	9.954545	3.415333	4.0	8.00	9.5	11.75	17.0	22.0	0.045455	…	0.00	1.0	22.0	3.272727	1.608958	0.0	2.00	3.0	4.00	6.0
M Dean	30.0	12.100000	2.733572	7.0	10.00	12.0	13.00	17.0	30.0	0.100000	…	0.00	1.0	30.0	3.666667	1.768173	1.0	2.00	3.0	5.00	7.0
M Jones	18.0	12.333333	3.447079	4.0	9.75	13.0	14.75	17.0	18.0	0.000000	…	0.00	1.0	18.0	3.388889	2.173067	0.0	2.00	4.0	4.75	8.0
M Oliver	33.0	10.787879	3.434727	5.0	8.00	10.0	14.00	19.0	33.0	0.030303	…	0.00	1.0	33.0	3.333333	1.947220	0.0	2.00	3.0	5.00	7.0
N Swarbrick	24.0	11.708333	3.263489	6.0	9.75	11.0	14.00	18.0	24.0	0.041667	…	0.00	1.0	24.0	3.708333	1.398109	1.0	3.00	4.0	5.00	6.0
P Tierney	10.0	12.200000	4.211096	3.0	11.00	14.0	14.75	16.0	10.0	0.100000	…	0.00	1.0	10.0	2.900000	1.286684	1.0	2.00	3.0	3.00	5.0
R East	11.0	12.090909	2.879394	10.0	10.00	11.0	13.00	19.0	11.0	0.090909	…	0.00	1.0	11.0	3.272727	1.902152	0.0	2.00	3.0	5.00	6.0
R Madley	30.0	13.433333	3.720246	4.0	11.25	14.0	15.75	20.0	30.0	0.000000	…	0.00	1.0	30.0	4.133333	1.814374	1.0	3.00	4.0	5.75	8.0
S Attwell	10.0	9.600000	3.835507	5.0	6.25	9.5	12.75	16.0	10.0	0.000000	…	0.00	0.0	10.0	3.000000	1.943651	0.0	2.00	2.5	3.75	7.0
l Mason	1.0	10.000000	NaN	10.0	10.00	10.0	10.00	10.0	1.0	0.000000	…	0.00	0.0	1.0	5.000000	NaN	5.0	5.00	5.0	5.00	5.0

20 rows Ã— 72 columns

There is plenty going on here, so while you may want to look through everything yourself, you can also select particular columns:

In [8]:

#Let's analyse yellow cards

groupedRefs.describe()['TotalYellows']

Out[8]:

	count	mean	std	min	25%	50%	75%	max
Referee
A Marriner	28.0	3.464286	2.081348	0.0	1.75	4.0	5.00	7.0
A Taylor	30.0	3.966667	1.973677	0.0	3.00	4.0	5.00	8.0
C Kavanagh	1.0	4.000000	NaN	4.0	4.00	4.0	4.00	4.0
C Pawson	24.0	3.916667	2.339020	0.0	2.75	3.5	5.00	11.0
G Scott	8.0	2.625000	1.187735	1.0	2.00	2.5	3.00	5.0
J Moss	30.0	3.966667	2.008316	1.0	2.25	4.0	5.00	9.0
K Friend	20.0	4.700000	2.154555	1.0	3.00	4.5	6.00	9.0
L Mason	19.0	3.526316	2.412201	0.0	2.00	3.0	5.50	8.0
L Probert	4.0	3.000000	2.449490	1.0	1.00	2.5	4.50	6.0
M Atkinson	27.0	3.222222	2.241794	0.0	1.50	3.0	4.50	8.0
M Clattenburg	22.0	3.272727	1.608958	0.0	2.00	3.0	4.00	6.0
M Dean	30.0	3.666667	1.768173	1.0	2.00	3.0	5.00	7.0
M Jones	18.0	3.388889	2.173067	0.0	2.00	4.0	4.75	8.0
M Oliver	33.0	3.333333	1.947220	0.0	2.00	3.0	5.00	7.0
N Swarbrick	24.0	3.708333	1.398109	1.0	3.00	4.0	5.00	6.0
P Tierney	10.0	2.900000	1.286684	1.0	2.00	3.0	3.00	5.0
R East	11.0	3.272727	1.902152	0.0	2.00	3.0	5.00	6.0
R Madley	30.0	4.133333	1.814374	1.0	3.00	4.0	5.75	8.0
S Attwell	10.0	3.000000	1.943651	0.0	2.00	2.5	3.75	7.0
l Mason	1.0	5.000000	NaN	5.0	5.00	5.0	5.00	5.0

So Mason gives the highest on average, but only officiates one game. Of our most utilised officials, Friend is the most likely to have given a booking, with 4.7 per game. Pawson, however, had the busiest game, with 11 yellows in a single match. If you’re interested in which game it was, check below.

In [9]:

df[df['TotalYellows']==11]

Out[9]:

	Date	HomeTeam	AwayTeam	Referee	HomeFouls	AwayFouls	TotalFouls	HomeYellows	AwayYellows	TotalYellows	HomeReds	AwayReds	TotalReds
171	26/12/2016	Burnley	Middlesbrough	C Pawson	17	18	35	6	5	11	0	0	0

Who would have thought Burnley v Middlesbrough would’ve been a dirty game?! No festive spirit in this Boxing Day scrap, either way.

This game wasn’t even the one with the most fouls…

In [10]:

df[df['TotalFouls']==38]

Out[10]:

	Date	HomeTeam	AwayTeam	Referee	HomeFouls	AwayFouls	TotalFouls	HomeYellows	AwayYellows	TotalYellows	HomeReds	AwayReds	TotalReds
129	27/11/2016	Watford	Stoke	R Madley	24	14	38	3	0	3	1	0	1

38 fouls, 3 yellows and a red between Mazzarri’s Watford and Stoke. Probably not a classic match! (It finished 1-0 to Stoke if you’re curious).

Summary

In this section, we have seen how using the ‘.describe()’ function makes getting summary statistics for a dataset really easy.

We were able to get results about our data in general, but then get more detailed insights by using ‘.groupby()’ to group our data by referee.

You might want to take a look at our visualisation topics to see how we can put data into charts, or see even more Pandas methods in this section.