This article will take you through just a few of the methods that we have to describe our dataset. Let’s get started by firing up a season-long dataset of referees and their cards given in each game last season.
import numpy as np
import pandas as pd
df = pd.read_csv("../data/Refs.csv")
#Use '.head()' to see the top rows and check out the structure
df.head()
#Let's change those shorthand column titles to something more intuitive
df.columns = ['Date','HomeTeam','AwayTeam',
'Referee','HomeFouls','AwayFouls',
'TotalFouls','HomeYellows','AwayYellows',
'TotalYellows', 'HomeReds','AwayReds','TotalReds']
df.head(2)
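Assigning a full list to df.columns depends on the names being in exactly the right order. A slightly safer alternative is to map old names to new ones with .rename(). The sketch below uses a tiny made-up frame, and the shorthand headers ('HF', 'AF' and so on) are assumptions for illustration, since the original headers in Refs.csv are not shown here.

```python
import pandas as pd

# Hypothetical shorthand headers and invented values;
# the real headers in Refs.csv may differ
toy = pd.DataFrame({"HF": [12], "AF": [9], "HY": [2], "AY": [3]})

# Map each old name to its more readable replacement
new_names = {
    "HF": "HomeFouls",
    "AF": "AwayFouls",
    "HY": "HomeYellows",
    "AY": "AwayYellows",
}

# rename() matches on the old name, so column order does not matter
toy = toy.rename(columns=new_names)

print(toy.columns.tolist())
```

Any column not listed in the mapping is left untouched, which makes a partial rename easy.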
#And do we have a complete set of 380 matches? len() will tell us.
len(df)
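The row count is only one sanity check. As a rough sketch, here are two more that could be run on the real df: counting missing values and confirming that the totals add up. The frame below is invented for illustration, and the 20-team league size is an assumption that happens to give the 380 matches mentioned above.

```python
import pandas as pd

# Invented rows for illustration only (not taken from Refs.csv)
toy = pd.DataFrame({
    "HomeYellows": [1, 2, 0],
    "AwayYellows": [2, 2, 1],
    "TotalYellows": [3, 4, 1],
})

# Assumption: 20 teams, each hosting every other team once
n_teams = 20
expected_matches = n_teams * (n_teams - 1)

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Does home + away equal the total in every row?
totals_add_up = (toy["HomeYellows"] + toy["AwayYellows"] == toy["TotalYellows"]).all()

print(expected_matches)
print(missing_per_column)
print(totals_add_up)
```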
Descriptive statistics
The easiest way to produce an en-masse summary of our dataset is with the ‘.describe()’ method.
This will give us a whole new table of statistics for each numerical column:
- Count – how many values are there?
- Mean – what is the mean average? (Sum of values/count of values)
- STD – what is the standard deviation? This number describes how widely the values spread around the average (pandas labels this row 'std'). If the data follow a normal distribution, about 68% of the values will fall within one standard deviation either side of the mean.
- Min – the smallest value in our array
- 25%/50%/75% – the percentiles: the value below which 25%/50%/75% of the data falls. The 50% value is the median.
- Max – the highest value in our array
df.describe()
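To see where each of these rows comes from, here is a minimal sketch on a five-row frame with invented numbers (not the Refs.csv data). Because .describe() returns an ordinary DataFrame indexed by statistic name, single figures can be pulled out with .loc.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "TotalYellows": [2, 3, 4, 5, 6],
    "TotalFouls": [18, 20, 22, 24, 26],
})

summary = toy.describe()

# Look up individual statistics by row label and column name
mean_yellows = summary.loc["mean", "TotalYellows"]
median_yellows = summary.loc["50%", "TotalYellows"]
std_yellows = summary.loc["std", "TotalYellows"]  # sample standard deviation

print(summary)
```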
So what do we learn? On average, there are between 3 and 4 yellows per game, and the away team picks up only slightly more cards than the home team. Fouls and red cards are also very close between the two sides.
If yellow cards were normally distributed, we would expect roughly 68% of games to have between 1.5 and 5.5 yellow cards.
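The 68% figure leans on the data being roughly normal, which card counts may not be. One way to check is to count how many games actually fall within one standard deviation of the mean. The sketch below does this on ten invented yellow-card counts; swapping in df['TotalYellows'] would run the same check on the real data.

```python
import pandas as pd

# Invented yellow-card counts for ten games (illustration only)
yellows = pd.Series([0, 2, 3, 3, 4, 4, 4, 5, 6, 9], name="TotalYellows")

mean = yellows.mean()
std = yellows.std()  # sample standard deviation, as reported by describe()

lower, upper = mean - std, mean + std

# between() is inclusive at both ends; the mean of the resulting
# booleans is the share of games inside the range
share_within = yellows.between(lower, upper).mean()

print(f"{lower:.2f} to {upper:.2f}: {share_within:.0%} of games")
```

If the observed share is a long way from 68%, the normal rule of thumb is a poor fit for that column.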
At least one game had 38 fouls. That’s roughly one every two and a half minutes!
Describing with groups
Our describe table above is great for a broad brushstroke, but it would be helpful to look at our referees individually. Let’s use .groupby() to create a dataset grouped by the ‘Referee’ column:
groupedRefs = df.groupby("Referee")
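If it is not obvious what a grouped object holds, this small sketch may help. It uses invented matches and hypothetical referee names ('Ref A' and so on) to show how to count games per referee and pull out one referee's rows.

```python
import pandas as pd

# Invented matches with hypothetical referee names (illustration only)
toy = pd.DataFrame({
    "Referee": ["Ref A", "Ref B", "Ref A", "Ref C", "Ref B", "Ref A"],
    "TotalYellows": [3, 5, 4, 2, 1, 5],
})

grouped = toy.groupby("Referee")

# Number of rows (games) in each group
games_per_ref = grouped.size()

# The rows belonging to a single referee
ref_a_games = grouped.get_group("Ref A")

print(games_per_ref)
print(ref_a_games)
```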
We can now apply some operations to check out our data by referee:
#Summary statistics for every column, by referee
groupedRefs.describe()
There is plenty going on here, so while you may want to look through everything yourself, you can also select particular columns:
#Let's analyse yellow cards
groupedRefs['TotalYellows'].describe()
So Mason has the highest average, but officiated only one game. Among the referees with the most matches, Friend averaged the most bookings, with 4.7 yellows per game. Pawson, however, had the busiest single game, with 11 yellows in one match. If you’re interested in which game it was, check below.
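Reading those rankings off the table by eye gets harder as the table grows. A rough sketch of one way to automate it: aggregate per referee, drop anyone below a minimum number of games, then sort by the mean. The data and referee names are invented, and the cut-off of 2 games is an arbitrary assumption.

```python
import pandas as pd

# Invented matches with hypothetical referee names (illustration only)
toy = pd.DataFrame({
    "Referee": ["Ref A", "Ref B", "Ref A", "Ref C", "Ref B", "Ref A"],
    "TotalYellows": [3, 5, 4, 9, 1, 5],
})

# One row per referee: games officiated, average yellows, busiest game
per_ref = toy.groupby("Referee")["TotalYellows"].agg(["count", "mean", "max"])

# Hypothetical cut-off: ignore referees with fewer than 2 games
min_games = 2
regulars = per_ref[per_ref["count"] >= min_games]

# Highest average first
ranked = regulars.sort_values("mean", ascending=False)

print(ranked)
```

Here 'Ref C' has the highest raw average but is filtered out for having a single game, which mirrors the Mason case above.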
df[df['TotalYellows']==11]
Who would have thought Burnley v Middlesbrough would’ve been a dirty game?! No festive spirit in this Boxing Day scrap, either way.
This game wasn’t even the one with the most fouls…
df[df['TotalFouls']==38]
38 fouls, 3 yellows and a red between Mazzarri’s Watford and Stoke. Probably not a classic match! (It finished 1-0 to Stoke if you’re curious).
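Both look-ups above hard-code the value we are searching for (11 and 38). As a sketch, the same rows can be found without knowing the number in advance, using .max() or .idxmax(). The frame and team names below are invented for illustration.

```python
import pandas as pd

# Invented matches with hypothetical team names (illustration only)
toy = pd.DataFrame({
    "HomeTeam": ["Team A", "Team C", "Team E"],
    "AwayTeam": ["Team B", "Team D", "Team F"],
    "TotalFouls": [21, 30, 30],
    "TotalYellows": [7, 3, 4],
})

# Every match that shares the highest foul count (keeps ties)
most_fouls = toy[toy["TotalFouls"] == toy["TotalFouls"].max()]

# idxmax() gives the index label of the first maximum only,
# so .loc returns a single row
most_yellows = toy.loc[toy["TotalYellows"].idxmax()]

print(most_fouls)
print(most_yellows)
```

The comparison against .max() is the safer choice when several games might tie for the top spot.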
Summary
In this section, we have seen how the ‘.describe()’ method makes getting summary statistics for a dataset really easy.
We were able to get results about our data in general, but then get more detailed insights by using ‘.groupby()’ to group our data by referee.
You might want to take a look at our visualisation topics to see how we can put data into charts, or see even more Pandas methods in this section.