Boxplots are a relatively common chart type used to show the distribution of numeric variables. The box itself displays the middle 50% of values, with a line showing the median value. The whiskers of the box show the highest and lowest values, excluding any outliers.
This article will plot the ages of each team’s players. This should allow us to compare the age profiles of teams quite easily and spot teams with young or aging squads.
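To make those parts of the box concrete, here is a minimal sketch of the numbers behind a single boxplot. The ages are made up for illustration, and it assumes the common convention (seaborn's default, `whis=1.5`) that a value counts as an outlier when it sits more than 1.5 times the interquartile range beyond the box.

```python
from statistics import quantiles

# Hypothetical squad ages, made up purely for illustration
ages = [19, 21, 22, 23, 24, 25, 26, 27, 28, 29, 38]

# Quartiles with linear interpolation (the "inclusive" method)
q1, median, q3 = quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # height of the box: the middle 50% of values

# Assumed convention: anything beyond 1.5 * IQR from the box is an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme values that are still inside the fences
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

print("Box:", q1, "to", q3, "with median", median)
print("Whiskers:", whisker_low, "to", whisker_high)
print("Outliers:", outliers)
```

With these made-up ages, the 38-year-old falls outside the upper fence, so the top whisker stops at 29 and the 38 is drawn as a separate point.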
Let’s get our modules imported along with a data frame of player information.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("../../Data/Violin.csv", encoding="ISO-8859-1")
data.head()
Here we have a dataset of Chinese Super League players. We are looking to plot the players’ ages, grouped by their team – this will give us a box for each team.
Seaborn’s ‘.boxplot()’ will make these plots very easy. We need to give it three arguments to start with:
- X – What are we grouping our data by? In this case, it is by teams.
- Y – What metric are we looking to learn about? For now, it is the players’ ages.
- Data – Where is our data kept?
Let’s plot our first set of boxes:
sns.boxplot(x="Team", y="Age", data=data)
Very nice! Loads to improve on, but a good start! We can see the middle 50% of each team’s values, the highest and lowest values and also any outliers.
Firstly, this is a bit small, so let’s use matplotlib to resize the plot area and re-plot:
#Set up our plot area, then define the size
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)
sns.boxplot(x="Team", y="Age", data=data)
Now we can see the different shapes much more easily – but we can’t see which team is which! Let’s re-plot, but rotate the x axis labels:
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)
ax = sns.boxplot(x="Team", y="Age", data=data)
plt.xticks(rotation=65)
Much better! Now we can see that Changchun have a much younger squad than Guangzhou Evergrande. It is important to remember that this doesn’t tell you which is better – but this information is valuable for squad planning, player acquisition and many other areas within a team.
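If you want the numbers behind a comparison like this, a rough sketch with pandas is below. The team names and ages here are hypothetical stand-ins for the real CSV, which is assumed to have the same "Team" and "Age" columns.

```python
import pandas as pd

# Hypothetical rows standing in for the real data; names and ages are made up
squad = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "Age": [21, 23, 24, 28, 30, 31],
})

# describe() returns count, mean, std, min, quartiles and max for each team;
# sorting on the median ("50%") puts the youngest squads first
summary = squad.groupby("Team")["Age"].describe().sort_values("50%")
print(summary[["count", "25%", "50%", "75%"]])
```

The "25%", "50%" and "75%" columns correspond to the bottom of the box, the median line and the top of the box on the plot.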
Let’s make this chart even better looking with club colours:
#Create a list of colours, in the order our teams appear on the plot
CSLcols = ["#FF0000", "#9A050A", "#112987", "#00A4FA", "#FF6600", "#008040", "#004EA1", "#5B0CB3", "#E50211", "#FF0000",
"#00519A", "#75A315", "#E70008", "#E40000", "#C80815", "#FF3300"]
#Create the palette with 'sns.color_palette()' and pass our list as an argument
CSLpalette = sns.color_palette(CSLcols)
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)
#Add an extra argument, our new palette
ax = sns.boxplot(x="Team", y="Age", data=data, palette=CSLpalette)
plt.xticks(rotation=65)
Great effort, that looks so much better! Now our viewers can easily pick out their own teams.
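A list of colours only lines up if it is in the same order as the boxes. One way to make that less fragile is sketched below: keep the colours in a dictionary keyed by team name, then build the list in plotting order. The team names, colours and fallback grey are all hypothetical, and the sketch assumes the teams are plotted in the order they first appear in the data.

```python
import pandas as pd

# Hypothetical club colours keyed by team name (not the real CSL values)
club_colours = {"Team A": "#FF0000", "Team B": "#112987"}

# Hypothetical rows standing in for the real data
squad = pd.DataFrame({
    "Team": ["Team B", "Team A", "Team B", "Team C"],
    "Age": [27, 22, 31, 25],
})

# Teams in order of first appearance in the data
team_order = list(squad["Team"].unique())

# One colour per team, in that same order; unknown teams fall back to grey
palette = [club_colours.get(team, "#999999") for team in team_order]
print(list(zip(team_order, palette)))
```

Passing both `order=team_order` and `palette=palette` to `sns.boxplot()` should then keep each team matched to its own colour, even if the rows in the file are rearranged.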
Summary
This article showed how to use boxplots to quickly compare distributions across categories in a big dataset. Seaborn’s ‘boxplot()’ command makes it easy to draw, then customise, the plots.
One shortcoming of boxplots is that we cannot see exactly how many values there are at each point – the boxes and lines are only a summary, and all sorts of patterns can be hiding in them. You might want to take a look at violin plots for a way of getting around this.
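As a small illustration of that shortcoming, the sketch below uses two made-up sets of ages that share the same minimum, quartiles, median and maximum, so their boxes would look identical, even though the values are spread quite differently.

```python
from collections import Counter
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the figures a boxplot is drawn from."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Two hypothetical squads, made up for illustration
paired = [20, 20, 22, 22, 25, 28, 28, 30, 30]  # ages bunched in pairs
spread = [20, 21, 22, 24, 25, 26, 28, 29, 30]  # ages spread evenly

print(five_number_summary(paired))
print(five_number_summary(spread))

# Counting players at each age reveals the difference the boxes hide
print(Counter(paired))
print(Counter(spread))
```

Both summaries come out the same, but the counts show one squad with players doubled up at a few ages and the other with one player at almost every age.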
Next up, take a look at other visualisation types – or learn how to scrape data so that you can look at other leagues!