
Boxplots in Seaborn

Boxplots are a common chart type used to show the distribution of a numeric variable. The box itself displays the middle 50% of values, with a line marking the median. The whiskers show the highest and lowest values, excluding any outliers, which are drawn as individual points.
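To make the anatomy of a box concrete, here is a minimal sketch that computes the same numbers by hand using only the standard library. It assumes the common 1.5 × IQR rule for the whiskers, which is the default `whis` setting in Seaborn; the function name `box_stats` and the sample ages are made up for illustration.

```python
import statistics


def box_stats(values, whis=1.5):
    """Return the numbers a boxplot draws for one group of values."""
    data = sorted(values)
    # Quartiles, using linear interpolation between neighbouring data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Anything beyond these fences is treated as an outlier
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical squad ages, made up for illustration
ages = [19, 20, 22, 24, 25, 26, 27, 27, 28, 30, 38]
stats = box_stats(ages)
print(stats)
```

For this squad the box runs from 23 to 27.5 with the median at 26, the whiskers stop at 19 and 30, and the 38-year-old is flagged as an outlier.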

This article will plot the ages of each team’s players. This should allow us to compare the age profiles of teams quite easily and spot teams with young or aging squads.

Let’s get our modules imported along with a data frame of player information.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


data = pd.read_csv("../../Data/Violin.csv", encoding = "ISO-8859-1")

data.head()
Out[1]:
   Number  Player                         Born          Age  Market value  Team
0       1  Junling Yan       Keeper       Jan 28, 1991   26  £510k         Shanghai SIPG
1      22  Le Sun            Keeper       Sep 17, 1989   27  £43k          Shanghai SIPG
2      34  Wei Chen          Keeper       Feb 14, 1998   19  £21k          Shanghai SIPG
3      35  Xiaodong Shi      Keeper       Feb 26, 1997   20  £21k          Shanghai SIPG
4      16  Ricardo Carvalho  Centre-Back  May 18, 1978   38  £340k         Shanghai SIPG

Here we have a dataset of Chinese Super League players. We want to plot the players’ ages, grouped by team, which will give us a box for each team.
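Before plotting, it can help to check the grouping numerically. The sketch below uses a small made-up data frame in the same long format (one row per player); the team names and ages are hypothetical, and the ages are deliberately stored as text to show the conversion step.

```python
import pandas as pd

# Hypothetical rows in the same long format as the article's data frame
players = pd.DataFrame({
    "Team": ["Team A", "Team A", "Team A", "Team B", "Team B", "Team B"],
    "Age": ["26", "27", "19", "31", "29", "33"],  # ages read in as text
})

# Make sure the ages are numeric; anything unparseable becomes NaN
players["Age"] = pd.to_numeric(players["Age"], errors="coerce")

# One row per team: squad size plus the youngest, median and oldest age
summary = players.groupby("Team")["Age"].agg(["count", "min", "median", "max"])
print(summary)
```

If a team shows a very low count or unexpected ages here, its box is likely to be misleading.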

Seaborn’s ‘.boxplot()’ will make these plots very easy. We need to give it three arguments to start with:

  • X – What are we grouping our data by? In this case, it is by teams.
  • Y – What metric are we looking to learn about? For now, it is the players’ ages.
  • Data – Where is our data kept?

Let’s plot our first set of boxes:

In [2]:
sns.boxplot(x="Team", y="Age", data=data)
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eaba980f0>

Very nice! Loads to improve on, but a good start! We can see the middle 50% of each team’s values, the highest and lowest values and also any outliers.

Firstly, this is a bit small, so let’s use matplotlib to resize the plot area and re-plot:

In [3]:
#Set up our plot area, then define the size
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

sns.boxplot(x="Team", y="Age", data=data)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eac0a7c88>

Now we can see the different shapes much more easily, but we can’t see which team is which! Let’s re-plot, but rotate the x axis labels:

In [4]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

ax = sns.boxplot(x="Team", y="Age", data=data)
plt.xticks(rotation=65)
Out[4]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

Much better! Now we can see that Changchun have a much younger squad than Guangzhou Evergrande. It is important to remember that this doesn’t tell you which is better – but this information is valuable for squad planning, player acquisition and many other areas within a team.
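Comparisons like this are easier when the boxes are sorted. Here is a small sketch, on made-up data, of working out the team order by median age with pandas; the team names and ages are hypothetical.

```python
import pandas as pd

# Hypothetical squads, made up for illustration
players = pd.DataFrame({
    "Team": ["Team A"] * 4 + ["Team B"] * 4 + ["Team C"] * 4,
    "Age": [29, 31, 33, 30, 21, 23, 22, 26, 25, 27, 28, 24],
})

# Median age per team, youngest squad first
median_age = players.groupby("Team")["Age"].median().sort_values()
team_order = median_age.index.tolist()

print(median_age)
print(team_order)
```

Passing the result as `order=team_order` in the `sns.boxplot()` call should then draw the boxes from youngest to oldest squad.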

Let’s make this chart even better looking with club colours:

In [5]:
#Create a list of colours, in the order our teams appear on the plot
CSLcols = ("#FF0000", "#9A050A", "#112987", "#00A4FA", "#FF6600", "#008040", "#004EA1", "#5B0CB3", "#E50211", "#FF0000", 
           "#00519A",  "#75A315", "#E70008", "#E40000", "#C80815", "#FF3300")

#Create the palette with 'sns.color_palette()' and pass our list as an argument
CSLpalette = sns.color_palette(CSLcols)

fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

#Add an extra argument, our new palette
ax = sns.boxplot(x="Team", y="Age", data=data, palette=CSLpalette)
plt.xticks(rotation=65)
Out[5]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

Great effort, that looks so much better! Now our viewers can easily pick out their own teams.
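The palette above only works if the colours are listed in exactly the order the teams appear on the plot. A sketch of a less fragile alternative is to key the colours by team name. The team names, colours and the helper `palette_for` below are all hypothetical.

```python
# Hypothetical team names and colours, for illustration only
team_colours = {
    "Team A": "#FF0000",
    "Team B": "#112987",
    "Team C": "#008040",
}


def palette_for(teams, colours, default="#999999"):
    """Map each team to its colour, falling back to grey for unknown teams."""
    return {team: colours.get(team, default) for team in teams}


# With real data, the teams could come from data["Team"].unique()
palette = palette_for(["Team C", "Team A", "Team D"], team_colours)
print(palette)
```

Seaborn’s `palette` argument should accept a dictionary like this, so each team keeps its colour even if the order of the boxes changes.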

Summary

This article showed how to use boxplots to quickly plot a large dataset and compare distributions across categories. Seaborn’s ‘boxplot()’ command makes it easy to draw, then customise the plots.
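As a recap, the sizing and label rotation steps can be gathered into one reusable function. This is only a sketch: the function name `age_boxplot`, its defaults and the sample data are assumptions, and it uses the Axes methods rather than the `plt` calls from the cells above.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def age_boxplot(df, order=None, size=(14, 5), rotation=65):
    """Draw one box of ages per team and return the Axes."""
    # Set the figure size when the plot area is created
    fig, ax = plt.subplots(figsize=size)
    sns.boxplot(x="Team", y="Age", data=df, order=order, ax=ax)
    # Rotate the team names so they do not overlap
    ax.tick_params(axis="x", labelrotation=rotation)
    return ax


# Hypothetical data, made up for illustration
players = pd.DataFrame({
    "Team": ["Team A"] * 4 + ["Team B"] * 4,
    "Age": [29, 31, 33, 30, 21, 23, 22, 26],
})

ax = age_boxplot(players)
```

Returning the Axes leaves room for further tweaks, such as a title, before the figure is shown or saved.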

One shortcoming of boxplots is that we cannot see exactly how many values there are at each point. The boxes and lines are only a summary, and all sorts of patterns can be hiding in them. You might want to take a look at violin plots for a way of getting around this.
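To see how much a box can hide, here is a small standard-library sketch with two made-up sets of ages that produce exactly the same box. The function name `five_numbers` and both datasets are hypothetical.

```python
import statistics


def five_numbers(values):
    """Return (min, Q1, median, Q3, max) for a group of values."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


# Made-up ages: one squad spread evenly, one bunched around 23 and 29
evenly_spread = [20, 22, 24, 26, 28, 30, 32]
clumped = [20, 23, 23, 26, 29, 29, 32]

print(five_numbers(evenly_spread))
print(five_numbers(clumped))
```

Both squads give the same five numbers, so their boxes would be identical even though the ages are distributed differently.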

Next up, take a look at other visualisation types – or learn how to scrape data so that you can look at other leagues!

Posted by FCPythonADMIN in Visualisation