Python’s data visualisation libraries are great for exploratory and descriptive data analysis. When you have a new dataset, you may want to look at relationships en masse and then drilldown into something that you find particularly interesting. Python’s Seaborn module’s ‘.pairplot’ is one way to carry out your initial look at your data. This example takes a look at a few columns from a fantasy footall dataset, edited from here).
Plug in our modules, fire up the dataset and see what we’re dealing with.
import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline import pandas as pd import numpy as np
data = pd.read_csv("../../Data/Fantasy_Football.csv") data.head()
So we have a row for each player, containing their names, team and some numerical data including squad number, cost, selection and points.
These numbers, while readable individually, are impossible to read and make much use out of beyond a line-by-line understanding. Seaborn’s ‘.pairplot()’ allows us to take in a huge amount of data and see any relationships and the spread of each data point. It will take each numerical column, put them on both the x and y axes and plot a a scatter plot where they meet. Where the same variables meet, we get a histogram that shows the distribution of our variables. Let’s check out the default plot:
So this is a lot of data to look at. While it is very useful, it can be quite overwhelming. Let’s use the ‘vars’ argument within pairplot to focus on a few variables.
We’ll also change our scatterplot to a regression type with ‘kind’, so that we can see the regression model that Seaborn would create if we were to use a reg plot. Now we’ll be able to better see any relationships:
sns.pairplot(data, vars=["now_cost","selected_by_percent","total_points"], kind="reg") plt.show()
That looks much more manageable! See how easy it is to create a complicated plot, that tells us a lot about our data very quickly? We can now see that most players are picked by nobody/very few people, and that the clearest relationship is between popularity and points – as we’d probably expect. Perhaps less predictably, the relationship between points and cost is comparatively weak.
This sets us up for a more comprehensive look at fantasy football, but hopefully this article goes to show how easy it can be to knock together an exploratory data visualisation with Seaborn’s pairplot. There are many more arguments that we could pass to improve this, from the colour (“hue=’position’, for example), to other types of plots within our pairplot. Take a look at the docs to find out all of your options.
After your exploratory analysis, you might want to check out our describing datasets article to go further!