Looking for things that cause other things is one of the most common investigations into data. While correlation (a relationship between variables) does not equal causation, it will often point you in the right direction and aid your understanding of the relationships in your data set.
You can calculate the correlation for every variable against every other variable one pair at a time, but this is a lengthy and inefficient process with large amounts of data. In these cases, seaborn gives us a function to visualise correlations. We can then focus our investigations on whatever looks interesting.
Let’s get our modules imported and a dataset of player attributes ready to go, and then we can take a look at the correlations.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("../../Data/FIFAPlayers.csv")
data.head(2)
Our data has lots of columns that are not attribute ratings, so let’s .drop() these from our dataset.
data = data.drop(["player_api_id","preferred_foot","attacking_work_rate","defensive_work_rate","player_name","birthday",
"p_id","height","weight"],axis=1)
data.head(2)
Now we have 35 columns, and a row for each player.
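Listing every column to drop works, but text columns can also be filtered out by type. Below is a minimal sketch on a made-up four-player table (the values are invented, not taken from FIFAPlayers.csv). Numeric columns that are not ratings, such as IDs, height or weight, would still need an explicit .drop().

```python
import pandas as pd

# Toy stand-in for the player table (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D"],
    "preferred_foot": ["left", "right", "right", "left"],
    "crossing": [49, 71, 80, 33],
    "finishing": [44, 65, 77, 30],
})

# Keep only the numeric columns, whatever they happen to be called
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))

# Or leave the frame alone and ask .corr() to skip text columns (pandas 1.5+)
corr = toy.corr(numeric_only=True)
print(corr.shape)
```

Recent pandas versions raise an error if ‘.corr()’ meets a text column, so one of these two approaches is worth having to hand.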
As mentioned, we want to see the correlation between the variables. Knowing these correlations might help us to uncover relationships that help us to better understand our data in the real world.
DataFrames can calculate the correlations really easily using the ‘.corr()’ method. Let’s see what that gives us:
data.corr()
We get 35 rows and 35 columns – one of each for every variable. The values show the correlation score between the row and column at each point. Values range from 1 (very strong positive correlation: as one goes up, the other tends to go up too) to -1 (very strong negative correlation: as one goes up, the other tends to go down), via 0 (no linear relationship).
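To see where a score like this comes from, here is a small sketch that works out the Pearson correlation by hand and compares it with what pandas returns. The five players and their ratings are made up for illustration, and pearson_r is a helper written for this example rather than a library function.

```python
import numpy as np
import pandas as pd

# Toy ratings for five hypothetical players (made-up numbers)
toy = pd.DataFrame({
    "curve":    [40, 55, 60, 72, 85],
    "crossing": [38, 50, 65, 70, 88],
})

def pearson_r(x, y):
    """Pearson correlation: shared variation scaled by each variable's own spread."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # distance of each value from its column mean
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

manual = pearson_r(toy["curve"], toy["crossing"])
from_pandas = toy["curve"].corr(toy["crossing"])  # Series.corr defaults to Pearson

print(round(manual, 4), round(from_pandas, 4))
```

The two numbers should agree, which is a handy sanity check that ‘.corr()’ is doing what we think it is.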
So looking at our table, the correlation score (proper name: Pearson’s correlation coefficient, or r) between curve and crossing is 0.8, suggesting a strong relationship. We would expect this: if you can curve the ball, you tend to be able to cross.
Additionally, heading accuracy has only a weak relationship (0.17) with potential ability. So, if like me, you are awful in the air, you can still make it!
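Rather than scanning the table by eye, we can pull out a single score or list every pair from strongest to weakest. This is a rough sketch on randomly generated data: the column names mirror the attributes discussed above but are assumptions about how the real columns are labelled, and the numbers are simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
curve = rng.normal(60, 12, n)
toy = pd.DataFrame({
    "curve": curve,
    "crossing": curve + rng.normal(0, 6, n),    # built to track curve closely
    "heading_accuracy": rng.normal(55, 15, n),  # independent noise
})

corr = toy.corr()

# One cell of the matrix: row label first, then column label
curve_vs_crossing = corr.loc["curve", "crossing"]
print(round(curve_vs_crossing, 2))

# List each pair once: keep the upper triangle (k=1 skips the diagonal), then flatten
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative relationships near the top alongside the strong positive ones.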
Looking through lots of numbers is pretty draining – so let’s visualise this table with seaborn’s ‘.heatmap()’:
fig, ax = plt.subplots()
fig.set_size_inches(14, 10)
sns.heatmap(data.corr(), ax=ax)
There is a lot happening here, and we wouldn’t try to present insights with this, but we can still learn something from it.
Clearly, goalkeepers are not rated for their outfield ability! There is a negative correlation between the GK skills and outfield skills – as shown by the streaks of black and purple.
Similarly, we can see a negative correlation between strength and both acceleration and agility. Got a strong player? They are unlikely to be quick or agile. If you can find one that is, they should command a decent fee due to their unique abilities!
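Because correlation runs from -1 to 1, a colour scale centred on 0 can make negative streaks like these easier to pick out. Here is one possible variation on the heatmap, sketched on simulated data: the attribute names are borrowed from the discussion above, and the numbers are generated so that strength pulls against pace.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 300
strength = rng.normal(65, 10, n)
toy = pd.DataFrame({
    "strength": strength,
    "acceleration": 130 - strength + rng.normal(0, 8, n),  # falls as strength rises
    "agility": 125 - strength + rng.normal(0, 10, n),
    "reactions": rng.normal(60, 9, n),                      # unrelated to the rest
})
corr = toy.corr()

# The matrix is symmetric, so hide the upper triangle (and diagonal) to cut clutter
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, mask=mask, cmap="coolwarm", vmin=-1, vmax=1, center=0,
            annot=True, fmt=".2f", square=True, ax=ax)
ax.set_title("Correlation between toy attributes")
```

Fixing vmin and vmax at -1 and 1 keeps the colours comparable from one chart to the next, and annot=True prints each score in its cell, which is only readable with a handful of columns.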
Summary
In a page, we have been able to take a big dataset and try to ascertain relationships within it. By using ‘.corr()’ and ‘.heatmap()’, we created a numerical table and a graphical chart that make those relationships easy to see.
With our example, we spotted how stronger players usually lack pace and agility. Also looking at the chart above, reactions seems to be the best indicator of overall rating. Maybe being a talented player isn’t just about being quick or scoring from 35 yards. Maybe reading the game is the key!
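To check a hunch like the one about reactions without squinting at colours, we can take one column of the correlation table and sort it. The sketch below uses simulated data in which overall_rating is deliberately built mostly from reactions, so the ranking is baked in. The column names are assumptions, and the real dataset may label them differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 500
reactions = rng.normal(65, 9, n)
toy = pd.DataFrame({
    "reactions": reactions,
    "sprint_speed": rng.normal(68, 12, n),
    "long_shots": rng.normal(55, 15, n),
})
# Simulated target: mostly reactions, a little pace, plus noise
toy["overall_rating"] = 0.8 * reactions + 0.1 * toy["sprint_speed"] + rng.normal(0, 4, n)

# Take the overall_rating column of the matrix, drop its self-correlation, rank the rest
ranked = (
    toy.corr()["overall_rating"]
    .drop("overall_rating")
    .sort_values(ascending=False)
)
print(ranked)
```

On the real data, the same three chained calls on data.corr() would give a ranked list of every attribute against whichever rating column you choose.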
Next up, take a different look at plotting relationships between variables with scatter plots, or read up on correlation as a whole.