Note: Apologies for the table formatting in this article. It will be fixed soon, but for now, hopefully the code and visualisations will explain what we are learning here!

Looking for things that cause other things is one of the most common investigations into data. While correlation (a relationship between variables) does not equal cause, it will often point you in the right direction and aid your understanding of the relationships in your data set.

You can calculate the correlation for every variable against every other variable, but this is a lengthy and inefficient process with large amounts of data. In these cases, seaborn gives us a function to visualise correlations. We can then focus our investigations onto what is interesting from this.

Let’s get our modules imported and a dataset of player attributes ready to go, and then we can take a look at the correlations.

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("../../Data/FIFAPlayers.csv")

data.head(2)

Out[1]:

	player_api_id	overall_rating	potential	preferred_foot	attacking_work_rate	defensive_work_rate	crossing	finishing	heading_accuracy	short_passing	…	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes	player_name	birthday	p_id	height	weight
0	307224	64	68	right	medium	low	44	63	73	49	…	12	12	7	11	12	Kevin Koubemba	23/03/1993 00:00	307224	193.04	198
1	512726	63	72	right	medium	medium	51	66	55	57	…	11	12	12	12	7	Yanis Mbombo Lokwa	08/04/1994 00:00	512726	177.80	172

2 rows × 44 columns

Our data has lots of columns that are not attribute ratings, so let’s .drop() these from our dataset.

In [2]:

data = data.drop(["player_api_id","preferred_foot","attacking_work_rate","defensive_work_rate","player_name","birthday",
                 "p_id","height","weight"],axis=1)

data.head(2)

Out[2]:

	overall_rating	potential	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	…	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
0	64	68	44	63	73	49	52	52	42	31	…	55	65	22	22	25	12	12	7	11	12
1	63	72	51	66	55	57	60	64	50	39	…	48	59	15	16	12	11	12	12	12	7

2 rows × 35 columns

Now we have 35 columns, and a row for each player.
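Correlation is only defined for numeric columns, and recent pandas versions raise an error if text columns are left in when calling ‘.corr()’. A quick check like the one below can confirm that nothing non-numeric remains. This is only a sketch: the small `players` frame is a made-up stand-in for our data, not the real file.

```python
import pandas as pd

# Tiny stand-in for the player table; the values are invented for illustration.
players = pd.DataFrame({
    "overall_rating": [64, 63, 70],
    "crossing": [44, 51, 62],
    "preferred_foot": ["right", "right", "left"],
})

# List any columns that are not numeric and would need dropping before .corr()
non_numeric = players.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Drop them and confirm what is left
numeric_only = players.drop(columns=non_numeric)
print(numeric_only.shape)
```

If the list comes back empty for your own DataFrame, every remaining column can take part in the correlation table.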

As mentioned, we want to see the correlation between the variables. Knowing these correlations might help us to uncover relationships that help us to better understand our data in the real world.

DataFrames can calculate the correlations really easily using the ‘.corr()’ method. Let’s see what that gives us:

In [3]:

data.corr()

Out[3]:

	overall_rating	potential	crossing	finishing	heading_accuracy	short_passing	volleys	dribbling	curve	free_kick_accuracy	…	vision	penalties	marking	standing_tackle	sliding_tackle	gk_diving	gk_handling	gk_kicking	gk_positioning	gk_reflexes
overall_rating	1.000000	0.812783	0.289583	0.260644	0.241417	0.409090	0.298047	0.282968	0.322068	0.286211	…	0.415812	0.272004	0.122715	0.148644	0.126426	0.023669	0.025814	0.022068	0.024864	0.025722
potential	0.812783	1.000000	0.240445	0.247969	0.172263	0.374329	0.245980	0.318962	0.267889	0.204292	…	0.355629	0.217775	0.066954	0.094283	0.080880	-0.021628	-0.021715	-0.024431	-0.024717	-0.020982
crossing	0.289583	0.240445	1.000000	0.612054	0.405483	0.806751	0.657961	0.836678	0.824757	0.737226	…	0.683588	0.626716	0.281258	0.330608	0.315528	-0.664204	-0.660893	-0.657448	-0.664633	-0.666669
finishing	0.260644	0.247969	0.612054	1.000000	0.392763	0.624952	0.876722	0.797271	0.722722	0.666326	…	0.695608	0.812296	-0.239855	-0.177554	-0.218832	-0.535227	-0.528907	-0.529505	-0.535533	-0.535879
heading_accuracy	0.241417	0.172263	0.405483	0.392763	1.000000	0.581259	0.405780	0.459740	0.359974	0.334387	…	0.235034	0.475056	0.490770	0.515285	0.478135	-0.737197	-0.734731	-0.729859	-0.734677	-0.736483
short_passing	0.409090	0.374329	0.806751	0.624952	0.581259	1.000000	0.660882	0.825260	0.769699	0.723058	…	0.740271	0.664610	0.387261	0.451524	0.410343	-0.744615	-0.740532	-0.735705	-0.742899	-0.742561
volleys	0.298047	0.245980	0.657961	0.876722	0.405780	0.660882	1.000000	0.790252	0.772782	0.716469	…	0.711074	0.799573	-0.144672	-0.078617	-0.115459	-0.543064	-0.539224	-0.537686	-0.544100	-0.543293
dribbling	0.282968	0.318962	0.836678	0.797271	0.459740	0.825260	0.790252	1.000000	0.831146	0.731280	…	0.736945	0.738639	0.095979	0.159148	0.130645	-0.729865	-0.725150	-0.722204	-0.728617	-0.729229
curve	0.322068	0.267889	0.824757	0.722722	0.359974	0.769699	0.772782	0.831146	1.000000	0.838848	…	0.742520	0.728738	0.092416	0.154444	0.124954	-0.601724	-0.597503	-0.595797	-0.603641	-0.603702
free_kick_accuracy	0.286211	0.204292	0.737226	0.666326	0.334387	0.723058	0.716469	0.731280	0.838848	1.000000	…	0.715175	0.723361	0.110824	0.173130	0.136704	-0.552613	-0.548308	-0.545713	-0.552341	-0.552423
long_passing	0.390576	0.323141	0.744002	0.436469	0.459398	0.883278	0.504547	0.677695	0.688117	0.677657	…	0.679710	0.517182	0.492372	0.547532	0.513989	-0.616808	-0.612804	-0.607938	-0.613013	-0.614926
ball_control	0.369979	0.364856	0.829437	0.751184	0.592930	0.908914	0.761881	0.929211	0.820620	0.741906	…	0.740002	0.747634	0.244372	0.307811	0.269671	-0.796325	-0.791156	-0.786738	-0.793834	-0.795043
acceleration	0.172540	0.294898	0.604123	0.533289	0.175644	0.486635	0.500103	0.711492	0.550935	0.419734	…	0.433572	0.429630	-0.027721	0.002464	0.010518	-0.474127	-0.470427	-0.466134	-0.474372	-0.472675
sprint_speed	0.184337	0.302025	0.580925	0.506892	0.252968	0.472873	0.474557	0.678581	0.510566	0.377297	…	0.379296	0.410285	0.016285	0.045523	0.051699	-0.495136	-0.490773	-0.487010	-0.496836	-0.495392
agility	0.213297	0.265458	0.638343	0.581714	0.097808	0.541594	0.575744	0.730591	0.639356	0.529638	…	0.578416	0.491535	-0.084819	-0.046347	-0.046212	-0.416865	-0.413718	-0.412907	-0.417915	-0.418175
reactions	0.812882	0.610956	0.302747	0.289203	0.207219	0.390003	0.333682	0.287788	0.339246	0.309461	…	0.443083	0.292896	0.083667	0.116702	0.094353	0.004749	0.007232	0.003514	0.008106	0.006593
balance	0.090975	0.157920	0.600476	0.454938	0.036127	0.502635	0.467108	0.636712	0.572920	0.487575	…	0.508085	0.414281	0.021879	0.053601	0.062066	-0.420497	-0.416812	-0.411571	-0.419347	-0.420619
shot_power	0.340539	0.271335	0.693072	0.761025	0.587076	0.757514	0.779821	0.780808	0.748280	0.728748	…	0.650626	0.762647	0.153358	0.215236	0.173882	-0.676710	-0.674647	-0.670248	-0.677315	-0.678294
jumping	0.233181	0.151305	0.042990	0.009348	0.278075	0.079908	0.021161	0.044958	0.000967	-0.034224	…	-0.019001	0.028358	0.194274	0.184933	0.199548	-0.076742	-0.075963	-0.074961	-0.072947	-0.072454
stamina	0.254705	0.224281	0.639279	0.429187	0.549530	0.673812	0.450170	0.636785	0.542305	0.479625	…	0.456012	0.447388	0.455449	0.498886	0.476347	-0.668173	-0.667285	-0.660257	-0.666142	-0.669817
strength	0.216657	0.053890	-0.141812	-0.093178	0.464772	0.023449	-0.077781	-0.161178	-0.160488	-0.111278	…	-0.154167	-0.020508	0.318745	0.321753	0.289880	-0.078627	-0.077770	-0.080239	-0.079420	-0.078672
long_shots	0.322930	0.263214	0.733510	0.835431	0.426175	0.754340	0.844590	0.824242	0.822106	0.802593	…	0.753813	0.785956	0.016019	0.085704	0.044695	-0.598275	-0.593260	-0.591472	-0.598032	-0.597834
aggression	0.267311	0.136795	0.398475	0.138608	0.665951	0.533263	0.205226	0.326667	0.291623	0.300151	…	0.212868	0.262368	0.693004	0.720424	0.694169	-0.569414	-0.569710	-0.562646	-0.565521	-0.567155
interceptions	0.203556	0.113544	0.336912	-0.166087	0.485038	0.461231	-0.061924	0.157309	0.172199	0.195544	…	0.107490	0.017175	0.920332	0.932888	0.918318	-0.449759	-0.452921	-0.442787	-0.446243	-0.447873
positioning	0.272853	0.252169	0.745236	0.880549	0.432832	0.726891	0.846754	0.870291	0.789513	0.705545	…	0.749408	0.787595	-0.073362	-0.004290	-0.039903	-0.623406	-0.618075	-0.615414	-0.622558	-0.623916
vision	0.415812	0.355629	0.683588	0.695608	0.235034	0.740271	0.711074	0.736945	0.742520	0.715175	…	1.000000	0.660357	-0.008086	0.062337	0.022254	-0.414011	-0.409150	-0.405800	-0.413931	-0.413986
penalties	0.272004	0.217775	0.626716	0.812296	0.475056	0.664610	0.799573	0.738639	0.728738	0.723361	…	0.660357	1.000000	-0.049964	0.009501	-0.033576	-0.585485	-0.579903	-0.578311	-0.586871	-0.586875
marking	0.122715	0.066954	0.281258	-0.239855	0.490770	0.387261	-0.144672	0.095979	0.092416	0.110824	…	-0.008086	-0.049964	1.000000	0.962230	0.964033	-0.445132	-0.448957	-0.440554	-0.442125	-0.443661
standing_tackle	0.148644	0.094283	0.330608	-0.177554	0.515285	0.451524	-0.078617	0.159148	0.154444	0.173130	…	0.062337	0.009501	0.962230	1.000000	0.972040	-0.489806	-0.491665	-0.484632	-0.486144	-0.488189
sliding_tackle	0.126426	0.080880	0.315528	-0.218832	0.478135	0.410343	-0.115459	0.130645	0.124954	0.136704	…	0.022254	-0.033576	0.964033	0.972040	1.000000	-0.457093	-0.459105	-0.451598	-0.453518	-0.455968
gk_diving	0.023669	-0.021628	-0.664204	-0.535227	-0.737197	-0.744615	-0.543064	-0.729865	-0.601724	-0.552613	…	-0.414011	-0.585485	-0.445132	-0.489806	-0.457093	1.000000	0.965387	0.960186	0.966969	0.971412
gk_handling	0.025814	-0.021715	-0.660893	-0.528907	-0.734731	-0.740532	-0.539224	-0.725150	-0.597503	-0.548308	…	-0.409150	-0.579903	-0.448957	-0.491665	-0.459105	0.965387	1.000000	0.957973	0.965273	0.965426
gk_kicking	0.022068	-0.024431	-0.657448	-0.529505	-0.729859	-0.735705	-0.537686	-0.722204	-0.595797	-0.545713	…	-0.405800	-0.578311	-0.440554	-0.484632	-0.451598	0.960186	0.957973	1.000000	0.959491	0.960523
gk_positioning	0.024864	-0.024717	-0.664633	-0.535533	-0.734677	-0.742899	-0.544100	-0.728617	-0.603641	-0.552341	…	-0.413931	-0.586871	-0.442125	-0.486144	-0.453518	0.966969	0.965273	0.959491	1.000000	0.967059
gk_reflexes	0.025722	-0.020982	-0.666669	-0.535879	-0.736483	-0.742561	-0.543293	-0.729229	-0.603702	-0.552423	…	-0.413986	-0.586875	-0.443661	-0.488189	-0.455968	0.971412	0.965426	0.960523	0.967059	1.000000

35 rows × 35 columns

We get 35 rows and 35 columns, one of each for every variable. The values show the correlation score between the row and column at each point. Values range from 1 (very strong positive correlation: as one goes up, the other tends to go up too) to -1 (very strong negative correlation: as one goes up, the other tends to go down), via 0 (no linear relationship).
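To see where the 1, 0 and -1 extremes come from, here is a minimal sketch of the Pearson calculation that ‘.corr()’ uses by default. The `pearson_r` helper and the three short lists are invented for illustration and are not part of the player data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # distance of each value from its mean
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

base = [1, 2, 3, 4, 5]
rises_together = pearson_r(base, [2, 4, 6, 8, 10])   # perfectly in step
moves_opposite = pearson_r(base, [10, 8, 6, 4, 2])   # perfectly out of step
no_linear_link = pearson_r(base, [2, 1, 3, 1, 2])    # no overall trend

print(rises_together, moves_opposite, no_linear_link)
```

The first pair lands at the top of the scale, the second at the bottom, and the third sits at (or within rounding error of) zero.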

So, looking at our table, the correlation score (proper name: Pearson’s correlation coefficient, or r) between curve and crossing is 0.82, suggesting a strong relationship. We would expect this: if you can curve the ball, you tend to be able to cross.

Additionally, heading accuracy has only a weak relationship (0.17) with potential ability. So if, like me, you are awful in the air, you can still make it!
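Rather than scanning the whole table for one number, a single pair can be looked up by label. The sketch below uses a made-up `ratings` frame of five hypothetical players, so the values it prints will not match the table above.

```python
import pandas as pd

# Invented ratings for five hypothetical players
ratings = pd.DataFrame({
    "curve": [42, 50, 61, 70, 78],
    "crossing": [44, 51, 58, 72, 75],
    "heading_accuracy": [73, 55, 60, 48, 66],
})

corr = ratings.corr()

# One cell of the matrix, looked up by row label and column label
curve_vs_crossing = corr.loc["curve", "crossing"]

# The same number, calculated directly without building the whole matrix
direct = ratings["curve"].corr(ratings["crossing"])

print(round(curve_vs_crossing, 3), round(direct, 3))
```

Both routes give the same value; the second is handy when you only care about one relationship.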

Looking through lots of numbers is pretty draining, so let’s visualise this table with a ‘.heatmap()’:

In [4]:

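# Create a large figure so the 35 x 35 grid stays legible
fig, ax = plt.subplots()
fig.set_size_inches(14, 10)

# Draw the correlation matrix as a heatmap on our axes
ax = sns.heatmap(data.corr(), ax=ax)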

There is a lot happening here, and we wouldn’t try to present insights with this, but we can still learn something from it.

Clearly, goalkeepers are not rated for their outfield ability! There is negative correlation between the GK skills and outfield skills – as shown by the streaks of black and purple.

Similarly, we can see negative correlation between strength and both acceleration and agility. Got a strong player? They are less likely to be quick or agile. If you can find one that is, they should command a decent fee due to their unique abilities!
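The default palette can make it hard to tell positive from negative at a glance. One possible tweak, sketched here on random stand-in data rather than the player file, is to pin the colour scale to the full -1 to 1 range and centre it on 0, so that negative and positive correlations get different hues. The `demo` frame and the choice of the "coolwarm" palette are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random stand-in data: the column names are borrowed from the article,
# but the values are NOT real player ratings.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(1, 100, size=(50, 4)),
    columns=["strength", "acceleration", "agility", "gk_diving"],
)
corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))

# vmin/vmax fix the scale, center=0 puts the neutral colour at "no correlation",
# and annot=True prints each value inside its cell.
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
            annot=True, fmt=".2f", ax=ax)

# In a plain script (outside a notebook), call plt.show() to display the figure.
```

Printing the values in each cell works well for a handful of columns, but would likely be too crowded for all 35.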

Summary

In a page, we have been able to take a big dataset and try to ascertain relationships within it. By using ‘.corr()’ and ‘.heatmap()’, we created a numerical table and a graphical chart that summarise the relationships in the data.

With our example, we spotted how stronger players tend to lack pace and agility. Also, looking at the chart above, reactions seems to be the best indicator of overall rating. Maybe being a talented player isn’t just about being quick or scoring from 35 yards. Maybe reading the game is the key!
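To find the best indicator of one variable without reading it off the chart, one option is to pull that variable's column out of the correlation table and sort it. The sketch below does this on an invented `ratings` frame, so its ranking is illustrative rather than a result from the player data.

```python
import pandas as pd

# Invented ratings for five hypothetical players
ratings = pd.DataFrame({
    "overall_rating": [60, 64, 68, 72, 80],
    "reactions": [58, 63, 66, 74, 79],
    "sprint_speed": [80, 62, 75, 70, 68],
    "gk_diving": [12, 11, 70, 9, 13],
})

ranking = (
    ratings.corr()["overall_rating"]   # correlations of everything with overall_rating
    .drop("overall_rating")            # remove the trivial 1.0 with itself
    .sort_values(ascending=False)      # strongest positive relationship first
)
print(ranking)
```

Sorting by `ranking.abs()` instead would rank by strength regardless of direction, which may suit cases where strong negative relationships matter just as much.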

Next up, take a different look at plotting relationships between variables with scatter plots, or read up on correlation as a whole.
