Looking for Correlations with Heatmaps in Seaborn

Note: Apologies for the table formatting in this article. They’ll be fixed soon, but for now, hopefully the code and visualisations will explain what we are learning here!

Looking for things that cause other things is one of the most common investigations into data. While correlation (a relationship between variables) does not equal cause, it will often point you in the right direction and help you understand the relationships in your data set.

You can work through the correlation for every variable against every other variable one pair at a time, but this is slow going with large amounts of data. In these cases, seaborn gives us a function to visualise all of the correlations at once. We can then focus our investigations on whatever looks interesting.

Let’s get our modules imported and a dataset of player attributes ready to go, then we can take a look at the correlations.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("../../Data/FIFAPlayers.csv")

data.head(2)
Out[1]:
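|   | player_api_id | overall_rating | potential | preferred_foot | attacking_work_rate | defensive_work_rate | crossing | finishing | heading_accuracy | short_passing | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes | player_name | birthday | p_id | height | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 307224 | 64 | 68 | right | medium | low | 44 | 63 | 73 | 49 | 12 | 12 | 7 | 11 | 12 | Kevin Koubemba | 23/03/1993 00:00 | 307224 | 193.04 | 198 |
| 1 | 512726 | 63 | 72 | right | medium | medium | 51 | 66 | 55 | 57 | 11 | 12 | 12 | 12 | 7 | Yanis Mbombo Lokwa | 08/04/1994 00:00 | 512726 | 177.80 | 172 |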

2 rows × 44 columns

Our data has lots of columns that are not attribute ratings, so let’s .drop() these from our dataset.

In [2]:
data = data.drop(["player_api_id","preferred_foot","attacking_work_rate","defensive_work_rate","player_name","birthday",
                 "p_id","height","weight"],axis=1)

data.head(2)
Out[2]:
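|   | overall_rating | potential | crossing | finishing | heading_accuracy | short_passing | volleys | dribbling | curve | free_kick_accuracy | vision | penalties | marking | standing_tackle | sliding_tackle | gk_diving | gk_handling | gk_kicking | gk_positioning | gk_reflexes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 64 | 68 | 44 | 63 | 73 | 49 | 52 | 52 | 42 | 31 | 55 | 65 | 22 | 22 | 25 | 12 | 12 | 7 | 11 | 12 |
| 1 | 63 | 72 | 51 | 66 | 55 | 57 | 60 | 64 | 50 | 39 | 48 | 59 | 15 | 16 | 12 | 11 | 12 | 12 | 12 | 7 |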

2 rows × 35 columns

Now we have 35 columns, and a row for each player.
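If you want to try the column-dropping step without the FIFA file to hand, here is a minimal sketch on a tiny made-up DataFrame. The values and the `toy`, `non_ratings` and `ratings` names are invented for illustration; only the column names are borrowed from the real data. It uses the `columns=` keyword, which does the same job as passing the list with `axis=1`.

```python
import pandas as pd

# Tiny made-up frame standing in for the real FIFA data (values are illustrative only)
toy = pd.DataFrame({
    "player_name": ["Player A", "Player B"],
    "preferred_foot": ["right", "left"],
    "height": [180.3, 175.2],
    "crossing": [44, 51],
    "finishing": [63, 66],
})

# Columns that describe the player rather than rate an attribute
non_ratings = ["player_name", "preferred_foot", "height"]

# columns= is equivalent to passing the list along with axis=1
ratings = toy.drop(columns=non_ratings)

print(ratings.columns.tolist())  # ['crossing', 'finishing']
```

Note that `.drop()` returns a new DataFrame rather than changing the original, which is why the article assigns the result back to `data`.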

As mentioned, we want to see the correlation between the variables. Knowing these correlations might help us to uncover relationships that help us to better understand our data in the real world.

DataFrames can calculate the correlations really easily using the ‘.corr()’ method. Let’s see what that gives us:

In [3]:
data.corr()
Out[3]:
overall_rating potential crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
overall_rating 1.000000 0.812783 0.289583 0.260644 0.241417 0.409090 0.298047 0.282968 0.322068 0.286211 0.415812 0.272004 0.122715 0.148644 0.126426 0.023669 0.025814 0.022068 0.024864 0.025722
potential 0.812783 1.000000 0.240445 0.247969 0.172263 0.374329 0.245980 0.318962 0.267889 0.204292 0.355629 0.217775 0.066954 0.094283 0.080880 -0.021628 -0.021715 -0.024431 -0.024717 -0.020982
crossing 0.289583 0.240445 1.000000 0.612054 0.405483 0.806751 0.657961 0.836678 0.824757 0.737226 0.683588 0.626716 0.281258 0.330608 0.315528 -0.664204 -0.660893 -0.657448 -0.664633 -0.666669
finishing 0.260644 0.247969 0.612054 1.000000 0.392763 0.624952 0.876722 0.797271 0.722722 0.666326 0.695608 0.812296 -0.239855 -0.177554 -0.218832 -0.535227 -0.528907 -0.529505 -0.535533 -0.535879
heading_accuracy 0.241417 0.172263 0.405483 0.392763 1.000000 0.581259 0.405780 0.459740 0.359974 0.334387 0.235034 0.475056 0.490770 0.515285 0.478135 -0.737197 -0.734731 -0.729859 -0.734677 -0.736483
short_passing 0.409090 0.374329 0.806751 0.624952 0.581259 1.000000 0.660882 0.825260 0.769699 0.723058 0.740271 0.664610 0.387261 0.451524 0.410343 -0.744615 -0.740532 -0.735705 -0.742899 -0.742561
volleys 0.298047 0.245980 0.657961 0.876722 0.405780 0.660882 1.000000 0.790252 0.772782 0.716469 0.711074 0.799573 -0.144672 -0.078617 -0.115459 -0.543064 -0.539224 -0.537686 -0.544100 -0.543293
dribbling 0.282968 0.318962 0.836678 0.797271 0.459740 0.825260 0.790252 1.000000 0.831146 0.731280 0.736945 0.738639 0.095979 0.159148 0.130645 -0.729865 -0.725150 -0.722204 -0.728617 -0.729229
curve 0.322068 0.267889 0.824757 0.722722 0.359974 0.769699 0.772782 0.831146 1.000000 0.838848 0.742520 0.728738 0.092416 0.154444 0.124954 -0.601724 -0.597503 -0.595797 -0.603641 -0.603702
free_kick_accuracy 0.286211 0.204292 0.737226 0.666326 0.334387 0.723058 0.716469 0.731280 0.838848 1.000000 0.715175 0.723361 0.110824 0.173130 0.136704 -0.552613 -0.548308 -0.545713 -0.552341 -0.552423
long_passing 0.390576 0.323141 0.744002 0.436469 0.459398 0.883278 0.504547 0.677695 0.688117 0.677657 0.679710 0.517182 0.492372 0.547532 0.513989 -0.616808 -0.612804 -0.607938 -0.613013 -0.614926
ball_control 0.369979 0.364856 0.829437 0.751184 0.592930 0.908914 0.761881 0.929211 0.820620 0.741906 0.740002 0.747634 0.244372 0.307811 0.269671 -0.796325 -0.791156 -0.786738 -0.793834 -0.795043
acceleration 0.172540 0.294898 0.604123 0.533289 0.175644 0.486635 0.500103 0.711492 0.550935 0.419734 0.433572 0.429630 -0.027721 0.002464 0.010518 -0.474127 -0.470427 -0.466134 -0.474372 -0.472675
sprint_speed 0.184337 0.302025 0.580925 0.506892 0.252968 0.472873 0.474557 0.678581 0.510566 0.377297 0.379296 0.410285 0.016285 0.045523 0.051699 -0.495136 -0.490773 -0.487010 -0.496836 -0.495392
agility 0.213297 0.265458 0.638343 0.581714 0.097808 0.541594 0.575744 0.730591 0.639356 0.529638 0.578416 0.491535 -0.084819 -0.046347 -0.046212 -0.416865 -0.413718 -0.412907 -0.417915 -0.418175
reactions 0.812882 0.610956 0.302747 0.289203 0.207219 0.390003 0.333682 0.287788 0.339246 0.309461 0.443083 0.292896 0.083667 0.116702 0.094353 0.004749 0.007232 0.003514 0.008106 0.006593
balance 0.090975 0.157920 0.600476 0.454938 0.036127 0.502635 0.467108 0.636712 0.572920 0.487575 0.508085 0.414281 0.021879 0.053601 0.062066 -0.420497 -0.416812 -0.411571 -0.419347 -0.420619
shot_power 0.340539 0.271335 0.693072 0.761025 0.587076 0.757514 0.779821 0.780808 0.748280 0.728748 0.650626 0.762647 0.153358 0.215236 0.173882 -0.676710 -0.674647 -0.670248 -0.677315 -0.678294
jumping 0.233181 0.151305 0.042990 0.009348 0.278075 0.079908 0.021161 0.044958 0.000967 -0.034224 -0.019001 0.028358 0.194274 0.184933 0.199548 -0.076742 -0.075963 -0.074961 -0.072947 -0.072454
stamina 0.254705 0.224281 0.639279 0.429187 0.549530 0.673812 0.450170 0.636785 0.542305 0.479625 0.456012 0.447388 0.455449 0.498886 0.476347 -0.668173 -0.667285 -0.660257 -0.666142 -0.669817
strength 0.216657 0.053890 -0.141812 -0.093178 0.464772 0.023449 -0.077781 -0.161178 -0.160488 -0.111278 -0.154167 -0.020508 0.318745 0.321753 0.289880 -0.078627 -0.077770 -0.080239 -0.079420 -0.078672
long_shots 0.322930 0.263214 0.733510 0.835431 0.426175 0.754340 0.844590 0.824242 0.822106 0.802593 0.753813 0.785956 0.016019 0.085704 0.044695 -0.598275 -0.593260 -0.591472 -0.598032 -0.597834
aggression 0.267311 0.136795 0.398475 0.138608 0.665951 0.533263 0.205226 0.326667 0.291623 0.300151 0.212868 0.262368 0.693004 0.720424 0.694169 -0.569414 -0.569710 -0.562646 -0.565521 -0.567155
interceptions 0.203556 0.113544 0.336912 -0.166087 0.485038 0.461231 -0.061924 0.157309 0.172199 0.195544 0.107490 0.017175 0.920332 0.932888 0.918318 -0.449759 -0.452921 -0.442787 -0.446243 -0.447873
positioning 0.272853 0.252169 0.745236 0.880549 0.432832 0.726891 0.846754 0.870291 0.789513 0.705545 0.749408 0.787595 -0.073362 -0.004290 -0.039903 -0.623406 -0.618075 -0.615414 -0.622558 -0.623916
vision 0.415812 0.355629 0.683588 0.695608 0.235034 0.740271 0.711074 0.736945 0.742520 0.715175 1.000000 0.660357 -0.008086 0.062337 0.022254 -0.414011 -0.409150 -0.405800 -0.413931 -0.413986
penalties 0.272004 0.217775 0.626716 0.812296 0.475056 0.664610 0.799573 0.738639 0.728738 0.723361 0.660357 1.000000 -0.049964 0.009501 -0.033576 -0.585485 -0.579903 -0.578311 -0.586871 -0.586875
marking 0.122715 0.066954 0.281258 -0.239855 0.490770 0.387261 -0.144672 0.095979 0.092416 0.110824 -0.008086 -0.049964 1.000000 0.962230 0.964033 -0.445132 -0.448957 -0.440554 -0.442125 -0.443661
standing_tackle 0.148644 0.094283 0.330608 -0.177554 0.515285 0.451524 -0.078617 0.159148 0.154444 0.173130 0.062337 0.009501 0.962230 1.000000 0.972040 -0.489806 -0.491665 -0.484632 -0.486144 -0.488189
sliding_tackle 0.126426 0.080880 0.315528 -0.218832 0.478135 0.410343 -0.115459 0.130645 0.124954 0.136704 0.022254 -0.033576 0.964033 0.972040 1.000000 -0.457093 -0.459105 -0.451598 -0.453518 -0.455968
gk_diving 0.023669 -0.021628 -0.664204 -0.535227 -0.737197 -0.744615 -0.543064 -0.729865 -0.601724 -0.552613 -0.414011 -0.585485 -0.445132 -0.489806 -0.457093 1.000000 0.965387 0.960186 0.966969 0.971412
gk_handling 0.025814 -0.021715 -0.660893 -0.528907 -0.734731 -0.740532 -0.539224 -0.725150 -0.597503 -0.548308 -0.409150 -0.579903 -0.448957 -0.491665 -0.459105 0.965387 1.000000 0.957973 0.965273 0.965426
gk_kicking 0.022068 -0.024431 -0.657448 -0.529505 -0.729859 -0.735705 -0.537686 -0.722204 -0.595797 -0.545713 -0.405800 -0.578311 -0.440554 -0.484632 -0.451598 0.960186 0.957973 1.000000 0.959491 0.960523
gk_positioning 0.024864 -0.024717 -0.664633 -0.535533 -0.734677 -0.742899 -0.544100 -0.728617 -0.603641 -0.552341 -0.413931 -0.586871 -0.442125 -0.486144 -0.453518 0.966969 0.965273 0.959491 1.000000 0.967059
gk_reflexes 0.025722 -0.020982 -0.666669 -0.535879 -0.736483 -0.742561 -0.543293 -0.729229 -0.603702 -0.552423 -0.413986 -0.586875 -0.443661 -0.488189 -0.455968 0.971412 0.965426 0.960523 0.967059 1.000000

35 rows × 35 columns

We get 35 rows and 35 columns, one of each for every variable. The values show the correlation score between the row and column at each point. Values will range from 1 (a perfect positive correlation: as one goes up, the other goes up too) to -1 (a perfect negative correlation: as one goes up, the other goes down), via 0 (no linear relationship).
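To get a feel for the two ends of that scale and the middle, here is a small sketch using made-up columns rather than the player data. The column names (`rises_with_x`, `falls_with_x`, `u_shape`) are invented for this example. The last one is a reminder that a score near 0 only rules out a straight-line relationship, not every kind of relationship.

```python
import numpy as np
import pandas as pd

x = np.arange(10)

# Made-up columns chosen to show the extremes of the scale
toy = pd.DataFrame({
    "x": x,
    "rises_with_x": 2 * x + 1,        # perfect positive straight line
    "falls_with_x": -3 * x,           # perfect negative straight line
    "u_shape": (x - x.mean()) ** 2,   # clearly tied to x, but not in a straight line
})

# Correlation of every column with x, rounded for readability
scores = toy.corr()["x"].round(3)
print(scores)
```

The two straight lines score 1 and -1, while the U-shaped column scores (near enough) 0 even though it is calculated directly from `x`.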

So looking at our table, the correlation score (proper name: Pearson’s correlation coefficient, or r, which is what ‘.corr()’ calculates by default) between curve and crossing is 0.82, suggesting a strong relationship. We would expect this: if you can curve the ball, you tend to be able to cross.

Additionally, heading accuracy has only a weak relationship (0.17) with potential ability. So if, like me, you are awful in the air, you can still make it!
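Rather than scanning the whole table for one pair, you can pull a single score out by its row and column labels. The sketch below uses five imaginary players with made-up ratings, so the numbers will not match the article's. It also shows that squaring the score gives r-squared, which is a separate quantity from the correlation itself.

```python
import pandas as pd

# Made-up ratings for five imaginary players
toy = pd.DataFrame({
    "curve":            [40, 55, 60, 72, 80],
    "crossing":         [42, 50, 65, 70, 78],
    "heading_accuracy": [70, 45, 62, 50, 66],
})

corr = toy.corr()  # Pearson's r by default

# Pick one pair out of the matrix by row and column label
r = corr.loc["curve", "crossing"]

# The same number, computed directly from the two columns
r_direct = toy["curve"].corr(toy["crossing"])

# Squaring r gives r-squared: a different quantity that is never negative
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))
```

Because the matrix is symmetric, `corr.loc["crossing", "curve"]` returns the same value.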

Looking through lots of numbers is pretty draining – so let’s visualise this table with a ‘.heatmap()’:

In [4]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 10)

ax=sns.heatmap(data.corr())

There is a lot happening here, and we wouldn’t try to present insights with this, but we can still learn something from it.

Clearly, goalkeepers are not rated for their outfield ability! There is negative correlation between the GK skills and outfield skills – as shown by the streaks of black and purple.

Similarly, we can see negative correlation between strength and both acceleration and agility. Got a strong player? They are less likely to be quick or agile. If you can find one that is, they should command a decent fee due to their unique abilities!
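If the default heatmap feels too busy, a few optional arguments to `sns.heatmap()` can make it easier to read. This is only a sketch of one possible tidy-up, not the article's method, and it runs on random made-up columns instead of the player data. It hides the top-right triangle (which repeats the bottom-left), pins the colour scale to the full -1 to 1 range, and prints the score in each cell.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up ratings: random numbers standing in for the real attribute columns
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "dribbling": base + rng.normal(scale=0.5, size=200),
    "ball_control": base + rng.normal(scale=0.5, size=200),
    "strength": rng.normal(size=200),
    "gk_diving": -base + rng.normal(scale=0.5, size=200),
})

corr = toy.corr()

# True above the diagonal: those cells repeat the lower triangle, so hide them
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr,
    mask=mask,
    cmap="coolwarm",        # diverging palette: blue for negative, red for positive
    vmin=-1, vmax=1,        # pin the colour scale to the full range of r
    annot=True, fmt=".2f",  # write each score in its cell
    ax=ax,
)
```

Annotations work well with a handful of columns like this. With all 35 attributes the numbers would be too cramped, so you would probably leave `annot` off.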

Summary

In a page, we have been able to take a big dataset and try to ascertain relationships within it. By using ‘.corr()’ and ‘.heatmap()’ we create numerical and graphical summaries that make those relationships easy to see.

With our example, we spotted how stronger players tend to lack pace and agility. Also, looking at the table above, reactions (0.81) seems to be the best indicator of overall rating. Maybe being a talented player isn’t just about being quick, or scoring from 35 yards. Maybe reading the game is the key!
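To find the best indicator of one variable without reading a full row by eye, you can take that variable's column from the correlation matrix and sort it. Here is a hedged sketch on made-up data, where `reactions` has been deliberately built to follow `overall_rating` closely. The real dataset will of course give different numbers.

```python
import numpy as np
import pandas as pd

# Made-up data: "reactions" is built to track overall_rating closely,
# "jumping" only loosely, and "strength" not at all
rng = np.random.default_rng(42)
n = 500
overall = rng.normal(loc=65, scale=7, size=n)
toy = pd.DataFrame({
    "overall_rating": overall,
    "reactions": overall + rng.normal(scale=2, size=n),
    "jumping": overall + rng.normal(scale=14, size=n),
    "strength": rng.normal(loc=65, scale=7, size=n),
})

# Take the overall_rating column of the matrix, drop its correlation with
# itself, and sort so the strongest positive relationships come first
ranked = (
    toy.corr()["overall_rating"]
    .drop("overall_rating")
    .sort_values(ascending=False)
)

print(ranked.round(2))
```

On the real data, the same three steps applied to `data.corr()` should list every attribute in order of how strongly it moves with overall rating.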

Next up, take a different look at plotting relationships between variables with scatter plots, or read up on correlation as a whole.

Posted by FCPythonADMIN in Visualisation