Seaborn

Exploratory Python Data Visualisation with Pairplot

Python’s data visualisation libraries are great for exploratory and descriptive data analysis. When you have a new dataset, you may want to look at relationships en masse and then drill down into anything that you find particularly interesting. Seaborn’s ‘.pairplot()’ is one way to carry out that initial look at your data. This example takes a look at a few columns from a fantasy football dataset (edited from here).

Plug in our modules, fire up the dataset and see what we’re dealing with.

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv("../../Data/Fantasy_Football.csv")
data.head()
Out[2]:
      web_name  team name       first_name  second_name  squad_number  now_cost  selected_by_percent  total_points  points_per_game  minutes  bonus
0       Ospina    Arsenal            David       Ospina            13        48                  0.2             0              0.0        0      0
1         Cech    Arsenal             Petr         Cech            33        54                  5.2            63              3.9     1440      4
2     Martinez    Arsenal  Damian Emiliano     Martinez            26        40                  0.6             0              0.0        0      0
3    Koscielny    Arsenal          Laurent    Koscielny             6        60                  1.4            48              3.7     1164      6
4  Mertesacker    Arsenal              Per  Mertesacker             4        48                  0.5            14              3.5      333      2

So we have a row for each player, containing their names, team and some numerical data including squad number, cost, selection and points.

These numbers are readable one row at a time, but it is hard to get much out of them beyond a line-by-line understanding. Seaborn’s ‘.pairplot()’ allows us to take in a huge amount of data and see any relationships, as well as the spread of each variable. It takes each numerical column, puts it on both the x and y axes and draws a scatter plot where each pair of variables meets. Where a variable meets itself, we get a histogram that shows its distribution. Let’s check out the default plot:

In [3]:
sns.pairplot(data)
plt.show()
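
Since the default plot uses every numerical column, it can help to check which columns those are before plotting. The sketch below uses a small made-up frame (the 'sample' name and its three rows are stand-ins, not the real file) and 'select_dtypes()' from pandas to list them:

```python
import pandas as pd

# Hypothetical three-row stand-in for the fantasy football table
sample = pd.DataFrame({
    "web_name": ["Ospina", "Cech", "Koscielny"],
    "now_cost": [48, 54, 60],
    "selected_by_percent": [0.2, 5.2, 1.4],
    "total_points": [0, 63, 48],
})

# Text columns such as web_name are left out of the grid
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# The default grid has one panel for every pair of numeric columns
print(len(numeric_cols) ** 2, "panels in the default grid")
```

With a dozen numeric columns that is already 144 panels, which is why narrowing things down quickly becomes useful.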

So this is a lot of data to look at. While it is very useful, it can be quite overwhelming. Let’s use the ‘vars’ argument within pairplot to focus on a few variables.

We’ll also change our scatterplot to a regression type with ‘kind’, so that we can see the regression model that Seaborn would create if we were to use a reg plot. Now we’ll be able to better see any relationships:

In [4]:
sns.pairplot(data, vars=["now_cost","selected_by_percent","total_points"],  kind="reg")
plt.show()

That looks much more manageable! See how easy it is to create a complicated plot that tells us a lot about our data very quickly? We can now see that most players are picked by very few people, if any, and that the clearest relationship is between popularity and points – as we’d probably expect. Perhaps less predictably, the relationship between points and cost is comparatively weak.

Summary

This sets us up for a more comprehensive look at fantasy football, but hopefully this article goes to show how easy it can be to knock together an exploratory data visualisation with Seaborn’s pairplot. There are many more arguments that we could pass to improve this, from the colour (‘hue="position"’, for example, if your data has a position column), to other types of plots within our pairplot. Take a look at the docs to find out all of your options.

After your exploratory analysis, you might want to check out our describing datasets article to go further!

Posted by FCPythonADMIN in Visualisation

Football Heatmaps with Seaborn

Football heatmaps are used by in-club and media analysts to illustrate the area within which a player has been present. They might illustrate player location, or the events of a player or team, and are effectively a smoothed out scatter plot of these points. While there may be some debate as to how useful they are (they don’t tell you whether actions or movement are a good or bad thing!), they can often be very aesthetically pleasing and engaging, hence their popularity. This article will take you through loading your dataset and plotting a heatmap around x & y coordinates in Python.

Let’s get our modules imported and our data ready to go!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Arc
import seaborn as sns

%matplotlib inline

data = pd.read_csv("Data/passes.csv")

data.head()
Out[1]:
         Half  Time  Event   Player  Team  Xstart  Ystart  Xend  Yend
0  First Half     1   Pass  Wombech   USA      26      38    66    52
1  First Half     6   Pass  Wombech   USA      81      34    62    68
2  First Half     6   Pass  Wombech   USA      46      45    84    63
3  First Half     8   Pass  Wombech   USA      89      66    89    39
4  First Half     9   Pass  Wombech   USA      68      64    21    25

Plotting a heatmap

Today’s dataset showcases Wombech’s passes from her match. As you can see, we have time, player and location data. We will be looking to plot the starting X/Y coordinates of Wombech’s passes, but you would be able to do the same with any coordinates that you have – whether GPS/optical tracking coordinates, or other event data.

Python’s Seaborn module makes plotting a tidy dataset incredibly easy with ‘.kdeplot()’. Plotting it with simply the x and y coordinate columns as arguments will give you something like this:

In [2]:
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)

#Pass the columns by keyword (seaborn 0.11+); recent releases reject a positional y
sns.kdeplot(x=data["Xstart"], y=data["Ystart"])
plt.show()

Cool, we have a contour plot, which groups lines closer to each other where we have more density in our data.
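
If you want to see the raw density that the contours are smoothing, one crude alternative (a different technique, not what ‘kdeplot()’ does internally) is to count points per zone with NumPy’s ‘histogram2d()’. The coordinates below are random stand-ins on a 130 x 90 area, matching the pitch drawn later in this article:

```python
import numpy as np

# Made-up pass start points on a 130 x 90 pitch
rng = np.random.default_rng(1)
x = rng.uniform(0, 130, 200)
y = rng.uniform(0, 90, 200)

# Split the pitch into 13 x 9 zones, each 10 x 10 units, and count points
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=[13, 9], range=[[0, 130], [0, 90]]
)

print(counts.shape)        # rows follow x zones, columns follow y zones
print(int(counts.sum()))   # every point lands in exactly one zone
```

The kernel density plot answers the same question, but spreads each point out smoothly rather than dropping it into a fixed box.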

Let’s customise this with a couple of additional arguments:

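  • shade: fills the gaps between the lines to give us more of the heatmap effect that we are looking for. Pass it as the boolean True rather than a string. Newer Seaborn releases call this argument fill.
  • n_levels: draws more lines – adding lots of these will blur the lines into a heatmap. Newer Seaborn releases call this argument levels.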

Take a look at the examples below to see the differences these two arguments produce:

In [3]:
#A bare figure - plt.subplot() adds the two panels below
fig = plt.figure()
fig.set_size_inches(14,4)

#Plot one - include shade
plt.subplot(121)
sns.kdeplot(x=data["Xstart"], y=data["Ystart"], shade=True)

#Plot two - no shade, lines only
plt.subplot(122)
sns.kdeplot(x=data["Xstart"], y=data["Ystart"])

plt.show()
In [4]:
#A bare figure - plt.subplot() adds the two panels below
fig = plt.figure()
fig.set_size_inches(14,4)

#Plot One - distinct areas with few lines
plt.subplot(121)
sns.kdeplot(x=data["Xstart"], y=data["Ystart"], shade=True, n_levels=5)

#Plot Two - fade lines with more of them
plt.subplot(122)
sns.kdeplot(x=data["Xstart"], y=data["Ystart"], shade=True, n_levels=40)

plt.show()

Now that we can customise our plot as we see fit, we just need to add our pitch map. Learn more about plotting pitches here, but feel free to use this pitch map below – although you may need to change the coordinates to fit your data!

Also take note of our xlim and ylim lines – we use these to fix the axis limits to the pitch dimensions, so that the heatmap does not spill over the pitch.
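
If your coordinates come on a different scale, rescaling them is one way to make them fit this pitch. The sketch below assumes, purely for illustration, that events were recorded from 0 to 100 in both directions; the frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical events recorded on a 0-100 scale in both directions
events = pd.DataFrame({"x": [0, 50, 100], "y": [0, 50, 100]})

# Dimensions of the pitch drawn in this article
PITCH_LENGTH, PITCH_WIDTH = 130, 90

# Stretch each axis so that 100 maps onto the far touchline or goal line
events["x_pitch"] = events["x"] / 100 * PITCH_LENGTH
events["y_pitch"] = events["y"] / 100 * PITCH_WIDTH
print(events)
```

The other option is to leave the data alone and redraw the pitch lines at your data's scale, which is what the note about changing the coordinates refers to.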

In [5]:
#Create figure
fig=plt.figure()
fig.set_size_inches(7, 5)
ax=fig.add_subplot(1,1,1)

#Pitch Outline & Centre Line
plt.plot([0,0],[0,90], color="black")
plt.plot([0,130],[90,90], color="black")
plt.plot([130,130],[90,0], color="black")
plt.plot([130,0],[0,0], color="black")
plt.plot([65,65],[0,90], color="black")

#Left Penalty Area
plt.plot([16.5,16.5],[65,25],color="black")
plt.plot([0,16.5],[65,65],color="black")
plt.plot([16.5,0],[25,25],color="black")

#Right Penalty Area
plt.plot([130,113.5],[65,65],color="black")
plt.plot([113.5,113.5],[65,25],color="black")
plt.plot([113.5,130],[25,25],color="black")

#Left 6-yard Box
plt.plot([0,5.5],[54,54],color="black")
plt.plot([5.5,5.5],[54,36],color="black")
plt.plot([5.5,0],[36,36],color="black")

#Right 6-yard Box
plt.plot([130,124.5],[54,54],color="black")
plt.plot([124.5,124.5],[54,36],color="black")
plt.plot([124.5,130],[36,36],color="black")

#Prepare Circles
centreCircle = plt.Circle((65,45),9.15,color="black",fill=False)
centreSpot = plt.Circle((65,45),0.8,color="black")
leftPenSpot = plt.Circle((11,45),0.8,color="black")
rightPenSpot = plt.Circle((119,45),0.8,color="black")

#Draw Circles
ax.add_patch(centreCircle)
ax.add_patch(centreSpot)
ax.add_patch(leftPenSpot)
ax.add_patch(rightPenSpot)

#Prepare Arcs
leftArc = Arc((11,45),height=18.3,width=18.3,angle=0,theta1=310,theta2=50,color="black")
rightArc = Arc((119,45),height=18.3,width=18.3,angle=0,theta1=130,theta2=230,color="black")

#Draw Arcs
ax.add_patch(leftArc)
ax.add_patch(rightArc)

#Tidy Axes
plt.axis('off')

#Pass the columns by keyword (seaborn 0.11+); recent releases reject a positional y
sns.kdeplot(x=data["Xstart"], y=data["Ystart"], shade=True, n_levels=50)
plt.ylim(0, 90)
plt.xlim(0, 130)


#Display Pitch
plt.show()

Great work, now we can see Wombech’s pass locations as a heatmap!

Summary

Seaborn makes heatmaps a breeze – we simply use the contour plots with ‘kdeplot()’ and blur our lines to give a heatmap effect.

If using these to communicate rather than analyse, always take care. There is nothing telling you if the actions in the plot are good or bad, but we may make these inferences when discussing them. As always, be sure that what you think is being communicated is actually being communicated!

As for next steps, why not take a look at pass maps, or other parts of our visualisation series?


Boxplots in Seaborn

Boxplots are a relatively common chart type used to show distribution of numeric variables. The box itself will display the middle 50% of values, with a line showing the median value. The whiskers of the box show the highest and lowest values, excluding any outliers.
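
To make those pieces concrete, here is a small sketch that works them out by hand with NumPy. The ages are made up, and the 1.5 x IQR rule used for outliers is the usual plotting default rather than something specific to this dataset:

```python
import numpy as np

# Made-up squad ages, sorted for readability
ages = np.array([19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 38])

# The box: lower quartile, median and upper quartile
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

# Points further than 1.5 x IQR from the box are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = ages[(ages >= lower_fence) & (ages <= upper_fence)]
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", inside.min(), inside.max())
print("outliers:", outliers)
```

Here the 38-year-old sits beyond the upper fence, so the top whisker stops at 29 and the 38 is drawn as a separate point.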

This article will plot the ages of each team’s players. This should allow us to compare the age profiles of teams quite easily and spot teams with young or ageing squads.

Let’s get our modules imported along with a data frame of player information.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


data = pd.read_csv("../../Data/Violin.csv", encoding = "ISO-8859-1")

data.head()
Out[1]:
Number Player Born Age Market value Team
0 1 Junling Yan Keeper Jan 28, 1991 26 £510k Shanghai SIPG
1 22 Le Sun Keeper Sep 17, 1989 27 £43k Shanghai SIPG
2 34 Wei Chen Keeper Feb 14, 1998 19 £21k Shanghai SIPG
3 35 Xiaodong Shi Keeper Feb 26, 1997 20 £21k Shanghai SIPG
4 16 Ricardo Carvalho Centre-Back May 18, 1978 38 £340k Shanghai SIPG

Here we have a dataset of Chinese Super League players. We are looking to plot the players’ ages, grouped by their team – this will give us a box for each team.

Seaborn’s ‘.boxplot()’ will make these plots very easy. We need to give it three arguments to start with:

  • x – What are we grouping our data by? In this case, it is by teams.
  • y – What metric are we looking to learn about? For now, it is the players’ ages.
  • data – Where is our data kept?

Let’s plot our first set of boxes:

In [2]:
sns.boxplot(x="Team", y="Age", data=data)
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eaba980f0>

Very nice! Loads to improve on, but a good start! We can see the middle 50% of each team’s values, the highest and lowest values and also any outliers.

Firstly, this is a bit small, so let’s use matplotlib to resize the plot area and re-plot:

In [3]:
#Set up our plot area, then define the size
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

sns.boxplot(x="Team", y="Age", data=data)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x20eac0a7c88>

Now we can see the different shapes much more easily – but we can’t see which team is which! Let’s re-plot, but rotate the x axis labels:

In [4]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

ax = sns.boxplot(x="Team", y="Age", data=data)
plt.xticks(rotation=65)
Out[4]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

Much better! Now we can see that Changchun have a much younger squad than Guangzhou Evergrande. It is important to remember that this doesn’t tell you which is better – but this information is valuable for squad planning, player acquisition and many other areas within a team.

Let’s make this chart even better looking with club colours:

In [5]:
#Create a tuple of colours, in the order of our teams on the plot
CSLcols = ("#FF0000", "#9A050A", "#112987", "#00A4FA", "#FF6600", "#008040", "#004EA1", "#5B0CB3", "#E50211", "#FF0000", 
           "#00519A",  "#75A315", "#E70008", "#E40000", "#C80815", "#FF3300")

#Create the palette with 'sns.color_palette()' and pass our list as an argument
CSLpalette = sns.color_palette(CSLcols)

fig, ax = plt.subplots()
fig.set_size_inches(14, 5)

#Add an extra argument, our new palette
ax = sns.boxplot(x="Team", y="Age", data=data, palette = CSLpalette )
plt.xticks(rotation=65)
Out[5]:
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 <a list of 16 Text xticklabel objects>)

Great effort, that looks so much better! Now our viewers can easily pick out their own teams.
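
A list of colours only lines up if it is written in exactly the order that the teams appear on the axis. One way to make that less fragile is to keep the colours in a dictionary keyed by team and build the list from it. The pairings below are hypothetical, chosen only to show the idea:

```python
# Hypothetical team-to-colour pairings, for illustration only
team_colours = {
    "Shanghai SIPG": "#FF0000",
    "Guangzhou Evergrande": "#9A050A",
    "Changchun": "#112987",
}

# Decide the order of the boxes explicitly, here alphabetical
order = sorted(team_colours)

# Build the colour list in that same order
palette_in_order = [team_colours[team] for team in order]
print(order)
print(palette_in_order)
```

You could then pass both to the plot with ‘order=order’ and ‘palette=palette_in_order’, so that changing the order of the teams never leaves a club in the wrong colours.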

Summary

This article tells us how to use box plots to quickly plot a big dataset to compare distributions across categories. Seaborn’s ‘boxplot()’ command makes it easy to draw, then customise the plots.

One shortcoming of boxplots is that we cannot see exactly how many values there are at each point – the boxes and lines are only suggestive, and all sorts of patterns can be hiding in them. You might want to take a look at violin plots for a way of getting around this.
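
A quick numerical check can also sit alongside the plot. This sketch uses a tiny made-up frame (two teams, five players) with the same ‘Team’ and ‘Age’ column names, and counts how many values are behind each box:

```python
import pandas as pd

# Tiny made-up squad list with the same column names as the article's data
squad = pd.DataFrame({
    "Team": ["A", "A", "A", "B", "B"],
    "Age": [19, 26, 38, 24, 25],
})

# One row per team: how many players, plus the spread of their ages
summary = squad.groupby("Team")["Age"].agg(["count", "min", "median", "max"])
print(summary)
```

A box built from three players deserves a lot less trust than one built from thirty, and the count column makes that visible straight away.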

Next up, take a look at other visualisation types – or learn how to scrape data so that you can look at other leagues!


Looking for Correlations with Heatmaps in Seaborn

Note: Apologies for the table formatting in this article. They’ll be fixed soon, but for now, hopefully the code and visualisations will explain what we are learning here!

Looking for things that cause other things is one of the most common investigations into data. While correlation (a relationship between variables) does not equal cause, it will often point you in the right direction and help to aid your understanding of the relationships in your data set.

You can calculate the correlation for every variable against every other variable, but this is a lengthy and inefficient process with large amounts of data. In these cases, seaborn gives us a function to visualise correlations. We can then focus our investigations onto what is interesting from this.

Let’s get our modules imported and a dataset of player attributes ready to go, and then we can take a look at the correlations.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv("../../Data/FIFAPlayers.csv")

data.head(2)
Out[1]:
player_api_id overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_name birthday p_id height weight
0 307224 64 68 right medium low 44 63 73 49 12 12 7 11 12 Kevin Koubemba 23/03/1993 00:00 307224 193.04 198
1 512726 63 72 right medium medium 51 66 55 57 11 12 12 12 7 Yanis Mbombo Lokwa 08/04/1994 00:00 512726 177.80 172

2 rows × 44 columns

Our data has lots of columns that are not attribute ratings, so let’s .drop() these from our dataset.

In [2]:
data = data.drop(["player_api_id","preferred_foot","attacking_work_rate","defensive_work_rate","player_name","birthday",
                 "p_id","height","weight"],axis=1)

data.head(2)
Out[2]:
overall_rating potential crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
0 64 68 44 63 73 49 52 52 42 31 55 65 22 22 25 12 12 7 11 12
1 63 72 51 66 55 57 60 64 50 39 48 59 15 16 12 11 12 12 12 7

2 rows × 35 columns

Now we have 35 columns, and a row for each player.

As mentioned, we want to see the correlation between the variables. Knowing these correlations might help us to uncover relationships that help us to better understand our data in the real world.

DataFrames can calculate the correlations really easily using the ‘.corr()’ method. Let’s see what that gives us:

In [3]:
data.corr()
Out[3]:
overall_rating potential crossing finishing heading_accuracy short_passing volleys dribbling curve free_kick_accuracy vision penalties marking standing_tackle sliding_tackle gk_diving gk_handling gk_kicking gk_positioning gk_reflexes
overall_rating 1.000000 0.812783 0.289583 0.260644 0.241417 0.409090 0.298047 0.282968 0.322068 0.286211 0.415812 0.272004 0.122715 0.148644 0.126426 0.023669 0.025814 0.022068 0.024864 0.025722
potential 0.812783 1.000000 0.240445 0.247969 0.172263 0.374329 0.245980 0.318962 0.267889 0.204292 0.355629 0.217775 0.066954 0.094283 0.080880 -0.021628 -0.021715 -0.024431 -0.024717 -0.020982
crossing 0.289583 0.240445 1.000000 0.612054 0.405483 0.806751 0.657961 0.836678 0.824757 0.737226 0.683588 0.626716 0.281258 0.330608 0.315528 -0.664204 -0.660893 -0.657448 -0.664633 -0.666669
finishing 0.260644 0.247969 0.612054 1.000000 0.392763 0.624952 0.876722 0.797271 0.722722 0.666326 0.695608 0.812296 -0.239855 -0.177554 -0.218832 -0.535227 -0.528907 -0.529505 -0.535533 -0.535879
heading_accuracy 0.241417 0.172263 0.405483 0.392763 1.000000 0.581259 0.405780 0.459740 0.359974 0.334387 0.235034 0.475056 0.490770 0.515285 0.478135 -0.737197 -0.734731 -0.729859 -0.734677 -0.736483
short_passing 0.409090 0.374329 0.806751 0.624952 0.581259 1.000000 0.660882 0.825260 0.769699 0.723058 0.740271 0.664610 0.387261 0.451524 0.410343 -0.744615 -0.740532 -0.735705 -0.742899 -0.742561
volleys 0.298047 0.245980 0.657961 0.876722 0.405780 0.660882 1.000000 0.790252 0.772782 0.716469 0.711074 0.799573 -0.144672 -0.078617 -0.115459 -0.543064 -0.539224 -0.537686 -0.544100 -0.543293
dribbling 0.282968 0.318962 0.836678 0.797271 0.459740 0.825260 0.790252 1.000000 0.831146 0.731280 0.736945 0.738639 0.095979 0.159148 0.130645 -0.729865 -0.725150 -0.722204 -0.728617 -0.729229
curve 0.322068 0.267889 0.824757 0.722722 0.359974 0.769699 0.772782 0.831146 1.000000 0.838848 0.742520 0.728738 0.092416 0.154444 0.124954 -0.601724 -0.597503 -0.595797 -0.603641 -0.603702
free_kick_accuracy 0.286211 0.204292 0.737226 0.666326 0.334387 0.723058 0.716469 0.731280 0.838848 1.000000 0.715175 0.723361 0.110824 0.173130 0.136704 -0.552613 -0.548308 -0.545713 -0.552341 -0.552423
long_passing 0.390576 0.323141 0.744002 0.436469 0.459398 0.883278 0.504547 0.677695 0.688117 0.677657 0.679710 0.517182 0.492372 0.547532 0.513989 -0.616808 -0.612804 -0.607938 -0.613013 -0.614926
ball_control 0.369979 0.364856 0.829437 0.751184 0.592930 0.908914 0.761881 0.929211 0.820620 0.741906 0.740002 0.747634 0.244372 0.307811 0.269671 -0.796325 -0.791156 -0.786738 -0.793834 -0.795043
acceleration 0.172540 0.294898 0.604123 0.533289 0.175644 0.486635 0.500103 0.711492 0.550935 0.419734 0.433572 0.429630 -0.027721 0.002464 0.010518 -0.474127 -0.470427 -0.466134 -0.474372 -0.472675
sprint_speed 0.184337 0.302025 0.580925 0.506892 0.252968 0.472873 0.474557 0.678581 0.510566 0.377297 0.379296 0.410285 0.016285 0.045523 0.051699 -0.495136 -0.490773 -0.487010 -0.496836 -0.495392
agility 0.213297 0.265458 0.638343 0.581714 0.097808 0.541594 0.575744 0.730591 0.639356 0.529638 0.578416 0.491535 -0.084819 -0.046347 -0.046212 -0.416865 -0.413718 -0.412907 -0.417915 -0.418175
reactions 0.812882 0.610956 0.302747 0.289203 0.207219 0.390003 0.333682 0.287788 0.339246 0.309461 0.443083 0.292896 0.083667 0.116702 0.094353 0.004749 0.007232 0.003514 0.008106 0.006593
balance 0.090975 0.157920 0.600476 0.454938 0.036127 0.502635 0.467108 0.636712 0.572920 0.487575 0.508085 0.414281 0.021879 0.053601 0.062066 -0.420497 -0.416812 -0.411571 -0.419347 -0.420619
shot_power 0.340539 0.271335 0.693072 0.761025 0.587076 0.757514 0.779821 0.780808 0.748280 0.728748 0.650626 0.762647 0.153358 0.215236 0.173882 -0.676710 -0.674647 -0.670248 -0.677315 -0.678294
jumping 0.233181 0.151305 0.042990 0.009348 0.278075 0.079908 0.021161 0.044958 0.000967 -0.034224 -0.019001 0.028358 0.194274 0.184933 0.199548 -0.076742 -0.075963 -0.074961 -0.072947 -0.072454
stamina 0.254705 0.224281 0.639279 0.429187 0.549530 0.673812 0.450170 0.636785 0.542305 0.479625 0.456012 0.447388 0.455449 0.498886 0.476347 -0.668173 -0.667285 -0.660257 -0.666142 -0.669817
strength 0.216657 0.053890 -0.141812 -0.093178 0.464772 0.023449 -0.077781 -0.161178 -0.160488 -0.111278 -0.154167 -0.020508 0.318745 0.321753 0.289880 -0.078627 -0.077770 -0.080239 -0.079420 -0.078672
long_shots 0.322930 0.263214 0.733510 0.835431 0.426175 0.754340 0.844590 0.824242 0.822106 0.802593 0.753813 0.785956 0.016019 0.085704 0.044695 -0.598275 -0.593260 -0.591472 -0.598032 -0.597834
aggression 0.267311 0.136795 0.398475 0.138608 0.665951 0.533263 0.205226 0.326667 0.291623 0.300151 0.212868 0.262368 0.693004 0.720424 0.694169 -0.569414 -0.569710 -0.562646 -0.565521 -0.567155
interceptions 0.203556 0.113544 0.336912 -0.166087 0.485038 0.461231 -0.061924 0.157309 0.172199 0.195544 0.107490 0.017175 0.920332 0.932888 0.918318 -0.449759 -0.452921 -0.442787 -0.446243 -0.447873
positioning 0.272853 0.252169 0.745236 0.880549 0.432832 0.726891 0.846754 0.870291 0.789513 0.705545 0.749408 0.787595 -0.073362 -0.004290 -0.039903 -0.623406 -0.618075 -0.615414 -0.622558 -0.623916
vision 0.415812 0.355629 0.683588 0.695608 0.235034 0.740271 0.711074 0.736945 0.742520 0.715175 1.000000 0.660357 -0.008086 0.062337 0.022254 -0.414011 -0.409150 -0.405800 -0.413931 -0.413986
penalties 0.272004 0.217775 0.626716 0.812296 0.475056 0.664610 0.799573 0.738639 0.728738 0.723361 0.660357 1.000000 -0.049964 0.009501 -0.033576 -0.585485 -0.579903 -0.578311 -0.586871 -0.586875
marking 0.122715 0.066954 0.281258 -0.239855 0.490770 0.387261 -0.144672 0.095979 0.092416 0.110824 -0.008086 -0.049964 1.000000 0.962230 0.964033 -0.445132 -0.448957 -0.440554 -0.442125 -0.443661
standing_tackle 0.148644 0.094283 0.330608 -0.177554 0.515285 0.451524 -0.078617 0.159148 0.154444 0.173130 0.062337 0.009501 0.962230 1.000000 0.972040 -0.489806 -0.491665 -0.484632 -0.486144 -0.488189
sliding_tackle 0.126426 0.080880 0.315528 -0.218832 0.478135 0.410343 -0.115459 0.130645 0.124954 0.136704 0.022254 -0.033576 0.964033 0.972040 1.000000 -0.457093 -0.459105 -0.451598 -0.453518 -0.455968
gk_diving 0.023669 -0.021628 -0.664204 -0.535227 -0.737197 -0.744615 -0.543064 -0.729865 -0.601724 -0.552613 -0.414011 -0.585485 -0.445132 -0.489806 -0.457093 1.000000 0.965387 0.960186 0.966969 0.971412
gk_handling 0.025814 -0.021715 -0.660893 -0.528907 -0.734731 -0.740532 -0.539224 -0.725150 -0.597503 -0.548308 -0.409150 -0.579903 -0.448957 -0.491665 -0.459105 0.965387 1.000000 0.957973 0.965273 0.965426
gk_kicking 0.022068 -0.024431 -0.657448 -0.529505 -0.729859 -0.735705 -0.537686 -0.722204 -0.595797 -0.545713 -0.405800 -0.578311 -0.440554 -0.484632 -0.451598 0.960186 0.957973 1.000000 0.959491 0.960523
gk_positioning 0.024864 -0.024717 -0.664633 -0.535533 -0.734677 -0.742899 -0.544100 -0.728617 -0.603641 -0.552341 -0.413931 -0.586871 -0.442125 -0.486144 -0.453518 0.966969 0.965273 0.959491 1.000000 0.967059
gk_reflexes 0.025722 -0.020982 -0.666669 -0.535879 -0.736483 -0.742561 -0.543293 -0.729229 -0.603702 -0.552423 -0.413986 -0.586875 -0.443661 -0.488189 -0.455968 0.971412 0.965426 0.960523 0.967059 1.000000

35 rows × 35 columns

We get 35 rows and 35 columns – one of each for each variable. The values show the correlation score between the row and column at each point. Values range from 1 (a perfect positive correlation: as one goes up, the other goes up too) to -1 (a perfect negative correlation: as one goes up, the other goes down), via 0 (no linear relationship).

So looking at our table, the correlation score (proper name: Pearson’s correlation coefficient, r, which is what ‘.corr()’ calculates by default) between curve and crossing is 0.82, suggesting a strong relationship. We would expect this: if you can curve the ball, you tend to be able to cross.

Additionally, heading accuracy has no real relationship (0.17) with potential ability. So, if like me, you are awful in the air, you can still make it!
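
For anyone curious what ‘.corr()’ is doing under the hood, here is a minimal sketch that calculates Pearson’s r by hand and compares it with pandas. The six pairs of ratings are invented for the example. Note that r-squared is simply r multiplied by itself, so it is a different (and never negative) number:

```python
import numpy as np
import pandas as pd

# Six made-up players with two ratings each
ratings = pd.DataFrame({
    "curve":    [42, 50, 61, 70, 78, 85],
    "crossing": [44, 51, 58, 72, 75, 88],
})

# Pearson's r: how the two columns move together around their means,
# scaled by how much each one moves on its own
x = ratings["curve"] - ratings["curve"].mean()
y = ratings["crossing"] - ratings["crossing"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same number from pandas
r_pandas = ratings["curve"].corr(ratings["crossing"])

# r-squared is the square of r
r_squared = r_pandas ** 2

print(round(r_manual, 4), round(r_pandas, 4), round(r_squared, 4))
```

Because squaring throws away the sign, r-squared cannot tell a positive relationship from a negative one, which is why the table above is easier to read as plain r.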

Looking through lots of numbers is pretty draining – so let’s visualise this table with a ‘.heatmap()’:

In [4]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 10)

ax=sns.heatmap(data.corr())

There is a lot happening here, and we wouldn’t try to present insights with this, but we can still learn something from it.

Clearly, goalkeepers are not rated for their outfield ability! There is negative correlation between the GK skills and outfield skills – as shown by the streaks of black and purple.
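
The default colours can make it hard to tell a weak correlation from a negative one. A few optional arguments to ‘heatmap()’ help with that. The sketch below runs on random numbers with borrowed column names, so the pattern it shows is manufactured rather than real:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Random stand-in data: four columns of noise
rng = np.random.default_rng(3)
demo = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["finishing", "volleys", "marking", "gk_diving"],
)
demo["volleys"] += demo["finishing"]  # force one positive relationship

corr = demo.corr()

fig, ax = plt.subplots(figsize=(6, 5))

# Pin the colour scale to -1..1, centre it on 0 and print each value
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
            annot=True, fmt=".2f", ax=ax)
```

With a diverging colour map centred on zero, positive and negative correlations get different colours and values near zero fade towards the middle shade.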

Similarly, we can see negative correlation between strength and acceleration and agility. Got a strong player? They are unlikely to be quick or agile. If you can find one that is, they should command a decent fee due to their unique abilities!

Summary

In a page, we have been able to take a big dataset and try to ascertain relationships within it. By using ‘.corr()’ and ‘.heatmap()’ we create a numerical table and a graphical chart that easily illustrate the data.

With our example, we spotted how stronger players usually have a lack of pace and agility. Also looking at the chart above, reactions seems to be the best indicator of overall rating. Maybe being a talented player isn’t about just being quick, or scoring from 35 yards, maybe reading the game is the key!
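
Rather than reading the answer off the chart, you can ask pandas to rank the correlations for one column. This is a minimal sketch on six invented players; the numbers are made up so that reactions comes out on top, as it does in the real table:

```python
import pandas as pd

# Hypothetical ratings for six players
players = pd.DataFrame({
    "overall_rating": [60, 64, 68, 72, 76, 80],
    "reactions":      [58, 63, 66, 73, 75, 82],
    "sprint_speed":   [80, 62, 75, 66, 79, 70],
    "gk_diving":      [12, 11, 13, 10, 12, 11],
})

# Take the overall_rating column of the correlation table,
# drop its correlation with itself, and sort strongest first
ranked = (
    players.corr()["overall_rating"]
    .drop("overall_rating")
    .sort_values(ascending=False)
)
print(ranked)
```

Running the same three lines on the full dataset gives a ranked list that is much quicker to scan than the heatmap when you only care about one attribute.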

Next up, take a different look at plotting relationships between variables with scatter plots, or read up on correlation as a whole.


Scatter Plots in Seaborn

Scatter plots are fantastic visualisations for showing the relationship between variables. They plot two series of data, one across each axis, which allow for a quick look to check for any relationship.

Seaborn allows us to make really nice-looking visuals with little effort once our data is ready. Let’s get our modules and data fired up and kick off.

In [1]:
import seaborn as sns
import pandas as pd
%matplotlib inline

df = pd.read_csv("../../Data/FIFAPlayers.csv")

df.head(2)
Out[1]:
player_api_id overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing finishing heading_accuracy short_passing gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_name birthday p_id height weight
0 307224 64 68 right medium low 44 63 73 49 12 12 7 11 12 Kevin Koubemba 23/03/1993 00:00 307224 193.04 198
1 512726 63 72 right medium medium 51 66 55 57 11 12 12 12 7 Yanis Mbombo Lokwa 08/04/1994 00:00 512726 177.80 172

2 rows × 44 columns

Our data shows skill ratings across a number of attributes for lots and lots of players. In this article, we want to try and ascertain some relationships between these attributes.

Seaborn has a few ways to show scatter plots, and we'll focus on 'regplot()'. Let's start with a plot that should show a strong positive correlation - height and weight.
In [2]:
sns.regplot(x="height",y="weight",data=df)
Out[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c879492828>