Football commentary and discussion is so often based on putting players into boxes. “He’s the best defensive forward in the league”, “I’d say he’s more a 6 than a 4”, “He’s one of your Gerrards, your Lampards, your Scholses”. Understanding these roles and coming to a decision on which players are closely aligned to them is incredibly difficult.
If we wanted to take a data-led approach to grouping player performances, we could use a method called clustering – allowing us to group players based on a set of their metrics. In practice, this might allow us to overcome some biases when analysing players or uncover names that we might have previously thought of as playing in a different style.
In this tutorial, we will look at k-means clustering. We will use the algorithm to put players into different groups based on their shot-creating actions. The data for this article, provided by StatsBomb, can be found on fbref.com if you would like to follow along.
And just in case you are only here to see the player groups, the clusters are listed at the end of the article!
The process will take the following steps:
- Check and tidy dataset
- Create k-means model and assign each player into a cluster of similar players
- Describe & visualise results
Let’s get our libraries and data imported and get started.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
#Allow for full tables to be shown
pd.options.display.max_columns = None
pd.options.display.max_rows = None
data = pd.read_csv('SCA.csv')
data.head()
Check & Tidy Dataset
Looking at our dataset, we have some biographical player information, along with shot creation numbers and their type.
Right away, we have a few bits to tidy with our first three columns:
#Split the player names by the backslash ('\\' is a single escaped backslash), and use the first part
data['Player'] = data['Player'].str.split('\\', expand=True)[0]
#Split the nation names by the space, and use the second one
data['Nation'] = data['Nation'].str.split(' ', expand=True)[1]
#Some positions have 2 (e.g. MFFW), let's just use the first two letters for now
data['Pos'] = data['Pos'].str[:2]
data.head(2)
Much nicer to read now!
One more thing to consider is the effect of playing for a stronger team. As a broad assumption, we can expect players in better teams to create more shots, and players in worse teams to produce fewer.
This might produce results that group players based on their production levels, not the styles of their productions.
As such, let’s create some new columns to look at the percentages for each action type. We’ll do this by creating a sum column, then dividing each column by the sum.
#Create list of columns to sum, then assign the sum to a new column
add_list = ['Pass SCA', 'Deadball SCA', 'Dribble SCA', 'Shot SCA', 'Fouled SCA']
data['Sum SCA'] = data[add_list].sum(axis=1)
#Create our first new column
data['Pass SCA Ratio'] = data['Pass SCA']/data['Sum SCA']
data.head()
Looking good! Scroll to the right side of the table and we can see that Tammy’s shot creations were from passes nearly 70% of the time. We could manually create the remaining four, but let’s save ourselves some time and create these in a loop.
First, we’ll create the new column names in a loop. Then we will run another loop with the code that we just used to create our remaining columns.
#Create new column names by adding ' ratio' to each name in our previous list
new_cols_list = [each + ' Ratio' for each in add_list]
#For each new column name, calculate the column exactly as we did a minute ago
for idx, val in enumerate(new_cols_list):
    data[val] = data[add_list[idx]]/data['Sum SCA']
#Create a sum of the percentages to check that they all add to 1
data['Sum SCA Ratio'] = data[new_cols_list].sum(axis=1)
data.head(5)
Perfect! We have loads of decimals adding to 1 that we can consider as percentages for each type of action.
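As an aside, the same ratios can be produced without a loop by dividing all of the action columns by the row total in one go. The snippet below is only a sketch on a made-up two-row table (the `toy` dataframe and its numbers are invented for illustration), but the column names match the ones used above.

```python
import pandas as pd

# Toy stand-in for the SCA table; the numbers are invented for illustration
toy = pd.DataFrame({
    'Pass SCA': [30, 10],
    'Deadball SCA': [5, 0],
    'Dribble SCA': [10, 6],
    'Shot SCA': [3, 2],
    'Fouled SCA': [2, 2],
})
add_list = ['Pass SCA', 'Deadball SCA', 'Dribble SCA', 'Shot SCA', 'Fouled SCA']

# Divide every action column by its row total in a single step
ratios = toy[add_list].div(toy[add_list].sum(axis=1), axis=0)

# Rename the columns to match the ' Ratio' naming used above, then attach them
ratios = ratios.add_suffix(' Ratio')
toy = toy.join(ratios)

print(toy[['Pass SCA Ratio', 'Dribble SCA Ratio']])
```

Both approaches give the same numbers; the loop is arguably easier to read when you are starting out, while `div` keeps things short once you have many columns.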
One final issue to tidy… we have Adrián, a GK, in the dataset. This won’t be too useful for our shot creation profiles.
We’ll create a new dataframe that will ask for only forwards or midfielders. Also, let’s set a floor for playing time & shots created to cut out anyone with low appearance/creation numbers. These numbers are arbitrary, so feel free to change them into something more useful!
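#New dataframe where Pos == FW or MF. AND played more than 5 90s AND created more than 15 shots
#.copy() makes this an independent dataframe, so adding a column to it later won't raise a SettingWithCopyWarning
data_mffw = data[((data['Pos'] == 'FW') | (data['Pos'] == 'MF')) & (data['90s'] > 5) & (data['SCA'] > 15)].copy()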
data_mffw.head()
Create k-means model and assign each player into a cluster of similar players
Now that we are happy with our dataset, we can look to get our players clustered into groups. But first, we should discuss a bit about k-means clustering.
As simply as possible, the method splits all of our players into a number of clusters that we decide.
One way that it does this is by putting the centre of the clusters somewhere at random in our data. From here, the players are assigned a cluster based on which one they are closest to.
The cluster’s location then changes to the average of its players’ datapoints and the clusters are re-assigned. This process repeats until no players change their membership after the cluster centres move to their new average. Once this process stops, we then have our final clusters!
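If you would like to see those steps spelled out, here is a minimal sketch in plain NumPy. It is not how scikit-learn implements k-means internally; the function name `simple_kmeans`, its parameters and the six made-up points are all assumptions for illustration only.

```python
import numpy as np

def simple_kmeans(points, k, seed=0, max_iter=100):
    """Toy k-means: random starting centres, then assign/update until stable."""
    rng = np.random.default_rng(seed)
    # Start the centres on k randomly chosen data points
    centres = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Distance from every point to every centre, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop once no point changes cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centre to the average of its members (leave it alone if empty)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centres[j] = members.mean(axis=0)
    return labels, centres

# Two obvious groups of made-up points
points = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                   [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels, centres = simple_kmeans(points, k=2)
print(labels)
print(centres)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the random starting centres, so the labels themselves carry no meaning beyond "these points belong together".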
Thankfully, we are yet again standing on the shoulders of giants, and we can implement this complexity in just a few lines.
We will use the scikit-learn library that we imported at the beginning of the article. From it, we will use the KMeans class and assign a model to the variable ‘km’ below:
km = KMeans(n_clusters=5, init='random', random_state=0)
A few bits to unpack. Let’s take each of the arguments that we have given the function:
- n_clusters=5: simply, how many clusters should we create? We have chosen 5 for no particular reason. There are methods for choosing a sensible number of clusters, but those are for a more in-depth piece!
- init='random': how should we pick the starting positions of our cluster centres? We have selected at random and…
- random_state=0: is here to keep my random first clusters in the same place each time. You should remove this argument entirely if you want your analysis to use a ‘random’ that will change each time.
So nothing really complicated there, we are just asking for a KMeans model that will put our data into 5 clusters.
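On the question of how many clusters to ask for, one common heuristic is the "elbow" method: fit a model for several values of k and look for the point where the inertia (the sum of squared distances from each point to its cluster centre) stops dropping sharply. The sketch below uses invented data with three obvious groups rather than our player data, and sets `n_init=10` explicitly, which is an assumption on my part about a sensible number of restarts.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: three well-separated blobs in two dimensions
rng = np.random.default_rng(0)
blobs = np.vstack([
    rng.normal(loc=centre, scale=0.3, size=(50, 2))
    for centre in ([0, 0], [5, 5], [0, 5])
])

# Fit a model for each candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, init='random', n_init=10, random_state=0)
    model.fit(blobs)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this, the inertia should fall steeply up to k=3 and only slightly afterwards. Real data, ours included, rarely gives such a clean elbow, so treat it as a guide rather than an answer.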
We are now ready for our km model variable to fit clusters to our data.
We are going to do this against the data_mffw dataset that we created above, but only looking at the new columns that we created with percentages. Let’s do this below, and see what the output is:
y_km = km.fit_predict(data_mffw[new_cols_list])
y_km
It may not look like much, but we have an array of cluster values for our dataset! Congratulations on building a clustering model! 💪😎
The array isn’t too useful on its own, so let’s assign them to their corresponding players by adding them as a new column:
data_mffw['Cluster'] = y_km
data_mffw.head()
Describe & Visualise Results
Right at the end, we can see the cluster for each player. Let’s check out our first group and see what’s going on:
data_mffw[data_mffw['Cluster'] == 0].head()
Lots of set-piece takers in here, along with high pass numbers and low dribbles (except Maddison).
Compare it to the next cluster, which features fewer set piece takers and more creations from dribbles, shots and being fouled:
data_mffw[data_mffw['Cluster'] == 1].head()
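Reading through player rows is one way to get a feel for a cluster; another is to average the ratio columns within each cluster and compare the profiles side by side. Here is a small sketch of that idea on an invented four-player table (the `toy` dataframe, the player names and every number in it are made up):

```python
import pandas as pd

# Hypothetical mini-version of data_mffw; the values are invented
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Pass SCA Ratio': [0.60, 0.64, 0.80, 0.76],
    'Deadball SCA Ratio': [0.30, 0.26, 0.00, 0.02],
    'Dribble SCA Ratio': [0.10, 0.10, 0.20, 0.22],
    'Cluster': [0, 0, 1, 1],
})
ratio_cols = ['Pass SCA Ratio', 'Deadball SCA Ratio', 'Dribble SCA Ratio']

# Average share of each action type per cluster, plus the cluster size
profile = toy.groupby('Cluster')[ratio_cols].mean()
profile['Players'] = toy.groupby('Cluster').size()

print(profile.round(2))
```

With the real data, the equivalent would be grouping `data_mffw` by 'Cluster' and averaging the columns in `new_cols_list`.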
The rest of these clusters and all the others can be found at the end of the article. To close this out before then, it might help our understanding to visualise them.
We can do this simply with a scatter plot featuring two columns:
#We'll do this a couple of times, let's make a function
def plotClusters(xAxis, yAxis):
    #One colour per cluster; labels run from 1 to 5 while the cluster values run from 0 to 4
    colours = ['red', 'blue', 'green', 'pink', 'gold']
    for cluster, colour in enumerate(colours):
        subset = data_mffw[data_mffw['Cluster'] == cluster]
        plt.scatter(subset[xAxis], subset[yAxis], s=40, c=colour, label='Cluster ' + str(cluster + 1))
    plt.xlabel(xAxis)
    plt.ylabel(yAxis)
    plt.legend()
plotClusters('Pass SCA Ratio', 'Dribble SCA Ratio')
It might look weird having some rogue players, but remember that our model considered 5 variables, whereas the visualisation only looks at 2. Check out principal component analysis if you want to work towards a visualisation that better fits this approach!
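To give a flavour of what that might look like, here is a rough sketch using scikit-learn's `PCA` to squash five ratio columns into two components. The table is randomly generated rather than taken from our player data, and the short column names are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Invented ratio table standing in for data_mffw[new_cols_list]
rng = np.random.default_rng(0)
raw = rng.random((30, 5))
ratios = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True),
                      columns=['Pass', 'Deadball', 'Dribble', 'Shot', 'Fouled'])

# Reduce the five ratio columns to two components
pca = PCA(n_components=2)
components = pca.fit_transform(ratios)
ratios['PC1'] = components[:, 0]
ratios['PC2'] = components[:, 1]

# Share of the total variance kept by each component
print(pca.explained_variance_ratio_)
```

The two new columns could then be used as the x and y axes of a scatter plot, so that the picture reflects all five variables at once. The trade-off is that the axes no longer have a plain-English meaning.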
One final plot, let’s look at shots created per 90 against age. This kind of thing might help us to examine a player that we are replacing and look for younger players with similar contributions.
#Age vs number of shot creations per 90, split by cluster
plotClusters('SCA90', 'Age')
Conclusion
Congratulations on making it through a machine learning tutorial and building a model to cluster the attacking creativity of Premier League players!
There are plenty of things that would need tidying up, however. As one example in our analysis, taking set-pieces means your other actions are diminished in the percentages – we would need to solve that problem!
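One possible way to approach the set-piece issue, offered purely as a sketch and not something we did in this analysis, would be to calculate the ratios over open-play actions only, leaving deadballs out of the total. The two-player table and the ' Open Ratio' column names below are invented to show the idea:

```python
import pandas as pd

# Two made-up players with identical open-play output;
# only the first one takes set pieces
toy = pd.DataFrame({
    'Pass SCA': [30, 30],
    'Deadball SCA': [20, 0],
    'Dribble SCA': [10, 10],
    'Shot SCA': [5, 5],
    'Fouled SCA': [5, 5],
})
all_actions = ['Pass SCA', 'Deadball SCA', 'Dribble SCA', 'Shot SCA', 'Fouled SCA']
open_play = ['Pass SCA', 'Dribble SCA', 'Shot SCA', 'Fouled SCA']

# Ratios as calculated in the article: set pieces shrink everything else
all_ratios = toy[all_actions].div(toy[all_actions].sum(axis=1), axis=0)

# Ratios over open-play actions only: the two players now look the same
open_ratios = toy[open_play].div(toy[open_play].sum(axis=1), axis=0).add_suffix(' Open Ratio')

print(all_ratios['Pass SCA'])
print(open_ratios['Pass SCA Open Ratio'])
```

Whether that is the right fix depends on the question being asked: it hides set-piece duty entirely, which may itself be something you want to cluster on.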
I hope that you have also had some ideas on how to apply this method to both football and non-football data. Whether it is scouting for transfer targets or grouping customers, k-means can be hugely useful for you!
If you have enjoyed the tutorial, let us know on Twitter, and take a look at our other pieces on machine learning and data visualisation!
Appendix – Cluster Lists
#Set piece takers
data_mffw[data_mffw['Cluster'] == 0]
#No deadballs, passes dominate but contributions from other types. Similar to group 4 who instead feature deadballs.
data_mffw[data_mffw['Cluster'] == 1]
#Passers for the most part. So we get loads of our more defensive midfielders in here.
#Notable exceptions being some of the City players, which is pretty interesting!
data_mffw[data_mffw['Cluster'] == 2]
#Mostly passes, but high on dribbles & shots too
data_mffw[data_mffw['Cluster'] == 3]
#As with others, mostly passes, but the most evenly spread outside of that
#Essentially group 2 with some set pieces
data_mffw[data_mffw['Cluster'] == 4]