Blog

Parsing Opta F24 Files: An Introduction to XML in Python

If you want to move past aggregate data on players and teams, you probably want to start looking at match event data. Obtaining this data can be difficult, and even when you get there, it is often in an XML file, rather than a table that you might be more comfortable with. This article will take you through parsing an Opta F24 XML file into a table containing the passes within a game.

Firstly, it is probably worth understanding what an XML file is. It is simply a text file structured to hold data. It uses things called tags to explain what bits of data it holds and organises them in a structure that makes it easy to understand and use elsewhere. If you have any experience with HTML, it is very similar. Here is a simple example showing how an XML file might look to send data about managerial changes:

<team name = "Manchester United">
    <manager name = "Jose Mourinho" start = "2016/05/27" end = "2018/12/18" />
    <manager name = "Ole Gunnar Solskjær" start = "2018/12/19" />
</team>

So here we have a hierarchy of team -> manager, with attributes within each data point such as team name or start date. Hopefully it is clear to see how this standardised method of sharing information might make it so much easier to use data in our work or share data with others – especially when data have different amounts of characteristics as we see above.

However, if we want to use lots of Python’s plotting and analysis capabilities, we will likely need this data in a table – and this is what we will work towards. We’ll take the following steps:

  1. Import modules
  2. Load our XML file
  3. Explore the Opta F24 XML structure
  4. Iterate through the match events and save pass data into lists
  5. Merge these lists together into a table

Once we get through all of these steps, we’ll have a nice table with which we can do loads of plotting and analysis with!

As a quick note, this Opta feed isn’t publicly available and is largely only found in clubs, analytics companies and media organisations. Manchester City and Opta briefly released a season’s worth of data from 2011-12, although FC Python unfortunately do not have this available.

Follow along and import the modules below to get started!

In [1]:
import csv
import xml.etree.ElementTree as et
import numpy as np
import pandas as pd
from datetime import datetime as dt

Now that we’re ready to go, the first thing that we need to do is import our XML file. The XML module that we added above makes this really simple. The two lines below will take an XML file and parse it into something that we could navigate just like we do with an object by using square brackets:

In [2]:
tree = et.ElementTree(file = "yourf24XMLfile.xml")
gameFile = tree.getroot()

Opening up the XML file in a text editor, we can see that the F24 file is structured similar to below:

<Container>
    <Game>
        <Event>
            <Event Qualifiers>
            </Event Qualifiers>
        </Event>
        ...
        <Event>
            <Event Qualifiers>
            </Event Qualifiers>
        </Event>
    </Game>
</Container>

So we have a container for all the data, then a game that holds each event. Within each event, there are attributes telling us about the event as well as qualifiers giving even more information about each events.

We will get around to making sense of the events, but let’s take a look at what information we are given about the game itself. We can do this with the ‘.attrib’ method from the XML module imported earlier. Let’s check out the attributes from the first entry into our XML file (the one containing the match events). We can do this by using square brackets to select aspecific entry:

In [3]:
gameFile[0].attrib
Out[3]:
{'away_team_id': '43',
 'away_team_name': 'Manchester City',
 'competition_id': '8',
 'competition_name': 'English Barclays Premier League',
 'game_date': '2015-09-12T15:00:00',
 'home_team_id': '31',
 'home_team_name': 'Crystal Palace',
 'id': '803206',
 'matchday': '5',
 'period_1_start': '2015-09-12T15:00:12',
 'period_2_start': '2015-09-12T16:05:00',
 'season_id': '2015',
 'season_name': 'Season 2015/2016'}

Awesome, we get loads of information about the match in a dictionary, where the data is laid out with a key, then the value. For example, we can see the team names and also their Opta IDs.

These match details could be useful if we were processing lots of events at the same time and needed to differentiate them. Let’s look at a quick example of formatting a string with the details found here:

In [4]:
#Print a string with the two teams, using %s and the attrib to dynamically fill the string
print ("{} vs {}".format(gameFile[0].attrib["home_team_name"], gameFile[0].attrib["away_team_name"]))
Out[4]:
Crystal Palace vs Manchester City

Moving onto match events, we saw in the structure of the file that the match events lie within the game details tags. Let’s use another square bracket to navigate to the first event:

In [5]:
gameFile[0][0].attrib
Out[5]:
{'event_id': '1',
 'id': '1467084299',
 'last_modified': '2015-09-16T16:50:12',
 'min': '0',
 'outcome': '1',
 'period_id': '16',
 'sec': '0',
 'team_id': '43',
 'timestamp': '2015-09-12T14:00:09.141',
 'type_id': '34',
 'version': '1442418612592',
 'x': '0.0',
 'y': '0.0'}

Looking through this, there is a lot to try and get our heads around. We can see that there are event keys like min, sec, x and y – these are quite easy to understand. But the values, like outcome: 1 and event_id: 1, don’t really make much sense by themselves. This is particularly important when it comes to teams, as we only have their ID and not their name. We’ll tidy that up soon.

This is because the Opta XML uses lots of IDs rather than names. You’ll need to find documentation from Opta (although versions of this can be Googled) to find out what all of them are. But first one’s free, and our event_id: 1 is pass – as you’d probably expect for the first event from the game.

You might also remember that our events contained qualifiers. Let’s again use square brackets to pull the first one out for the event above:

In [6]:
gameFile[0][0][0].attrib
Out[6]:
{'id': '1784607081',
 'qualifier_id': '44',
 'value': '1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, 5'}

We don’t need to know what this means, but it is useful to understand the structure of the file as we will be going on to iterate through each event and qualifier to turn it into a table for further analysis.

At its simplest, all we are going to do is loop through each of the events that we have identified above, identify the passes and take the details that we want from each. These details will go into different lists for different data categories (player, team, success, etc.). We will then put these lists into a table which is then ready for analysis, plotting or exporting.

Firstly though, I’d like to have my team names come through to make this a bit more readble, while than the team ID. The events only carry the team ID, so let’s create a dictionary that will allow us to later swap the ID for the team name:

In [7]:
team_dict = {gameFile[0].attrib["home_team_id"]: gameFile[0].attrib["home_team_name"],
            gameFile[0].attrib["away_team_id"]: gameFile[0].attrib["away_team_name"]}


print(team_dict)
{'31': 'Crystal Palace', '43': 'Manchester City'}

For this tutorial, we’re simply going to take the x/y locations of the pass origin and destination, the time of the pass, the team and whether or not it was successful.

There’s so much more that we could take, such as the players, the pass length or any other details that you spot in the XML. If you’d like to pull those out too, doing so will make a great extension to this tutorial!

We’re going to start by creating the empty lists for our data:

In [8]:
#Create empty lists for the 8 columns we're collecting data for
x_origin = []
y_origin = []
x_destination = []
y_destination = []
outcome = []
minute = []
half = []
team = []

The main part of the tutorial sees us going event-by-event and adding our desired details only when the event is a pass. To do this, we will use a for loop on each event, and when the event is a pass (id = 1), we will append the correct attribute to our lists created above. Some of these details are hidden in the qualifiers, so we’ll also iterate over those to get the information needed there.

This tutorial probably isn’t the best place to go through the intricacies of the feed, so take a look at the docs if you’re interested.

Follow the code below with comments on each of the above steps:

In [9]:
#Iterate through each game in our file - we only have one
for game in gameFile:
    
    #Iterate through each event
    for event in game:
        
        #If the event is a pass (ID = 1)
        if event.attrib.get("type_id") == '1':
            
            #To the correct list, append the correct attribute using attrib.get()
            x_origin.append(event.attrib.get("x"))
            y_origin.append(event.attrib.get("y"))
            outcome.append(event.attrib.get("outcome"))
            minute.append(event.attrib.get("min"))
            half.append(event.attrib.get("period_id"))
            team.append(team_dict[event.attrib.get("team_id")])
            
            #Iterate through each qualifier 
            for qualifier in event:
                
                #If the qualifier is relevant, append the information to the x or y destination lists
                if qualifier.attrib.get("qualifier_id") == "140":
                    x_destination.append(qualifier.attrib.get("value"))
                if qualifier.attrib.get("qualifier_id") == "141":
                    y_destination.append(qualifier.attrib.get("value"))

If this has worked correctly, we should have 8 lists populated. Let’s check out the minutes list:

In [10]:
print("The list is " + str(len(minute)) + " long and the 43rd entry is " + minute[42])
The list is 956 long and the 43rd entry is 2

You can check out each list in more detail, but they should work just fine.

Our final task is to create a table for our data from our lists. To do this, we just need to create a list of our column headers, then assign the list to each one. We’ll then flip our table to make it long, rather than wide – just like you would want to see in a spreadsheet. Let’s take a look:

In [11]:
#Create a list of our 8 columns/lists
column_titles = ["team", "half", "min", "x_origin", "y_origin", "x_destination", "y_destination", "outcome"]
            
#Use pd.DataFrame to create our table, assign the data in the order of our columns and give it the column titles above
final_table = pd.DataFrame(data=[team, half, minute, x_origin, y_origin, x_destination, y_destination, outcome], index=column_titles)

#Transpose, or flip, the table. Otherwise, our table will run from left to right, rather than top to bottom
final_table = final_table.T

#Show us the top 5 rows of the table
final_table.head()
Out[11]:
team half min x_origin y_origin x_destination y_destination outcome
0 Manchester City 1 0 50.0 50.0 52.2 50.7 1
1 Manchester City 1 0 52.2 50.7 46.7 50.4 1
2 Manchester City 1 0 46.8 51.2 27.1 68.2 1
3 Manchester City 1 0 29.2 71.2 28.3 92.9 1
4 Manchester City 1 0 29.5 94.2 56.9 95.3 1

So this is great for passes, and the same logic would apply for shots, fouls or even all events at the same time – just expand on the above with the relevant IDs from the Opta docs. And analysts, if you’re still struggling to get it done, the emergency loan window is always open!

Now that we’ve taken a complex XML and parsed the passes into a table, there’s a number of things that we can do. We could put the table into a wider dataset, do some analysis of these passes, visualise straight away or just export our new table to a csv:

In [12]:
final_table.to_csv("pass_data.csv", index=False)

Heatmap – code taken from the FC Python tutorial

Heatmap from f24 feed

Passmap – full tutorial here.

Pass map from f24 feed

Summary

In this tutorial, we’ve learned a bit about XML structures and the Opta F24 XML specifically. We have seen how to import them into Python and parse them into empty lists. With these now-full lists, we have gone on to pull these into a single table. From here, it is much easier to run our analysis, plot data or do whatever else we like. The further beauty of this comes in automating your analysis for future games and giving yourself hours of time each week.

Huge credit belongs to a number of sources that helped with this piece, including Imran Khan, FC R Stats and plenty of other posts that take a look at the feed.

With your newfound data from the Opta F24 XML, why not practice your visualisation skills with the data? Check out our collection of visualisation tutorials here.

Posted by FCPythonADMIN in Blog

How much does it cost to fill the Panini World Cup album? Simulations in Python

With the World Cup just 3 months away, the best bit of the tournament build up is upon us – the Panini sticker album.

For those looking to invest in a completed album to pass onto grandchildren, just how much will you have to spend to complete it on your own? Assuming that each sticker has an equal chance of being found, this is a simple random number problem that we can recreate in Python.

This article will show you how to create a function that allows you to estimate how much you will need to spend, before you throw wads of cash at sticker boxes to end with a half-finished album. Load up pandas and numpy and let’s kick on.

In [1]:
import pandas as pd
import numpy as np

To solve this, we are going to recreate our sticker album. It will be an empty list that will take on the new stickers that we find in each pack.

We will also need a few variables to act as counters alongside this list:

  • Stickers needed
  • How many packets have we bought?
  • How many swaps do we have?

Let’s define these:

In [1]:
stickersNeeded = 682
packetsBought = 0
stickersGot = []
swapStickers = 0

Now, we need to run a simulation that will open packs, check each sticker and either add it to our album or to our swaps pile.

We will do this by running a while loop that completes once the album is full.

This loop will open a pack of 5 stickers and check whether or not it is featured in the album already. To simulate the sticker, we will simply assign it a random number within the album. If this number is already present, we add it to the swap pile. If it is a new sticker, we append it to our album list.

We will also need to update our counters for packets bought, stickers needed and swaps throughout.

Pretty simple process overall! Let’s take a look at how we implement this loop:

In [2]:
while stickersNeeded > 0:
    
        #Buy a new packet
        packetsBought += 1

        #For each sticker, do some things 
        for i in range(0,5):
            
            #Assign the sticker a random number
            stickerNumber = np.random.randint(0,681)
    
            #Check if we have the sticker
            if stickerNumber not in stickersGot:
                
                #Add it to the album, then reduce our stickers needed count
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            #Throw it into the swaps pile
            else:
                swapStickers += 1

Each time you run that, you are simulating the entire album completion process! Let’s check out the results:

In [3]:
{"Packets":packetsBought,"Swaps":swapStickers}
Out[3]:
{'Packets': 939, 'Swaps': 4013}

939 packets?! 4013 swaps?! Surely these must be outliers… let’s add all of this into one function and run it loads of times over.

As the number of stickers in a pack and the sticker total may change, let’s define these as arguments that we can change with future uses of the function:

In [4]:
def calculateAlbum(stickersInPack = 5, costOfPackp = 80, stickerTotal=682):
    stickersNeeded = stickerTotal
    packetsBought = 0
    stickersGot = []
    swapStickers = 0


    while stickersNeeded > 0:
        packetsBought += 1

        for i in range(0,stickersInPack):
            stickerNumber = np.random.randint(0,stickerTotal)

            if stickerNumber not in stickersGot:
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            else:
                swapStickers += 1

    return{"Packets":packetsBought,"Swaps":swapStickers,
           "Total Cost":(packetsBought*costOfPackp)/100}
In [5]:
calculateAlbum()
Out[5]:
{'Packets': 1017, 'Swaps': 4403, 'Total Cost': 813.6}

So our calculateAlbum function does exactly the same as our instructions before, we have just added a total cost.

Let’s run this 1000 times over and see what we can truly expect if we want to complete the album:

In [6]:
a=0
b=0
c=0

for i in range(0, 1000):
    a += calculateAlbum()["Packets"]
    b += calculateAlbum()["Swaps"]
    c += calculateAlbum()["Total Cost"]

{"Packets":a/1000,"Swaps":b/1000,"Total Cost":c/1000}
Out[6]:
{'Packets': 969.582, 'Swaps': 4197.515, 'Total Cost': 773.4824}

970 packets, over 4000 swaps and the best part of £800 on the album. I think we’re going to need some people to swap with!

Of course, as you run these arguments, you will have different answers throughout. Hopefully here, however, our numbers are quite close together.

Summary

In this article, we have seen a basic example of running simulations with random numbers to answer a question.

We followed the process of replicating the album experience and running it once, then 1000 times to get an average expectation. As with any process involving random numbers, you will get different answers each time, so through running it loads of times over, we get an average that should remove the effect of any outliers.

We also designed our simulations to take on different parameters such as number of stickers needed, stickers in a pack, etc. This allows us to use the same functions when World Cup 2022 has twice the number of stickers!

For more examples of random numbers and simulations, check out our expected goals tutorial.

Posted by FCPythonADMIN in Blog

Calling an API with Python Requests – Visualising ClubElo data

Working in Python, your data is likely to come from a number of different places – spreadsheets, databases or elsewhere. Eventually, you will find that some interesting and useful data for you will be available through a web API – a stream of data that you will need to call from, download and format for your analysis.

This article will introduce calling an API with the requests library, before formatting it into a dataframe and visualising it. Our example makes use of the fantastic work done at clubelo.com – a site that applies the elo rating system to football. Their API is easy to use and provides us with a great opportunity to learn about the processes in this article!

Let’s get our modules together and get started:

In [1]:
import requests
import csv
from io import StringIO
import pandas as pd
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

Calling an API

Downloading a dataset through an API must be complicated, surely? Of course, Python and its libraries make this as simple as possible. The requests library will do this quickly and easily with the ‘.get’ function. All we need to do is provide the api location that we want to read from. Other APIs will require authentification, but for now, we just need to provide the API address.

In [2]:
r = requests.get('http://api.clubelo.com/ManCity')

 

If you would like to run the tutorial with a different team, take a look at the instructions here and find your club on the site to find the correct name to use.

Our new ‘r’ variable contains a lot of information. It will hold the data that we will analyse, the address that we called from and a status code to let us know if it worked or not. Let’s check our status code:

In [3]:
r.status_code
Out[3]:
200

There are dozens of status codes, which you can find here, but we are hoping for a 200 code, telling us that the call went through as planned.

Now that we know that our request has made its way back, let’s check out what the API gives us with .text applied to the request (we have shortened the export dramatically, but it carries on as you see below):

In [4]:
r.text
Out[4]:

‘Rank,Club,Country,Level,Elo,From,To\n
None,Man City,ENG,2,1365.06604004,1946-07-07,1946-09-04\n

We’re given a load of text that, if you read carefully, is separated by commas and ‘\n’. Hopefully you recognise that this could be a CSV file!

 

Formatting our request data

We need to turn this into a spreadsheet-style dataframe in order to do anything with it. We will do this in two steps, firstly assigning this text to a readable csv variable with the StringIO library. We can then use Pandas to turn it into a dataframe. Check out how below:

In [5]:
data = StringIO(r.text)
df = pd.read_csv(data, sep=",")

df.head()
Out[5]:
Rank Club Country Level Elo From To
0 None Man City ENG 2 1365.066040 1946-07-07 1946-09-04
1 None Man City ENG 2 1372.480469 1946-09-05 1946-09-07
2 None Man City ENG 2 1369.613770 1946-09-08 1946-09-14
3 None Man City ENG 2 1383.733887 1946-09-15 1946-09-18
4 None Man City ENG 2 1385.578369 1946-09-19 1946-09-21

Awesome, we have a dataframe that we can analyse and visualise! One more thing that we need to format are the date columns. By default, they are strings of text and we need to reformat them to utilise the date functionality in our analysis.

Pandas makes this easy with the .to_datetime() function. Let’s reassign the from and to columns with this:

In [6]:
df.From = pd.to_datetime(df['From'])
df.To = pd.to_datetime(df['To'])

Visualising the data

The most obvious visualisation of this data is the journey that a team’s elo rating has taken.

As we have created our date columns, we can use matplotlib’s plot_date to easily create a time series chart. Let’s fire one off with our data that we’ve already set up:

In [7]:
#Set the visual style of the chart with Seaborn Set the size of our chart with matplotlib
sns.set_style("dark")
plt.figure(num=None, figsize=(10, 4), dpi=80)

#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("Manchester City - Elo Rating", loc="left", fontsize=15)

#Display the chart
plt.show()
Manchester City Elo Chart

 

 

 

 

 

 

 

 

 

And let’s change a couple of the style options with matplotlib to tidy this up a bit. Hopefully you can figure out how we have changed the background colour, text style and size by reading through the two pieces of code.

In [8]:
#Set the visual style of the chart with Seaborn Set the size of our chart with matplotlib
sns.set_style("dark")
fig = plt.figure(num=None, figsize=(15, 5), dpi=600)
axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
fig.patch.set_facecolor('#edeeef')


#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("   Manchester City - Elo Rating", loc="left", fontsize=18, fontname="Arial Rounded MT Bold")

#Display the chart
plt.show()
Manchester City Elo Evolution

Now this is a lot of code to piece through and do group-by-group, so let’s create a function to do it all in one go. Try and read through it carefully, matching it to the steps above.

In [9]:
def plotClub(team, colour = "dimgray"):
    r = requests.get('http://api.clubelo.com/' + str(team))
    data = StringIO(r.text)
    df = pd.read_csv(data, sep=",")
    
    df.From = pd.to_datetime(df['From'])
    df.To = pd.to_datetime(df['To'])
    
    sns.set_style("dark")
    fig = plt.figure(num=None, figsize=(12, 4), dpi=600)
    axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
    fig.patch.set_facecolor('#edeeef')    
    plt.plot_date(df.From, df.Elo,'-', color = colour)
    plt.title("    " + str(team) + " - Elo Rating", loc="left",  fontsize=18, fontname="Arial Rounded MT Bold")
    plt.show()

And let’s give it a go…

In [10]:
plotClub("RBLeipzig", "red")
 RB Leipzig Elo evolution

So we’re now calling, tidying and plotting our request in one go! Great work! Can you create a plot that compares two teams? Take a look through the matplotlib documentation to learn more about customising these plots too!

Of course, repeatedly calling an API is bad practice, so perhaps work on calling the data and storing it locally instead of making the same request over and over.

Summary

Being able to call data and structure it for analysis is a crucial skill to pick up and develop. This article has introduced the topic with the readily available and easy-to-utilise API available from clubelo. We owe them a thank you for their help and permission in putting this piece together!

To develop here, you should work on calling from APIs and storing the data for later use in a user-friendly format. Take a look at other sport and non-sport APIs and get practicing!

If you would like to learn more about formatting our charts like we have done above, take a look through some rules and code for better visualisations in Python.

Posted by FCPythonADMIN in Blog

Scraping Lists Through Transfermarkt and Saving Images

In this tutorial, we’ll be looking to develop our scraping knowledge beyond just lifting text from a single page. Following through the article, you’ll learn how to scrape links from a page and iterate through them to take information from each link, speeding up the process of creating new datasets. We will also run through how to identify and download images, creating a database of every player in the Premier League’s picture. This should save 10 minutes a week for anyone searching in Google Images to decorate their pre-match presentations!

This tutorial builds on the first article in our scraping series, so it is strongly recommended that you understand the concepts there before starting here.

Let’s import our modules and get started. Requests and BeautifulSoup will be recognised from last time, but os.path might be new. Os.path allows us to manipulate and utilise the operating system file structure, while basename gives us the ability to change and add file names – we’ll need this to give our pictures a proper name.

In [1]:
import requests
from bs4 import BeautifulSoup
from os.path  import basename

Our aim is to extract a picture of every player in the Premier League. We have identified Transfermarkt as our target, given that each player page should have a picture. Our secondary aim is to run this in one piece of code and not to run a new command for each player or team individually. To do this, we need to follow this process:

1) Locate a list of teams in the league with links to a squad list – then save these links

2) Run through each squad list link and save the link to each player’s page

3) Locate the player’s image and save it to our local computer

For what seems to be a massive task, we can distill it down to three main tasks. Below, we’ll break each one down.

Firstly, however, we need to set our headers to give the appearance of a human user when we call for data from Transfermarkt.

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

The Premier League page is the obvious place to start. As you can see, each team name is a link through to the squad page.

All that we need to do is process the page with BeautifulSoup (check the first article for more details) and identify the team links with ‘soup.select()’ with the links’ css selectors. These links should be added to a list for later.

Finally, we append these links to the transfermarkt domain so that we can call them on their own.

Check out the annotated code below for detailed instructions:

In [3]:
#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    

So we now have 20 team links, with each looking like this:

In [4]:
teamLinks[14]
Out[4]:
'https://www.transfermarkt.co.uk/leicester-city/startseite/verein/1003/saison_id/2017'

We will now iterate through each of these team links and do the same thing, only this time we are taking player links and not squad links. Take a look through the code below, but you’ll notice that it is very similar to the last chunk of instructions – the key difference being that we will run it within a loop to go through all 20 teams in one go.

In [5]:
#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    #Download and process the team page
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all links
    links = soup.select("a.spielprofil_tooltip")
    
    #For each link, extract the location that it is pointing to
    for j in range(len(links)):
        playerLinks.append(links[j].get("href"))

    #Add the location to the end of the transfermarkt domain to make it ready to scrape
    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

    #The page list the players more than once - let's use list(set(XXX)) to remove the duplicates
    playerLinks = list(set(playerLinks))

Locate and save each player’s image

We now have a lot of links for players…

In [6]:
len(playerLinks)
Out[6]:
526

526 links, in fact! We now need to iterate through each of these links and save the player’s picture.

Hopefully you should now be comfortable with the process to download and process a webpage, but the second part of this step will need some unpacking – locating the image and saving it.

Once again, we are locating elements in the page. When we try to identify the correct image on the page, it seems that the best way to do this is through the ‘title’ attribute – which is the player’s name. It is ridiculous for us to manually enter the name for each one, so we need to find this elsewhere on the page. Fortunately, it is easy to find this as it is the only ‘h1’ elemnt.

Subsequently, we assign this name to the name variable, then use it to call the correct image.

When we call the image, we actually need to call the location where the image is saved on the website’s server. We do this by calling for the image’s source. The source contains some extra information that we don’t need, so we use .split() to isolate the information that we do need and save that to our ‘src’ variable.

The final thing to do is to save the image from this source location. We do this by opening a new file named after the player, then save the content from source to the new file. Incredibly, Python does this in just two lines. All images will be saved into the folder that your Python notebook or file is saved.

Try and follow through the code below with these instructions:

In [7]:
for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)

This will take a couple of minutes to run, as we have 526 images to find and save. However, this short investment of time will save you 10 minutes each week in finding these pictures. Additionally, just change the link from the Premier League table to apply the code to any other league (assuming Transfermarkt is laid out in the same way!).

Your folder should now look something like this:

Images scraped from Transfermarkt

Summary

The aim of this article is to demonstrate two things. Firstly, how to collect links from a page and loop through them to further automate scraping. We have seen two examples of this – collect team and player links. It is clear to see how taking a bigger approach to scraping and understanding a website’s structure, we can collect information en masse, saving lots of time in the future.

Secondly, how to collect and save images. This article explains that images are saved on the website’s server, and we must locate where they are and save them from this location. Python makes this idea simple in execution as we can save from a location in just two lines. Also, by combining this with our iterations through players and times, we can save 526 pictures in a matter of minutes!

For further development, you may want to expand the data that you collect from each player, apply this logic to different sites, or even learn about navigating through your files to save players in team folders.

For your next FC Python course, why not take a look at our visualisation tutorials?

Posted by FCPythonADMIN in Blog, Scraping

Introduction to Scraping Data from Transfermarkt

Before starting the article, I’m obliged to mention that web scraping is a grey area legally and ethicaly in lots of circumstances. Please consider the positive and negative effects of what you scrape before doing so!

Warning over. Web scraping is a hugely powerful tool that, when done properly, can give you access to huge, clean data sources to power your analysis. The applications are just about endless for anyone interested in data. As a professional analyst, you can scrape fixtures and line-up data from around the world every day to plan scouting assignments or alert you to youth players breaking through. As an amateur analyst, it is quite likely to be your only source of data for analysis.

This tutorial is just an introduction for Python scraping. It will take you through the basic process of loading a page, locating information and retrieving it. Combine the knowledge on this page with for loops to cycle through a site and HTML knowledge to understand a web page, and you’ll be armed with just about any data you can find.

Let’s fire up our modules & get started. We’ll need requests (to access and process web pages with Python) and beautifulsoup (to make sense of the code that makes up the pages) so make sure you have these installed.

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd

Our process for extracting data is going to go something like this:

  1. Load the webpage containing the data.
  2. Locate the data within the page and extract it.
  3. Organise the data into a dataframe

For this example, we are going to take the player names and values for the most expensive players in a particular year. You can find the page that we’ll use here.

The following sections will run through each of these steps individually.

Load the webpage containing the data

The very first thing that we are going to do is create a variable called ‘headers’ and assign it a string that will tell the website that we are a browser, and not a scraping tool. In short, we’ll be blocked if we are thought to be scraping!

Next, we have three lines. The first one assigns the address that we want to scrape to a variable called ‘page’.

The second uses the requests library to grab the code of the page and assign it to ‘pageTree’. We use our headers variable here to inform the site that we are pretending to be a human browser.

Finally, the BeautifulSoup module parses the website code into html. We will then be able to search through this for the data that we want to extract. This is saved to ‘pageSoup’, and you can find all three lines here:

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Locate the data within a page & extract it

To fully appreciate what we are doing here, you probably need a basic grasp of HTML – the language that structures a webpage. As simply as I can put it for this article, HTML is made up of elements, like a paragraph or a link, that tell the browser what to render. For scraping, we will use this information to tell our program what information to take.

Take another look at the page we are scraping. We want two things – the player name and the transfer value.

The player name is a link. This is denoted as an ‘a’ tag in HTML, so we will use the ‘find_all’ function to look for all of the a tags in the page. However, there are obviously lots of links! Fortunately, we can use the class given to the players’ names specifically on this page to only take these ones – the class name is passed to the ‘find_all’ function as a dictionary.

This function will return a list with all elements that match our criteria.

If you’re curious, classes are usually used to apply styles (such as colour or border) to elements in HTML.

The code to extract the players names is here:

In [3]:
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})

#Let's look at the first name in the Players list.
Players[0].text
Out[3]:
'Luís Figo'

Looks like that works! Now let’s take the values.

As you can see on the page, the values are not a link, so we need to find a new feature to identify them by.

They are in a table cell, denoted by ‘td’ in HTML, so let’s look for that. The class to highlight these cells specifically is ‘rechts hauptlink’, as you’ll see below.

Let’s assign this to Values and check Figo’s transfer value:

In [4]:
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})

Values[0].text
Out[4]:
'£54.00m'

That’s a lot of money! Even in today’s market! But according to the page, our data is correct. Now all we need to do is process the data into a dataframe for further analysis or to save for use elsewhere.

Organise the data into a dataframe

This is pretty simple, we know that there are 25 players in the list, so let’s use a for loop to add the first 25 players and value to new lists (to ensure that no stragglers elsewhere in the page jump on). With these new lists, we’ll just create a new dataframe with them:

In [5]:
PlayersList = []
ValuesList = []

for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
    
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.head()
Out[5]:
Players Values
0 Luís Figo £54.00m
1 Hernán Crespo £51.13m
2 Marc Overmars £36.00m
3 Gabriel Batistuta £32.54m
4 Nicolas Anelka £31.05m

And now we have a dataframe with our scraped data, pretty much ready for analysis!

Summary

This article has gone through the absolute basics of scraping, we can now load a page, identify elements that we want to scrape and then process them into a dataframe.

There is more that we need to do to scrape efficiently though. Firstly, we can apply a for loop to the whole program above, changing the initial webpage name slightly to scrape the next year – I’ll let you figure out how!

You will also need to understand more about HTML, particularly class and ID selectors, to get the most out of scraping. Regardless, if you’ve followed along and understand what we’ve achieved and how, then you’re in a good place to apply this to other pages.

The techniques in this article gave the data for our joyplots tutorial, why not take a read of that next?

Posted by FCPythonADMIN in Blog, Scraping

Making Better Python Visualisations

FC Python recently received a tweet from @fitbawnumbers applying our lollipop chart code to Pep’s win percentage. It was great to see this application of the chart, and especially interesting because Philip then followed up with another chart showing the same data from Excel. To be blunt, the Excel chart was much cleaner/better than our lollipop charts – Philip had done a great job with it.

This has inspired us to put together a post exploring some of matplotlib’s customisation options and principles that underpin them. Hopefully this will give us a better looking and more engaging chart!

As a reminder, this is the chart that we are dealing with improving, and you can find the tutorial for lollipop charts here.

Step One – Remove everything that adds nothing

There is clearly lots that we can improve on here. Let’s start with the basics – if you can remove something without damaging your message, remove it. We have lots of ugly lines here, let’s remove the box needlessly around our data, along with those ticks. Likewise the axes labels, we know that the y axis shows teams – so let’s bin that too. We’ll do this with the following code:

In [ ]:
#For every side of the box, set to invisible

for side in ['right','left','top','bottom']:
    ax.spines[side].set_visible(False)
    
#Remove the ticks on the x and y axes

for tic in ax.xaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

for tic in ax.yaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

Step Two – Where appropriate, change the defaults

Philip’s Excel chart looked great because it didn’t look like an Excel chart. He had changed all of the defaults: the colours, the font, the label location. Subsequently, it doesn’t look like the charts that have bored us to death in presentations for decades. So let’s change our title locations and fonts to make it look like we’ve put some effort in beyond the defaults. Code below:

In [ ]:
#Change font
plt.rcParams["font.family"] = "DejaVu Sans"

#Instead of use plt.title, we'll use plt.text to fully customise it
#First two arguments are x/y location
plt.text(55, 19,"Premier League 16/17", size=18, fontweight="normal")

Step Three – Add labels if they are clean and give detail

While the lollipop chart makes it easy to understand the differences between teams, our orignal chart requires users to look all the way down if they want the value. Even then, the audience has to make a rough estimation. Why not add values to make everything a bit cleaner?

We can easily iterate through our values in the dataframe and plot them alongside the charts. The code below uses ‘enumerate()’ to count through each of the values in the points column of our table. For each value, it writes text at location v,i (nudged a bit with the sums below). Take a look at the for loop:

In [ ]:
for i, v in enumerate(table['Pts']):
    ax.text(v+2, i+0.8, str(v), color=teamColours[i], size = 13)

Step Four – Improve aesthetics with strong colour against off-white background

Our lollipop sticks are very, very thin. We can improve the look of these by giving them a decent thickness and a block of bold colour. Underneath this colour, we should add an off-white colour. This differentiates the plot from the rest of the page, and makes it look a lot more professional. Next time you see a great plot, take note of the base colour and try to understand the effect that this has on the plot and article as a whole!

Our code for doing these two things is below:

In [ ]:
#Set a linewidth in our hlines argument
plt.hlines(y=np.arange(1,21),xmin=0,xmax=table['Pts'],color=teamColours,linewidths=10)

#Set a background colour to the data area background and the plot as a whole
ax.set_facecolor('#f7f4f4')
fig.patch.set_facecolor('#f7f4f4')

Fitting it all together

Putting all of these steps together, we get something like the following. Follow along with the comments and see what fits in where:

In [1]:
#Set our plot and desired size
fig = plt.figure(figsize=(10,7))
ax = plt.subplot()

#Change our font
plt.rcParams["font.family"] = "DejaVu Sans"

#Each value is the hex code for the team's colours, in order of our chart
teamColours = ['#034694','#001C58','#5CBFEB','#D00027',
              '#EF0107','#DA020E','#274488','#ED1A3B',
               '#000000','#091453','#60223B','#0053A0',
               '#E03A3E','#1B458F','#000000','#53162f',
               '#FBEE23','#EF6610','#C92520','#BA1F1A']

#Plot our thicker lines and team names
plt.hlines(y=np.arange(1,21),xmin=0,xmax=table['Pts'],color=teamColours,linewidths=10)
plt.yticks(np.arange(1,21), table['Team'])

#Label our axes as needed and title the plot
plt.xlabel("Points")
plt.text(55, 19,"Premier League 16/17", size=18, fontweight="normal")

#Add the background colour
ax.set_facecolor('#f7f4f4')
fig.patch.set_facecolor('#f7f4f4')

for side in ['right','left','top','bottom']:
    ax.spines[side].set_visible(False)

for tic in ax.xaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

for tic in ax.yaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False
    
for i, v in enumerate(table['Pts']):
    ax.text(v+2, i+0.8, str(v), color=teamColours[i], size = 13)

plt.show()

Without doubt, this is a much better looking chart than the lollipop. Not only does it look better, but it gives us more information and communicates better than our former effort. Thank you Philip for the inspiration!

Summary

This article has looked at a few ways to tidy our charts. The rules that we introduced throughout should be applied to any visualisation that you’re looking to communicate with. Ensure that your charts are as clean as possible, are labelled and stray away from defaults. Follow these, and you’ll be well on your way to creating great plots!

Why not apply these rules to some of the other basic examples in our visualisation series and let us know how you improve on our articles!

Posted by FCPythonADMIN in Blog, Visualisation

Scraping Twitter with Tweepy and Python

Part of Twitter’s draw is the vast number of voices offering their opinions and thoughts on the latest events. In this article, we are going to look at the Tweepy module to show how we can search for a term used in tweets and return the thoughts of people talking about that topic. We’ll then look to make sense of them crudely by drawing a word cloud to show popular terms.

We’ll need the Tweepy and Wordcloud modules installed for this, so let’s fire these up alongside matplotlib.

In [1]:
import tweepy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

First up, you will need to get yourself keys to tap into Twitter’s API. These are freely available if you have a regular account from here.

When you have them, follow the code below to plug into the API. I’ve hidden my tokens and secrets, and would strongly recommend that you do too if you share any code!

Tweepy kindly handles all of the lifting here, you just need to provide it with your information:

In [2]:
access_token = "HIDDEN"
access_token_secret = "HIDDEN"
consumer_key = "HIDDEN"
consumer_secret = "HIDDEN"


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

So we are looking to collect tweets on a particular term. Fortunately, Tweepy makes this pretty easy for us with its ‘Cursor’ function. The principle of Tweepy’s cursor is just like the one of your screen, it goes through tweets in Twitter’s API and does what we tell it to when it finds something. It does this to work through the vast ‘pages’ of tweets that run through Twitter every second.

In our example, we are going to create a function that takes our query and returns the 1000 most recent tweets that contain the query. We are then going to turn them into a string, tidy the string and return it. Follow the commented code below to learn how:

In [3]:
#Define a function that will take our search query, a limit of 1000 tweets by default, default to english language
#and allow us to pass a list of words to remove from the string
def tweetSearch(query, limit = 1000, language = "en", remove = []):
    
    #Create a blank variable
    text = ""
    
    #Iterate through Twitter using Tweepy to find our query in our language, with our defined limit
    #For every tweet that has our query, add it to our text holder in lower case
    for tweet in tweepy.Cursor(api.search, q=query, lang=language).items(limit):
        text += tweet.text.lower()
    
    #Twitter has lots of links, we need to remove the common parts of links to clean our data
    #Firstly, create a list of terms that we want to remove. This contains https & co, alongside any words in our remove list
    removeWords = ["https","co"]
    removeWords += remove
    
    #For each word in our removeWords list, replace it with nothing in our main text - deleting it
    for word in removeWords:
        text = text.replace(word, "")
    
    #return our clean text
    return text

With that all set up, let’s give it a spin with Arsenal’s biggest stories of the window so far. Hopefully we can get our finger on the pulse of what is happening with new signing Mkhitaryan & potential Gooner Aubameyang. Let’s run the command to get the text, then plot it in a wordcloud:

In [ ]:
#Generate our text with our new function
#Remove all mentions of the name itself, as this will obviously be the most common!
Mkhitaryan = tweetSearch("Mkhitaryan", remove = ["mkhitaryan"])
In [5]:
#Create the wordcloud with the text created above
wordcloud = WordCloud().generate(Mkhitaryan)

#Plot the text with the lines below
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Lots of club propaganda about how a player has always dreamt of playing for their new club?! We probably didn’t need a new function to tell us that!

And let’s do the same to learn a bit more about what the Twitter hivemind currently thinks about Aubameyang:

In [7]:
Auba = tweetSearch("Aubameyang")
In [8]:
wordcloud = WordCloud().generate(Auba)

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Equally predictably, we have “Sky Sources” talking about a bid in excess of a figure. Usual phraseology that we would expect in the build up to a transfer. I wish we had something unexpected and more imaginative, but at least we know we are getting something accurate. Hopefully you can find something more useful!

Summary

As you already know, Twitter is a huge collective of voices. On their own, this is white noise, but we can be smart about picking out terms and trying to understand the underlying opinions and individual voices. In this example, we have looked at the news on a new signing and potential signing and can see the usual story that the media puts across for players in these scenarios.

Alternative uses could be to run this during a match for crowd-sourced player ratings… or getting opinions on an awful new badge that a club has just released! We also don’t need word clouds for developing this, and you should look at language processing for some incredibly smart things that you can use to understand the sentiment in these messages.

You might also want to take a look at the docs to customise your wordclouds.

Next up – take a look through our other visualisation tutorials that you might also apply here.

Posted by FCPythonADMIN in Blog

Creating Personal Football Heatmaps in Python

Tracking technology has been a part of football analysis for the past 20 years, giving access to data on physical performance and heat map visualisations that show how far and wide a player covers. As this technology becomes cheaper and more accessible, it has now become easy for anyone to get this data on their Sunday morning games. This article runs through how you can create your own heatmaps for a game, with nothing more than a GPS tracking device (running watch, phone, gps unit) and Python.

To get your hands on your own data, you can extract your gpx file through Strava. While Strava is great for runs, it isn’t built for football or running in tight spaces. So let’s build our own!

Let’s import our necessary modules and data, then get started!

In [1]:
#GPXPY makes using .gpx files really easy
import gpxpy

#Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Opens our .gpx file, then parses it into a format that is easy for us to run through
gpx_file = open('5aside.gpx', 'r')
gpx = gpxpy.parse(gpx_file)

The .gpx file type, put simply, is a markup file that records the time and your location on each line. With location and time, we can calculate distance between locations and, subsequently, speed. We can also visualise this data, as we’ll show here.

Let’s take a look at what one of these lines looks like:

In [2]:
gpx.tracks[0].segments[0].points[0]
Out[2]:
GPXTrackPoint(51.5505, -0.3048, elevation=44, time=datetime.datetime(2018, 1, 19, 12, 14, 26))

The first two points are our latitude and longitude, alongside elevation and time. This gives us a lot of freedom to calculate variables and plot our data, and is the foundation of a lot of the advanced metrics that you will find on Strava.

In our example, we want to plot our latitude and longitude, so let’s use a for loop to add these to a list:

In [3]:
lat = []
lon = []

for track in gpx.tracks:
    for segment in track.segments:
        for point in segment.points:
            lat.append(point.latitude)
            lon.append(point.longitude)

Our location is now extraceted into a handy x and y format….let’s plot it. We’ve borrowed Andy Kee‘s Strava plotting aesthetic here, take a read of his article for more information on plotting your cycle/run data!

In [4]:
fig = plt.figure(facecolor = '0.1')
ax = plt.Axes(fig, [0., 0., 1., 1.], )
ax.set_aspect('equal')
ax.set_axis_off()
fig.add_axes(ax)
plt.plot(lon, lat, color = 'deepskyblue', lw = 0.3, alpha = 0.9)
plt.show()

The lines are great, and make for a beautiful plot, but let’s try and create a Prozone-esque heatmap on our pitch.

To do this, we can plot on the actual pitch that we played on, using the gmplot module. GM stands for Google Maps, and will import its functionality for our plot. Let’s take a look at how this works:

In [5]:
#Import the module first
import gmplot

#Start an instance of our map, with three arguments: lat/lon centre point of map - in this case,
#We'll use the first location in our data. The last argument is the default zoom level of the map
gmap = gmplot.GoogleMapPlotter(lat[0], lon[0], 20)

#Create our heatmap using our lat/lon lists for x and y coordinates
gmap.heatmap(lat, lon)

#Draw our map and save it to the html file named in the argument
gmap.draw("Player1.html")

This code will spit out a html file, that we can then open to get our heatmap plotted on a Google Maps background. Something like the below:

 Football heatmap created in Python

Summary

Similar visualisations of professional football matches set clubs and leagues back a pretty penny, and you can do this with entirely free software and increasingly affordable kit. While this won’t improve FC Python’s exceedingly poor on-pitch performances, we definitely think it is pretty cool!

Simply export your gpx data from Strava and extract the lat/long data, before plotting it as a line or as a heatmap on a map background for some really engaging visualisation.

Next up, learn about plotting this on a pitchmap, rather than satellite imagery.

Posted by FCPythonADMIN in Blog

Calculating ‘per 90’ with Python and Fantasy Football

When we are comparing data between players, it is very important that we standardise their data to ensure that each player has the same ‘opportunity’ to show their worth. The simplest way for us to do this, is to ensure that all players have the same amount of time within which to play. One popular way of doing this in football is to create ‘per 90’ values. This means that we will change our total amounts of goals, shots, etc. to show how many a player will do every 90 minutes of football that they play. This article will run through creating per 90 figures in Python by applying them to fantasy football points and data.

Follow the examples along below and feel free to use them where you are. Let’s get started by importing our modules and taking a look at our data set.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("../Data/Fantasy_Football.csv")
data.head()
Out[1]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_saved penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0 0.0 1 1
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 0 1 0 53 4 419 42.7 1 1
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0 0.0 1 1
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 0 3 0 0 14 421 62.5 2 1
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 0 2 77 15.7 2 1

5 rows × 26 columns

Our data has a host of data on our players’ fantasy football performance. We have their names, of course, and also their points and contributing factors (goals, clean sheets, etc.). Crucially, we have the players’ minutes played – allowing us to calculate their per 90 figures for the other variables.

Calculating our per 90 numbers is reasonably simple, we just need to find out how many 90 minute periods our player has played, then divide the variable by this value. The function below will show this step-by-step and show Kane’s goals p90 in the Premier League at the time of writing (goals = 20, minutes = 1868):

In [2]:
def p90_Calculator(variable_value, minutes_played):
    
    ninety_minute_periods = minutes_played/90
    
    p90_value = variable_value/ninety_minute_periods
    
    return p90_value

p90_Calculator(20, 1868)
Out[2]:
0.9635974304068522

There we go, Kane scores 0.96 goals per 90 in the Premier League! Our code, while explanatory is three lines long, when it can all be in one line. Let’s try again, and check that we get the same value:

In [3]:
def p90_Calculator(value, minutes):
    return value/(minutes/90)

p90_Calculator(20, 1868)
Out[3]:
0.9635974304068522

Great job! The code has the same result, in a third of the lines, and I still think it is fairly easy to understand.

Next up, we need to apply this to our dataset. Pandas makes this easy, as we can simply call a new column, and run our command with existing columns as arguments:

In [4]:
data["total_points_p90"] = p90_Calculator(data.total_points,data.minutes)
data.total_points_p90.fillna(0, inplace=True)
data.head()
Out[4]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 1 0 53 4 419 42.7 1 1 3.652174
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 3 0 0 14 421 62.5 2 1 4.288401
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 2 77 15.7 2 1 3.846154

5 rows × 27 columns

And there we have a total points per 90 column, which will hopefully offer some more insight than a simple points total. Let’s sort our values and view the top 5 players:

In [5]:
data.sort_values(by='total_points_p90', ascending =False).head()
Out[5]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
271 Tuanzebe 1 Axel Tuanzebe 38 39 0 1.7 1 1.0 0 0 0 0 0 3 0.0 2 12 90.0
322 Sims 20 Joshua Sims 39 43 0 0.1 1 1.0 0 0 0 0 0 3 0.0 3 14 90.0
394 Janssen 6 Vincent Janssen 9 74 0 0.1 1 1.0 0 0 0 0 0 2 0.0 4 17 90.0
166 Hefele 38 Michael Hefele 44 42 0 0.1 1 1.0 0 0 0 0 0 4 0.4 2 8 90.0
585 Silva 13 Adrien Sebastian Perruchet Silva 14 60 0 0.0 1 1.0 0 0 0 0 0 5 0.3 3 9 22.5

5 rows × 27 columns

Huh, probably not what we expected here… players with 1 point, and some surprisng names too. Upon further examination, these players suffer from their sample size. They’ve played very few minutes, so their numbers get overly inflated… there’s obviously no way a player gets that many points per 90!

Let’s set a minimum time played to our data to eliminate players without a big enough sample:

In [6]:
data.sort_values(by='total_points_p90', ascending =False)[data.minutes>400].head(10)[["web_name","total_points_p90"]]
Out[6]:
web_name total_points_p90
233 Salah 9.629408
279 Martial 8.927126
246 Sterling 8.378721
225 Coutinho 8.358882
325 Austin 8.003356
278 Lingard 7.951807
544 Niasse 7.460317
256 Agüero 7.346939
389 Son 7.288503
255 Bernardo Silva 7.119403

That seems a bit more like it! We’ve got some of the highest scoring players here, like Salah and Sterling, but if Austin, Lingard and Bernardo Silva can nail down long-term starting spots, we should certainly keep an eye on adding them in!

Let’s go back over this by creating a new column for goals per 90 and finding the top 10:

In [7]:
data["goals_p90"] = p90_Calculator(data.goals_scored,data.minutes)
data.goals_p90.fillna(0, inplace=True)
data.sort_values(by='goals_p90', ascending =False)[data.minutes>400].head(10)[["web_name","goals_p90"]]
Out[7]:
web_name goals_p90
233 Salah 0.968320
393 Kane 0.967222
325 Austin 0.906040
256 Agüero 0.823364
246 Sterling 0.797973
544 Niasse 0.793651
279 Martial 0.728745
258 Jesus 0.714995
278 Lingard 0.632530
160 Rooney 0.630252

Great job! Hopefully you can see that this is a much fairer way to rate our player data – whether for performance, fantasy football or media reporting purposes.

Summary

p90 data is a fundamental concept of football analytics. It is one of the first steps of cleaning our data and making it fit for comparisons. This article has shown how we can apply the concept quickly and easily to our data. For next steps, you might want to take a look at visualising this data, or looking at further analysis techniques.

Posted by FCPythonADMIN in Blog