Blog

How much does it cost to fill the Panini World Cup album? Simulations in Python

With the World Cup just 3 months away, the best bit of the tournament build up is upon us – the Panini sticker album.

For those looking to invest in a completed album to pass onto grandchildren, just how much will you have to spend to complete it on your own? Assuming that each sticker has an equal chance of being found, this is a simple random number problem that we can recreate in Python.

This article will show you how to create a function that allows you to estimate how much you will need to spend, before you throw wads of cash at sticker boxes to end with a half-finished album. Load up pandas and numpy and let’s kick on.

In [1]:
import pandas as pd
import numpy as np

To solve this, we are going to recreate our sticker album. It will be an empty list that will take on the new stickers that we find in each pack.

We will also need a few variables to act as counters alongside this list:

  • Stickers needed
  • How many packets have we bought?
  • How many swaps do we have?

Let’s define these:

In [1]:
stickersNeeded = 682
packetsBought = 0
stickersGot = []
swapStickers = 0

Now, we need to run a simulation that will open packs, check each sticker and either add it to our album or to our swaps pile.

We will do this by running a while loop that completes once the album is full.

This loop will open a pack of 5 stickers and check whether or not it is featured in the album already. To simulate the sticker, we will simply assign it a random number within the album. If this number is already present, we add it to the swap pile. If it is a new sticker, we append it to our album list.

We will also need to update our counters for packets bought, stickers needed and swaps throughout.

Pretty simple process overall! Let’s take a look at how we implement this loop:

In [2]:
while stickersNeeded > 0:
    
        #Buy a new packet
        packetsBought += 1

        #For each sticker, do some things 
        for i in range(0,5):
            
            #Assign the sticker a random number
            stickerNumber = np.random.randint(0,681)
    
            #Check if we have the sticker
            if stickerNumber not in stickersGot:
                
                #Add it to the album, then reduce our stickers needed count
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            #Throw it into the swaps pile
            else:
                swapStickers += 1

Each time you run that, you are simulating the entire album completion process! Let’s check out the results:

In [3]:
{"Packets":packetsBought,"Swaps":swapStickers}
Out[3]:
{'Packets': 939, 'Swaps': 4013}

939 packets?! 4013 swaps?! Surely these must be outliers… let’s add all of this into one function and run it loads of times over.

As the number of stickers in a pack and the sticker total may change, let’s define these as arguments that we can change with future uses of the function:

In [4]:
def calculateAlbum(stickersInPack = 5, costOfPackp = 80, stickerTotal=682):
    stickersNeeded = stickerTotal
    packetsBought = 0
    stickersGot = []
    swapStickers = 0


    while stickersNeeded > 0:
        packetsBought += 1

        for i in range(0,stickersInPack):
            stickerNumber = np.random.randint(0,stickerTotal)

            if stickerNumber not in stickersGot:
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            else:
                swapStickers += 1

    return{"Packets":packetsBought,"Swaps":swapStickers,
           "Total Cost":(packetsBought*costOfPackp)/100}
In [5]:
calculateAlbum()
Out[5]:
{'Packets': 1017, 'Swaps': 4403, 'Total Cost': 813.6}

So our calculateAlbum function does exactly the same as our instructions before, we have just added a total cost.

Let’s run this 1000 times over and see what we can truly expect if we want to complete the album:

In [6]:
a=0
b=0
c=0

for i in range(0, 1000):
    a += calculateAlbum()["Packets"]
    b += calculateAlbum()["Swaps"]
    c += calculateAlbum()["Total Cost"]

{"Packets":a/1000,"Swaps":b/1000,"Total Cost":c/1000}
Out[6]:
{'Packets': 969.582, 'Swaps': 4197.515, 'Total Cost': 773.4824}

970 packets, over 4000 swaps and the best part of £800 on the album. I think we’re going to need some people to swap with!

Of course, as you run these arguments, you will have different answers throughout. Hopefully here, however, our numbers are quite close together.

Summary

In this article, we have seen a basic example of running simulations with random numbers to answer a question.

We followed the process of replicating the album experience and running it once, then 1000 times to get an average expectation. As with any process involving random numbers, you will get different answers each time, so through running it loads of times over, we get an average that should remove the effect of any outliers.

We also designed our simulations to take on different parameters such as number of stickers needed, stickers in a pack, etc. This allows us to use the same functions when World Cup 2022 has twice the number of stickers!

For more examples of random numbers and simulations, check out our expected goals tutorial.

Posted by FCPythonADMIN in Blog

Calling an API with Python Requests – Visualising ClubElo data

Working in Python, your data is likely to come from a number of different places – spreadsheets, databases or elsewhere. Eventually, you will find that some interesting and useful data for you will be available through a web API – a stream of data that you will need to call from, download and format for your analysis.

This article will introduce calling an API with the requests library, before formatting it into a dataframe and visualising it. Our example makes use of the fantastic work done at clubelo.com – a site that applies the elo rating system to football. Their API is easy to use and provides us with a great opportunity to learn about the processes in this article!

Let’s get our modules together and get started:

In [1]:
import requests
import csv
from io import StringIO
import pandas as pd
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

Calling an API

Downloading a dataset through an API must be complicated, surely? Of course, Python and its libraries make this as simple as possible. The requests library will do this quickly and easily with the ‘.get’ function. All we need to do is provide the api location that we want to read from. Other APIs will require authentification, but for now, we just need to provide the API address.

In [2]:
r = requests.get('http://api.clubelo.com/ManCity')

 

If you would like to run the tutorial with a different team, take a look at the instructions here and find your club on the site to find the correct name to use.

Our new ‘r’ variable contains a lot of information. It will hold the data that we will analyse, the address that we called from and a status code to let us know if it worked or not. Let’s check our status code:

In [3]:
r.status_code
Out[3]:
200

There are dozens of status codes, which you can find here, but we are hoping for a 200 code, telling us that the call went through as planned.

Now that we know that our request has made its way back, let’s check out what the API gives us with .text applied to the request (we have shortened the export dramatically, but it carries on as you see below):

In [4]:
r.text
Out[4]:

‘Rank,Club,Country,Level,Elo,From,To\n
None,Man City,ENG,2,1365.06604004,1946-07-07,1946-09-04\n

We’re given a load of text that, if you read carefully, is separated by commas and ‘\n’. Hopefully you recognise that this could be a CSV file!

 

Formatting our request data

We need to turn this into a spreadsheet-style dataframe in order to do anything with it. We will do this in two steps, firstly assigning this text to a readable csv variable with the StringIO library. We can then use Pandas to turn it into a dataframe. Check out how below:

In [5]:
data = StringIO(r.text)
df = pd.read_csv(data, sep=",")

df.head()
Out[5]:
Rank Club Country Level Elo From To
0 None Man City ENG 2 1365.066040 1946-07-07 1946-09-04
1 None Man City ENG 2 1372.480469 1946-09-05 1946-09-07
2 None Man City ENG 2 1369.613770 1946-09-08 1946-09-14
3 None Man City ENG 2 1383.733887 1946-09-15 1946-09-18
4 None Man City ENG 2 1385.578369 1946-09-19 1946-09-21

Awesome, we have a dataframe that we can analyse and visualise! One more thing that we need to format are the date columns. By default, they are strings of text and we need to reformat them to utilise the date functionality in our analysis.

Pandas makes this easy with the .to_datetime() function. Let’s reassign the from and to columns with this:

In [6]:
df.From = pd.to_datetime(df['From'])
df.To = pd.to_datetime(df['To'])

Visualising the data

The most obvious visualisation of this data is the journey that a team’s elo rating has taken.

As we have created our date columns, we can use matplotlib’s plot_date to easily create a time series chart. Let’s fire one off with our data that we’ve already set up:

In [7]:
#Set the visual style of the chart with Seaborn Set the size of our chart with matplotlib
sns.set_style("dark")
plt.figure(num=None, figsize=(10, 4), dpi=80)

#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("Manchester City - Elo Rating", loc="left", fontsize=15)

#Display the chart
plt.show()
Manchester City Elo Chart

 

 

 

 

 

 

 

 

 

And let’s change a couple of the style options with matplotlib to tidy this up a bit. Hopefully you can figure out how we have changed the background colour, text style and size by reading through the two pieces of code.

In [8]:
#Set the visual style of the chart with Seaborn Set the size of our chart with matplotlib
sns.set_style("dark")
fig = plt.figure(num=None, figsize=(15, 5), dpi=600)
axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
fig.patch.set_facecolor('#edeeef')


#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("   Manchester City - Elo Rating", loc="left", fontsize=18, fontname="Arial Rounded MT Bold")

#Display the chart
plt.show()
Manchester City Elo Evolution

Now this is a lot of code to piece through and do group-by-group, so let’s create a function to do it all in one go. Try and read through it carefully, matching it to the steps above.

In [9]:
def plotClub(team, colour = "dimgray"):
    r = requests.get('http://api.clubelo.com/' + str(team))
    data = StringIO(r.text)
    df = pd.read_csv(data, sep=",")
    
    df.From = pd.to_datetime(df['From'])
    df.To = pd.to_datetime(df['To'])
    
    sns.set_style("dark")
    fig = plt.figure(num=None, figsize=(12, 4), dpi=600)
    axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
    fig.patch.set_facecolor('#edeeef')    
    plt.plot_date(df.From, df.Elo,'-', color = colour)
    plt.title("    " + str(team) + " - Elo Rating", loc="left",  fontsize=18, fontname="Arial Rounded MT Bold")
    plt.show()

And let’s give it a go…

In [10]:
plotClub("RBLeipzig", "red")
 RB Leipzig Elo evolution

So we’re now calling, tidying and plotting our request in one go! Great work! Can you create a plot that compares two teams? Take a look through the matplotlib documentation to learn more about customising these plots too!

Of course, repeatedly calling an API is bad practice, so perhaps work on calling the data and storing it locally instead of making the same request over and over.

Summary

Being able to call data and structure it for analysis is a crucial skill to pick up and develop. This article has introduced the topic with the readily available and easy-to-utilise API available from clubelo. We owe them a thank you for their help and permission in putting this piece together!

To develop here, you should work on calling from APIs and storing the data for later use in a user-friendly format. Take a look at other sport and non-sport APIs and get practicing!

If you would like to learn more about formatting our charts like we have done above, take a look through some rules and code for better visualisations in Python.

Posted by FCPythonADMIN in Blog

Scraping Lists Through Transfermarkt and Saving Images

In this tutorial, we’ll be looking to develop our scraping knowledge beyond just lifting text from a single page. Following through the article, you’ll learn how to scrape links from a page and iterate through them to take information from each link, speeding up the process of creating new datasets. We will also run through how to identify and download images, creating a database of every player in the Premier League’s picture. This should save 10 minutes a week for anyone searching in Google Images to decorate their pre-match presentations!

This tutorial builds on the first article in our scraping series, so it is strongly recommended that you understand the concepts there before starting here.

Let’s import our modules and get started. Requests and BeautifulSoup will be recognised from last time, but os.path might be new. Os.path allows us to manipulate and utilise the operating system file structure, while basename gives us the ability to change and add file names – we’ll need this to give our pictures a proper name.

In [1]:
import requests
from bs4 import BeautifulSoup
from os.path  import basename

Our aim is to extract a picture of every player in the Premier League. We have identified Transfermarkt as our target, given that each player page should have a picture. Our secondary aim is to run this in one piece of code and not to run a new command for each player or team individually. To do this, we need to follow this process:

1) Locate a list of teams in the league with links to a squad list – then save these links

2) Run through each squad list link and save the link to each player’s page

3) Locate the player’s image and save it to our local computer

For what seems to be a massive task, we can distill it down to three main tasks. Below, we’ll break each one down.

Firstly, however, we need to set our headers to give the appearance of a human user when we call for data from Transfermarkt.

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

The Premier League page is the obvious place to start. As you can see, each team name is a link through to the squad page.

All that we need to do is process the page with BeautifulSoup (check the first article for more details) and identify the team links with ‘soup.select()’ with the links’ css selectors. These links should be added to a list for later.

Finally, we append these links to the transfermarkt domain so that we can call them on their own.

Check out the annotated code below for detailed instructions:

In [3]:
#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    

So we now have 20 team links, with each looking like this:

In [4]:
teamLinks[14]
Out[4]:
'https://www.transfermarkt.co.uk/leicester-city/startseite/verein/1003/saison_id/2017'

We will now iterate through each of these team links and do the same thing, only this time we are taking player links and not squad links. Take a look through the code below, but you’ll notice that it is very similar to the last chunk of instructions – the key difference being that we will run it within a loop to go through all 20 teams in one go.

In [5]:
#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    #Download and process the team page
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all links
    links = soup.select("a.spielprofil_tooltip")
    
    #For each link, extract the location that it is pointing to
    for j in range(len(links)):
        playerLinks.append(links[j].get("href"))

    #Add the location to the end of the transfermarkt domain to make it ready to scrape
    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

    #The page list the players more than once - let's use list(set(XXX)) to remove the duplicates
    playerLinks = list(set(playerLinks))

Locate and save each player’s image

We now have a lot of links for players…

In [6]:
len(playerLinks)
Out[6]:
526

526 links, in fact! We now need to iterate through each of these links and save the player’s picture.

Hopefully you should now be comfortable with the process to download and process a webpage, but the second part of this step will need some unpacking – locating the image and saving it.

Once again, we are locating elements in the page. When we try to identify the correct image on the page, it seems that the best way to do this is through the ‘title’ attribute – which is the player’s name. It is ridiculous for us to manually enter the name for each one, so we need to find this elsewhere on the page. Fortunately, it is easy to find this as it is the only ‘h1’ elemnt.

Subsequently, we assign this name to the name variable, then use it to call the correct image.

When we call the image, we actually need to call the location where the image is saved on the website’s server. We do this by calling for the image’s source. The source contains some extra information that we don’t need, so we use .split() to isolate the information that we do need and save that to our ‘src’ variable.

The final thing to do is to save the image from this source location. We do this by opening a new file named after the player, then save the content from source to the new file. Incredibly, Python does this in just two lines. All images will be saved into the folder that your Python notebook or file is saved.

Try and follow through the code below with these instructions:

In [7]:
for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)

This will take a couple of minutes to run, as we have 526 images to find and save. However, this short investment of time will save you 10 minutes each week in finding these pictures. Additionally, just change the link from the Premier League table to apply the code to any other league (assuming Transfermarkt is laid out in the same way!).

Your folder should now look something like this:

Images scraped from Transfermarkt

Summary

The aim of this article is to demonstrate two things. Firstly, how to collect links from a page and loop through them to further automate scraping. We have seen two examples of this – collect team and player links. It is clear to see how taking a bigger approach to scraping and understanding a website’s structure, we can collect information en masse, saving lots of time in the future.

Secondly, how to collect and save images. This article explains that images are saved on the website’s server, and we must locate where they are and save them from this location. Python makes this idea simple in execution as we can save from a location in just two lines. Also, by combining this with our iterations through players and times, we can save 526 pictures in a matter of minutes!

For further development, you may want to expand the data that you collect from each player, apply this logic to different sites, or even learn about navigating through your files to save players in team folders.

For your next FC Python course, why not take a look at our visualisation tutorials?

Posted by FCPythonADMIN in Blog, Scraping

Introduction to Scraping Data from Transfermarkt

Before starting the article, I’m obliged to mention that web scraping is a grey area legally and ethicaly in lots of circumstances. Please consider the positive and negative effects of what you scrape before doing so!

Warning over. Web scraping is a hugely powerful tool that, when done properly, can give you access to huge, clean data sources to power your analysis. The applications are just about endless for anyone interested in data. As a professional analyst, you can scrape fixtures and line-up data from around the world every day to plan scouting assignments or alert you to youth players breaking through. As an amateur analyst, it is quite likely to be your only source of data for analysis.

This tutorial is just an introduction for Python scraping. It will take you through the basic process of loading a page, locating information and retrieving it. Combine the knowledge on this page with for loops to cycle through a site and HTML knowledge to understand a web page, and you’ll be armed with just about any data you can find.

Let’s fire up our modules & get started. We’ll need requests (to access and process web pages with Python) and beautifulsoup (to make sense of the code that makes up the pages) so make sure you have these installed.

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd

Our process for extracting data is going to go something like this:

  1. Load the webpage containing the data.
  2. Locate the data within the page and extract it.
  3. Organise the data into a dataframe

For this example, we are going to take the player names and values for the most expensive players in a particular year. You can find the page that we’ll use here.

The following sections will run through each of these steps individually.

Load the webpage containing the data

The very first thing that we are going to do is create a variable called ‘headers’ and assign it a string that will tell the website that we are a browser, and not a scraping tool. In short, we’ll be blocked if we are thought to be scraping!

Next, we have three lines. The first one assigns the address that we want to scrape to a variable called ‘page’.

The second uses the requests library to grab the code of the page and assign it to ‘pageTree’. We use our headers variable here to inform the site that we are pretending to be a human browser.

Finally, the BeautifulSoup module parses the website code into html. We will then be able to search through this for the data that we want to extract. This is saved to ‘pageSoup’, and you can find all three lines here:

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Locate the data within a page & extract it

To fully appreciate what we are doing here, you probably need a basic grasp of HTML – the language that structures a webpage. As simply as I can put it for this article, HTML is made up of elements, like a paragraph or a link, that tell the browser what to render. For scraping, we will use this information to tell our program what information to take.

Take another look at the page we are scraping. We want two things – the player name and the transfer value.

The player name is a link. This is denoted as an ‘a’ tag in HTML, so we will use the ‘find_all’ function to look for all of the a tags in the page. However, there are obviously lots of links! Fortunately, we can use the class given to the players’ names specifically on this page to only take these ones – the class name is passed to the ‘find_all’ function as a dictionary.

This function will return a list with all elements that match our criteria.

If you’re curious, classes are usually used to apply styles (such as colour or border) to elements in HTML.

The code to extract the players names is here:

In [3]:
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})

#Let's look at the first name in the Players list.
Players[0].text
Out[3]:
'Luís Figo'

Looks like that works! Now let’s take the values.

As you can see on the page, the values are not a link, so we need to find a new feature to identify them by.

They are in a table cell, denoted by ‘td’ in HTML, so let’s look for that. The class to highlight these cells specifically is ‘rechts hauptlink’, as you’ll see below.

Let’s assign this to Values and check Figo’s transfer value:

In [4]:
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})

Values[0].text
Out[4]:
'£54.00m'

That’s a lot of money! Even in today’s market! But according to the page, our data is correct. Now all we need to do is process the data into a dataframe for further analysis or to save for use elsewhere.

Organise the data into a dataframe

This is pretty simple, we know that there are 25 players in the list, so let’s use a for loop to add the first 25 players and value to new lists (to ensure that no stragglers elsewhere in the page jump on). With these new lists, we’ll just create a new dataframe with them:

In [5]:
PlayersList = []
ValuesList = []

for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
    
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.head()
Out[5]:
Players Values
0 Luís Figo £54.00m
1 Hernán Crespo £51.13m
2 Marc Overmars £36.00m
3 Gabriel Batistuta £32.54m
4 Nicolas Anelka £31.05m

And now we have a dataframe with our scraped data, pretty much ready for analysis!

Summary

This article has gone through the absolute basics of scraping, we can now load a page, identify elements that we want to scrape and then process them into a dataframe.

There is more that we need to do to scrape efficiently though. Firstly, we can apply a for loop to the whole program above, changing the initial webpage name slightly to scrape the next year – I’ll let you figure out how!

You will also need to understand more about HTML, particularly class and ID selectors, to get the most out of scraping. Regardless, if you’ve followed along and understand what we’ve achieved and how, then you’re in a good place to apply this to other pages.

The techniques in this article gave the data for our joyplots tutorial, why not take a read of that next?

Posted by FCPythonADMIN in Blog, Scraping

Making Better Python Visualisations

FC Python recently received a tweet from @fitbawnumbers applying our lollipop chart code to Pep’s win percentage. It was great to see this application of the chart, and especially interesting because Philip then followed up with another chart showing the same data from Excel. To be blunt, the Excel chart was much cleaner/better than our lollipop charts – Philip had done a great job with it.

This has inspired us to put together a post exploring some of matplotlib’s customisation options and principles that underpin them. Hopefully this will give us a better looking and more engaging chart!

As a reminder, this is the chart that we are dealing with improving, and you can find the tutorial for lollipop charts here.

Step One – Remove everything that adds nothing

There is clearly lots that we can improve on here. Let’s start with the basics – if you can remove something without damaging your message, remove it. We have lots of ugly lines here, let’s remove the box needlessly around our data, along with those ticks. Likewise the axes labels, we know that the y axis shows teams – so let’s bin that too. We’ll do this with the following code:

In [ ]:
#For every side of the box, set to invisible

for side in ['right','left','top','bottom']:
    ax.spines[side].set_visible(False)
    
#Remove the ticks on the x and y axes

for tic in ax.xaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

for tic in ax.yaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

Step Two – Where appropriate, change the defaults

Philip’s Excel chart looked great because it didn’t look like an Excel chart. He had changed all of the defaults: the colours, the font, the label location. Subsequently, it doesn’t look like the charts that have bored us to death in presentations for decades. So let’s change our title locations and fonts to make it look like we’ve put some effort in beyond the defaults. Code below:

In [ ]:
#Change font
plt.rcParams["font.family"] = "DejaVu Sans"

#Instead of use plt.title, we'll use plt.text to fully customise it
#First two arguments are x/y location
plt.text(55, 19,"Premier League 16/17", size=18, fontweight="normal")

Step Three – Add labels if they are clean and give detail

While the lollipop chart makes it easy to understand the differences between teams, our orignal chart requires users to look all the way down if they want the value. Even then, the audience has to make a rough estimation. Why not add values to make everything a bit cleaner?

We can easily iterate through our values in the dataframe and plot them alongside the charts. The code below uses ‘enumerate()’ to count through each of the values in the points column of our table. For each value, it writes text at location v,i (nudged a bit with the sums below). Take a look at the for loop:

In [ ]:
for i, v in enumerate(table['Pts']):
    ax.text(v+2, i+0.8, str(v), color=teamColours[i], size = 13)

Step Four – Improve aesthetics with strong colour against off-white background

Our lollipop sticks are very, very thin. We can improve the look of these by giving them a decent thickness and a block of bold colour. Underneath this colour, we should add an off-white colour. This differentiates the plot from the rest of the page, and makes it look a lot more professional. Next time you see a great plot, take note of the base colour and try to understand the effect that this has on the plot and article as a whole!

Our code for doing these two things is below:

In [ ]:
#Set a linewidth in our hlines argument
plt.hlines(y=np.arange(1,21),xmin=0,xmax=table['Pts'],color=teamColours,linewidths=10)

#Set a background colour to the data area background and the plot as a whole
ax.set_facecolor('#f7f4f4')
fig.patch.set_facecolor('#f7f4f4')

Fitting it all together

Putting all of these steps together, we get something like the following. Follow along with the comments and see what fits in where:

In [1]:
#Set our plot and desired size
fig = plt.figure(figsize=(10,7))
ax = plt.subplot()

#Change our font
plt.rcParams["font.family"] = "DejaVu Sans"

#Each value is the hex code for the team's colours, in order of our chart
teamColours = ['#034694','#001C58','#5CBFEB','#D00027',
              '#EF0107','#DA020E','#274488','#ED1A3B',
               '#000000','#091453','#60223B','#0053A0',
               '#E03A3E','#1B458F','#000000','#53162f',
               '#FBEE23','#EF6610','#C92520','#BA1F1A']

#Plot our thicker lines and team names
plt.hlines(y=np.arange(1,21),xmin=0,xmax=table['Pts'],color=teamColours,linewidths=10)
plt.yticks(np.arange(1,21), table['Team'])

#Label our axes as needed and title the plot
plt.xlabel("Points")
plt.text(55, 19,"Premier League 16/17", size=18, fontweight="normal")

#Add the background colour
ax.set_facecolor('#f7f4f4')
fig.patch.set_facecolor('#f7f4f4')

for side in ['right','left','top','bottom']:
    ax.spines[side].set_visible(False)

for tic in ax.xaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False

for tic in ax.yaxis.get_major_ticks():
    tic.tick1On = tic.tick2On = False
    
for i, v in enumerate(table['Pts']):
    ax.text(v+2, i+0.8, str(v), color=teamColours[i], size = 13)

plt.show()

Without doubt, this is a much better looking chart than the lollipop. Not only does it look better, but it gives us more information and communicates better than our former effort. Thank you Philip for the inspiration!

Summary

This article has looked at a few ways to tidy our charts. The rules that we introduced throughout should be applied to any visualisation that you’re looking to communicate with. Ensure that your charts are as clean as possible, are labelled and stray away from defaults. Follow these, and you’ll be well on your way to creating great plots!

Why not apply these rules to some of the other basic examples in our visualisation series and let us know how you improve on our articles!

Posted by FCPythonADMIN in Blog, Visualisation

Scraping Twitter with Tweepy and Python

Part of Twitter’s draw is the vast number of voices offering their opinions and thoughts on the latest events. In this article, we are going to look at the Tweepy module to show how we can search for a term used in tweets and return the thoughts of people talking about that topic. We’ll then look to make sense of them crudely by drawing a word cloud to show popular terms.

We’ll need the Tweepy and Wordcloud modules installed for this, so let’s fire these up alongside matplotlib.

In [1]:
import tweepy
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

First up, you will need to get yourself keys to tap into Twitter’s API. These are freely available if you have a regular account from here.

When you have them, follow the code below to plug into the API. I’ve hidden my tokens and secrets, and would strongly recommend that you do too if you share any code!

Tweepy kindly handles all of the lifting here, you just need to provide it with your information:

In [2]:
access_token = "HIDDEN"
access_token_secret = "HIDDEN"
consumer_key = "HIDDEN"
consumer_secret = "HIDDEN"


auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

So we are looking to collect tweets on a particular term. Fortunately, Tweepy makes this pretty easy for us with its ‘Cursor’ function. The principle of Tweepy’s cursor is just like the one of your screen, it goes through tweets in Twitter’s API and does what we tell it to when it finds something. It does this to work through the vast ‘pages’ of tweets that run through Twitter every second.

In our example, we are going to create a function that takes our query and returns the 1000 most recent tweets that contain the query. We are then going to turn them into a string, tidy the string and return it. Follow the commented code below to learn how:

In [3]:
#Define a function that will take our search query, a limit of 1000 tweets by default, default to english language
#and allow us to pass a list of words to remove from the string
def tweetSearch(query, limit = 1000, language = "en", remove = []):
    
    #Create a blank variable
    text = ""
    
    #Iterate through Twitter using Tweepy to find our query in our language, with our defined limit
    #For every tweet that has our query, add it to our text holder in lower case
    for tweet in tweepy.Cursor(api.search, q=query, lang=language).items(limit):
        text += tweet.text.lower()
    
    #Twitter has lots of links, we need to remove the common parts of links to clean our data
    #Firstly, create a list of terms that we want to remove. This contains https & co, alongside any words in our remove list
    removeWords = ["https","co"]
    removeWords += remove
    
    #For each word in our removeWords list, replace it with nothing in our main text - deleting it
    for word in removeWords:
        text = text.replace(word, "")
    
    #return our clean text
    return text

With that all set up, let’s give it a spin with Arsenal’s biggest stories of the window so far. Hopefully we can get our finger on the pulse of what is happening with new signing Mkhitaryan & potential Gooner Aubameyang. Let’s run the command to get the text, then plot it in a wordcloud:

In [ ]:
#Generate our text with our new function
#Remove all mentions of the name itself, as this will obviously be the most common!
Mkhitaryan = tweetSearch("Mkhitaryan", remove = ["mkhitaryan"])
In [5]:
#Create the wordcloud with the text created above
wordcloud = WordCloud().generate(Mkhitaryan)

#Plot the text with the lines below
plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Lots of club propaganda about how a player has always dreamt of playing for their new club?! We probably didn’t need a new function to tell us that!

And let’s do the same to learn a bit more about what the Twitter hivemind currently thinks about Aubameyang:

In [7]:
Auba = tweetSearch("Aubameyang")
In [8]:
wordcloud = WordCloud().generate(Auba)

plt.figure(figsize=(12,6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Equally predictably, we have “Sky Sources” talking about a bid in excess of a figure. Usual phraseology that we would expect in the build up to a transfer. I wish we had something unexpected and more imaginative, but at least we know we are getting something accurate. Hopefully you can find something more useful!

Summary

As you already know, Twitter is a huge collective of voices. On their own, this is white noise, but we can be smart about picking out terms and trying to understand the underlying opinions and individual voices. In this example, we have looked at the news on a new signing and potential signing and can see the usual story that the media puts across for players in these scenarios.

Alternative uses could be to run this during a match for crowd-sourced player ratings… or getting opinions on an awful new badge that a club has just released! We also don’t need word clouds for developing this, and you should look at language processing for some incredibly smart things that you can use to understand the sentiment in these messages.

You might also want to take a look at the docs to customise your wordclouds.

Next up – take a look through our other visualisation tutorials that you might also apply here.

Posted by FCPythonADMIN in Blog

Creating Personal Football Heatmaps in Python

Tracking technology has been a part of football analysis for the past 20 years, giving access to data on physical performance and heat map visualisations that show how far and wide a player covers. As this technology becomes cheaper and more accessible, it has now become easy for anyone to get this data on their Sunday morning games. This article runs through how you can create your own heatmaps for a game, with nothing more than a GPS tracking device (running watch, phone, gps unit) and Python.

To get your hands on your own data, you can extract your gpx file through Strava. While Strava is great for runs, it isn’t built for football or running in tight spaces. So let’s build our own!

Let’s import our necessary modules and data, then get started!

In [1]:
#GPXPY makes using .gpx files really easy
import gpxpy

#Visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Opens our .gpx file, then parses it into a format that is easy for us to run through
gpx_file = open('5aside.gpx', 'r')
gpx = gpxpy.parse(gpx_file)

The .gpx file type, put simply, is a markup file that records the time and your location on each line. With location and time, we can calculate distance between locations and, subsequently, speed. We can also visualise this data, as we’ll show here.

Let’s take a look at what one of these lines looks like:

In [2]:
gpx.tracks[0].segments[0].points[0]
Out[2]:
GPXTrackPoint(51.5505, -0.3048, elevation=44, time=datetime.datetime(2018, 1, 19, 12, 14, 26))

The first two points are our latitude and longitude, alongside elevation and time. This gives us a lot of freedom to calculate variables and plot our data, and is the foundation of a lot of the advanced metrics that you will find on Strava.

In our example, we want to plot our latitude and longitude, so let’s use a for loop to add these to a list:

In [3]:
lat = []
lon = []

for track in gpx.tracks:
    for segment in track.segments:
        for point in segment.points:
            lat.append(point.latitude)
            lon.append(point.longitude)

Our location is now extraceted into a handy x and y format….let’s plot it. We’ve borrowed Andy Kee‘s Strava plotting aesthetic here, take a read of his article for more information on plotting your cycle/run data!

In [4]:
fig = plt.figure(facecolor = '0.1')
ax = plt.Axes(fig, [0., 0., 1., 1.], )
ax.set_aspect('equal')
ax.set_axis_off()
fig.add_axes(ax)
plt.plot(lon, lat, color = 'deepskyblue', lw = 0.3, alpha = 0.9)
plt.show()

The lines are great, and make for a beautiful plot, but let’s try and create a Prozone-esque heatmap on our pitch.

To do this, we can plot on the actual pitch that we played on, using the gmplot module. GM stands for Google Maps, and will import its functionality for our plot. Let’s take a look at how this works:

In [5]:
#Import the module first
import gmplot

#Start an instance of our map, with three arguments: lat/lon centre point of map - in this case,
#We'll use the first location in our data. The last argument is the default zoom level of the map
gmap = gmplot.GoogleMapPlotter(lat[0], lon[0], 20)

#Create our heatmap using our lat/lon lists for x and y coordinates
gmap.heatmap(lat, lon)

#Draw our map and save it to the html file named in the argument
gmap.draw("Player1.html")

This code will spit out a html file, that we can then open to get our heatmap plotted on a Google Maps background. Something like the below:

 Football heatmap created in Python

Summary

Similar visualisations of professional football matches set clubs and leagues back a pretty penny, and you can do this with entirely free software and increasingly affordable kit. While this won’t improve FC Python’s exceedingly poor on-pitch performances, we definitely think it is pretty cool!

Simply export your gpx data from Strava and extract the lat/long data, before plotting it as a line or as a heatmap on a map background for some really engaging visualisation.

Next up, learn about plotting this on a pitchmap, rather than satellite imagery.

Posted by FCPythonADMIN in Blog

Calculating ‘per 90’ with Python and Fantasy Football

When we are comparing data between players, it is very important that we standardise their data to ensure that each player has the same ‘opportunity’ to show their worth. The simplest way for us to do this, is to ensure that all players have the same amount of time within which to play. One popular way of doing this in football is to create ‘per 90’ values. This means that we will change our total amounts of goals, shots, etc. to show how many a player will do every 90 minutes of football that they play. This article will run through creating per 90 figures in Python by applying them to fantasy football points and data.

Follow the examples along below and feel free to use them where you are. Let’s get started by importing our modules and taking a look at our data set.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("../Data/Fantasy_Football.csv")
data.head()
Out[1]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_saved penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0 0.0 1 1
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 0 1 0 53 4 419 42.7 1 1
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0 0.0 1 1
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 0 3 0 0 14 421 62.5 2 1
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 0 2 77 15.7 2 1

5 rows × 26 columns

Our data has a host of data on our players’ fantasy football performance. We have their names, of course, and also their points and contributing factors (goals, clean sheets, etc.). Crucially, we have the players’ minutes played – allowing us to calculate their per 90 figures for the other variables.

Calculating our per 90 numbers is reasonably simple, we just need to find out how many 90 minute periods our player has played, then divide the variable by this value. The function below will show this step-by-step and show Kane’s goals p90 in the Premier League at the time of writing (goals = 20, minutes = 1868):

In [2]:
def p90_Calculator(variable_value, minutes_played):
    
    ninety_minute_periods = minutes_played/90
    
    p90_value = variable_value/ninety_minute_periods
    
    return p90_value

p90_Calculator(20, 1868)
Out[2]:
0.9635974304068522

There we go, Kane scores 0.96 goals per 90 in the Premier League! Our code, while explanatory is three lines long, when it can all be in one line. Let’s try again, and check that we get the same value:

In [3]:
def p90_Calculator(value, minutes):
    return value/(minutes/90)

p90_Calculator(20, 1868)
Out[3]:
0.9635974304068522

Great job! The code has the same result, in a third of the lines, and I still think it is fairly easy to understand.

Next up, we need to apply this to our dataset. Pandas makes this easy, as we can simply call a new column, and run our command with existing columns as arguments:

In [4]:
data["total_points_p90"] = p90_Calculator(data.total_points,data.minutes)
data.total_points_p90.fillna(0, inplace=True)
data.head()
Out[4]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 1 0 53 4 419 42.7 1 1 3.652174
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 3 0 0 14 421 62.5 2 1 4.288401
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 2 77 15.7 2 1 3.846154

5 rows × 27 columns

And there we have a total points per 90 column, which will hopefully offer some more insight than a simple points total. Let’s sort our values and view the top 5 players:

In [5]:
data.sort_values(by='total_points_p90', ascending =False).head()
Out[5]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
271 Tuanzebe 1 Axel Tuanzebe 38 39 0 1.7 1 1.0 0 0 0 0 0 3 0.0 2 12 90.0
322 Sims 20 Joshua Sims 39 43 0 0.1 1 1.0 0 0 0 0 0 3 0.0 3 14 90.0
394 Janssen 6 Vincent Janssen 9 74 0 0.1 1 1.0 0 0 0 0 0 2 0.0 4 17 90.0
166 Hefele 38 Michael Hefele 44 42 0 0.1 1 1.0 0 0 0 0 0 4 0.4 2 8 90.0
585 Silva 13 Adrien Sebastian Perruchet Silva 14 60 0 0.0 1 1.0 0 0 0 0 0 5 0.3 3 9 22.5

5 rows × 27 columns

Huh, probably not what we expected here… players with 1 point, and some surprisng names too. Upon further examination, these players suffer from their sample size. They’ve played very few minutes, so their numbers get overly inflated… there’s obviously no way a player gets that many points per 90!

Let’s set a minimum time played to our data to eliminate players without a big enough sample:

In [6]:
data.sort_values(by='total_points_p90', ascending =False)[data.minutes>400].head(10)[["web_name","total_points_p90"]]
Out[6]:
web_name total_points_p90
233 Salah 9.629408
279 Martial 8.927126
246 Sterling 8.378721
225 Coutinho 8.358882
325 Austin 8.003356
278 Lingard 7.951807
544 Niasse 7.460317
256 Agüero 7.346939
389 Son 7.288503
255 Bernardo Silva 7.119403

That seems a bit more like it! We’ve got some of the highest scoring players here, like Salah and Sterling, but if Austin, Lingard and Bernardo Silva can nail down long-term starting spots, we should certainly keep an eye on adding them in!

Let’s go back over this by creating a new column for goals per 90 and finding the top 10:

In [7]:
data["goals_p90"] = p90_Calculator(data.goals_scored,data.minutes)
data.goals_p90.fillna(0, inplace=True)
data.sort_values(by='goals_p90', ascending =False)[data.minutes>400].head(10)[["web_name","goals_p90"]]
Out[7]:
web_name goals_p90
233 Salah 0.968320
393 Kane 0.967222
325 Austin 0.906040
256 Agüero 0.823364
246 Sterling 0.797973
544 Niasse 0.793651
279 Martial 0.728745
258 Jesus 0.714995
278 Lingard 0.632530
160 Rooney 0.630252

Great job! Hopefully you can see that this is a much fairer way to rate our player data – whether for performance, fantasy football or media reporting purposes.

Summary

p90 data is a fundamental concept of football analytics. It is one of the first steps of cleaning our data and making it fit for comparisons. This article has shown how we can apply the concept quickly and easily to our data. For next steps, you might want to take a look at visualising this data, or looking at further analysis techniques.

Posted by FCPythonADMIN in Blog