
Scraping Premier League Football Data with Python

We’ve already seen in previous scraping articles how we can identify parts of a web page and scrape them into a dataframe. We would strongly recommend taking a look through our introductory piece on scraping before pressing forward here.

Another previous article stored player images for match/scouting reports, pre-match presentations, etc. This time, we will be looking to collate the height, weight, age and appearance data for each Premier League player from the official site. Let’s get our modules imported and run through our process:

In [1]:
from lxml import html
import requests
import pandas as pd
import numpy as np
import re

Take a look at a player page from the Premier League site. There is loads of information here, but we are interested in collecting the age, apps, height and weight data.

We could do this manually for each player of each team, but hopefully we can also scrape through a list of each player in each team, and a list of each team in the league to automate the process entirely. Subsequently, our plan for the code is going to be something like this:

  • Scrape the clubs page and make a list of each team page
  • Scrape each team’s page for players and make a list of each player
  • Scrape each player page and take their height, weight and apps number
    • Save this into a table for later analysis

Read the Clubs page and list each team

Our intro article goes through this in much more depth, so take a look there if any of the code below is confusing.

In short, we need to download the html of the page and identify the links pointing towards the teams. We then save this into a list that we can use later.

Take a look through the code and try to follow along. A reminder that more detail is here if you need it!

In [2]:
#Take site and structure html
page = requests.get('https://www.premierleague.com/clubs')
tree = html.fromstring(page.content)
In [3]:
#Using the page's CSS classes, extract all links pointing to a team
linkLocation = tree.cssselect('.indexItem')

#Create an empty list for us to send each team's link to
teamLinks = []

#For each link...
for i in range(0,20):
    
    #...Find the page the link is going to...
    temp = linkLocation[i].attrib['href']
    
    #...Add the link to the website domain...
    temp = "http://www.premierleague.com/" + temp
    
    #...Change the link text so that it points to the squad list, not the page overview...
    temp = temp.replace("overview", "squad")
    
    #...Add the finished link to our teamLinks list...
    teamLinks.append(temp)
    
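As a quick sanity check (a minimal sketch – the exact links depend on the live page), we can confirm that one link per club made it into the list:

#Quick check - we expect one link per Premier League club
print(len(teamLinks))
print(teamLinks[0])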

Scrape each team’s page and list each player

Our process here is very similar to the first step; this time, we are just looking to create a longer list containing each player, rather than each team.

The main difference is that we will create two links, as the data that we need is across both the player overview page, and the player stats page.

Once again, if anything here is confusing, check out the intro to scraping piece! You might also want to check out the for loop page as we have a nested for loop in this part of the code!

In [4]:
#Create empty lists for player links
playerLink1 = []
playerLink2 = []

#For each team link page...
for i in range(len(teamLinks)):
    
    #...Download the team page and process the html code...
    squadPage = requests.get(teamLinks[i])
    squadTree = html.fromstring(squadPage.content)
    
    #...Extract the player links...
    playerLocation = squadTree.cssselect('.playerOverviewCard')

    #...For each player link within the team page...
    for j in range(len(playerLocation)):
        
        #...Save the link, complete with domain...
        temp = "http://www.premierleague.com/" + playerLocation[j].attrib['href']
        playerLink1.append(temp)
        
        #...For the second link, change the page from player overview to stats
        playerLink2.append(temp.replace("overview", "stats"))

Scrape each player’s page for their age, apps, height and weight data

If you have been able to follow along with the previous steps, you’ll be absolutely fine here too. The steps are very similar again, just this time we are looking to store data, not links.

We will start this step by defining empty lists for the datapoints we intend to capture. Afterwards, we’ll work through each player link to save the player’s details. We will also add a little line of code to add in some blank data if the site is missing any details – this should allow us to run without any errors. After collecting each player’s data, we will simply add it to the lists.

Let’s collect the data into lists, before we put it into a dataframe and save it as a spreadsheet:

In [5]:
#Create lists for each variable
Name = []
Team = []
Age = []
Apps = []
HeightCM = []
WeightKG = []


#Populate lists with each player

#For each player...
for i in range(len(playerLink1)):

    #...download and process the two pages collected earlier...
    playerPage1 = requests.get(playerLink1[i])
    playerTree1 = html.fromstring(playerPage1.content)
    playerPage2 = requests.get(playerLink2[i])
    playerTree2 = html.fromstring(playerPage2.content)

    #...find the relevant datapoint for each player, starting with name...
    tempName = str(playerTree1.cssselect('div.name')[0].text_content())
    
    #...and team, but if there isn't a team, return "BLANK"...
    try:
        tempTeam = str(playerTree1.cssselect('.table:nth-child(1) .long')[0].text_content())
    except IndexError:
        tempTeam = str("BLANK")
    
    #...and age, but if this isn't there, leave a blank 'no number' number...
    try:  
        tempAge = int(playerTree1.cssselect('.pdcol2 li:nth-child(1) .info')[0].text_content())
    except IndexError:
        tempAge = float('NaN')

    #...and appearances. This is a bit of a mess on the page, so tidy it first...
    try:
        tempApps = playerTree2.cssselect('.statappearances')[0].text_content()
        tempApps = int(re.search(r'\d+', tempApps).group())
    except IndexError:
        tempApps = float('NaN')

    #...and height. Needs tidying again...
    try:
        tempHeight = playerTree1.cssselect('.pdcol3 li:nth-child(1) .info')[0].text_content()
        tempHeight = int(re.search(r'\d+', tempHeight).group())
    except IndexError:
        tempHeight = float('NaN')

    #...and weight. Same with tidying and returning blanks if it isn't there
    try:
        tempWeight = playerTree1.cssselect('.pdcol3 li+ li .info')[0].text_content()
        tempWeight = int(re.search(r'\d+', tempWeight).group())
    except IndexError:
        tempWeight = float('NaN')


    #Now that we have a player's full details - add them all to the lists
    Name.append(tempName)
    Team.append(tempTeam)
    Age.append(tempAge)
    Apps.append(tempApps)
    HeightCM.append(tempHeight)
    WeightKG.append(tempWeight)
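As an aside, the try/except pattern above repeats for every datapoint. If you prefer, it can be wrapped in a small helper – this is just a sketch with a made-up name (grabNumber), not part of the walkthrough above:

#A sketch of a reusable helper - returns the first number found at a CSS
#selector, or NaN if the element or number is missing
def grabNumber(tree, selector):
    try:
        text = tree.cssselect(selector)[0].text_content()
        return int(re.search(r'\d+', text).group())
    except (IndexError, AttributeError):
        return float('NaN')

#For example: tempHeight = grabNumber(playerTree1, '.pdcol3 li:nth-child(1) .info')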

Saving our lists to a dataframe

You’ll have noticed that if the data wasn’t available, we add a blank item to the list instead. This is really important as it keeps all of our lists at the same length and means that player data is all in the same row.

We can now add this to a dataframe, made ridiculously easy through the pandas module. Let’s create it and check out our data:

In [6]:
#Create data frame from lists
df = pd.DataFrame(
    {'Name':Name,
     'Team':Team,
     'Age':Age,
     'Apps':Apps,
     'HeightCM':HeightCM,
     'WeightKG':WeightKG})

#Show me the top 5 rows:

df.head()
Out[6]:
Age Apps HeightCM Name Team WeightKG
0 29 25 183.0 David Ospina Arsenal 80.0
1 35 432 196.0 Petr Cech Arsenal 90.0
2 23 0 198.0 Matt Macey Arsenal 81.0
3 23 0 195.0 Dejan Iliev Arsenal 87.0
4 24 22 183.0 Sead Kolasinac Arsenal 85.0
In [7]:
#Show me Karius' height:

df[df['Name']=="Loris Karius"]["HeightCM"]
Out[7]:
272    189.0
Name: HeightCM, dtype: float64

Everything seems to check out, so you’re now free to use this data in Python for analysis or visualisation, or you may want to export it for use elsewhere, with the ‘.to_csv’ function:

In [8]:
df.to_csv("EPLData.csv")

One slight caveat with this dataset is that it includes players on loan – you may want to exclude them. Check out the data analysis course to learn about cleaning your datasets.
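We can’t spot loanees from these columns alone, but as a minimal cleaning sketch (assuming the dataframe built above), you could drop incomplete rows before exporting:

#Drop any rows with missing data, then export the cleaned set
cleanDf = df.dropna()
cleanDf.to_csv("EPLDataClean.csv", index=False)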

Summary

In this article, we’ve covered a lot of fundamental Python tasks through scraping, including for loops, lists and data frames – in addition to increasingly complex ideas like processing html and css classes. If you’ve followed along, great work! But there’s no reason not to go back over these topics to make sure you’ve got a decent understanding of them.

Next up, you might want to take a look at visualising some of the age data to check out team profiles!
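As a starting point, here is a minimal sketch (assuming the dataframe built above – the team name is just an example) that plots one squad’s age profile:

import matplotlib.pyplot as plt

#Plot the age distribution of a single team as a quick profile check
teamAges = df[df['Team'] == "Arsenal"]['Age'].dropna()

plt.hist(teamAges, bins=range(16, 41))
plt.title("Arsenal - Squad Age Profile")
plt.xlabel("Age")
plt.ylabel("Number of players")
plt.show()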

Posted by FCPythonADMIN in Scraping

Calling an API with Python Requests – Visualising ClubElo data

Working in Python, your data is likely to come from a number of different places – spreadsheets, databases or elsewhere. Eventually, you will find that some interesting and useful data is only available through a web API – a stream of data that you will need to call, download and format for your analysis.

This article will introduce calling an API with the requests library, before formatting it into a dataframe and visualising it. Our example makes use of the fantastic work done at clubelo.com – a site that applies the elo rating system to football. Their API is easy to use and provides us with a great opportunity to learn about the processes in this article!

Let’s get our modules together and get started:

In [1]:
import requests
import csv
from io import StringIO
import pandas as pd
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns

Calling an API

Downloading a dataset through an API must be complicated, surely? Of course, Python and its libraries make this as simple as possible. The requests library will do this quickly and easily with the ‘.get’ function. All we need to do is provide the API location that we want to read from. Other APIs will require authentication, but for now, we just need to provide the API address.

In [2]:
r = requests.get('http://api.clubelo.com/ManCity')
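As an aside, if an API does require authentication, requests lets you pass a key alongside the call – a hedged sketch with placeholder values, not something the clubelo API needs:

#A sketch only - the URL and token below are placeholders, not a real service
apiKey = "YOUR_API_KEY"
r2 = requests.get('https://api.example.com/data',
                  headers={'Authorization': 'Bearer ' + apiKey})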


If you would like to run the tutorial with a different team, take a look at the instructions here and find your club on the site to find the correct name to use.

Our new ‘r’ variable contains a lot of information. It will hold the data that we will analyse, the address that we called from and a status code to let us know if it worked or not. Let’s check our status code:

In [3]:
r.status_code
Out[3]:
200

There are dozens of status codes, which you can find here, but we are hoping for a 200 code, telling us that the call went through as planned.
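If you would rather your script stop when anything other than a 200 comes back, requests can raise the error for you – a minimal sketch:

#Raises an HTTPError for 4xx/5xx responses, rather than carrying on silently
r.raise_for_status()

#Or check manually
if r.status_code != 200:
    print("Request failed with status code", r.status_code)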

Now that we know that our request has made its way back, let’s check out what the API gives us with .text applied to the request (we have shortened the export dramatically, but it carries on as you see below):

In [4]:
r.text
Out[4]:

'Rank,Club,Country,Level,Elo,From,To\nNone,Man City,ENG,2,1365.06604004,1946-07-07,1946-09-04\n...'

We’re given a load of text that, if you read carefully, is separated by commas and ‘\n’. Hopefully you recognise that this could be a CSV file!


Formatting our request data

We need to turn this into a spreadsheet-style dataframe in order to do anything with it. We will do this in two steps: firstly, wrapping this text in a file-like object with StringIO (from the io module); we can then use pandas to turn it into a dataframe. Check out how below:

In [5]:
data = StringIO(r.text)
df = pd.read_csv(data, sep=",")

df.head()
Out[5]:
Rank Club Country Level Elo From To
0 None Man City ENG 2 1365.066040 1946-07-07 1946-09-04
1 None Man City ENG 2 1372.480469 1946-09-05 1946-09-07
2 None Man City ENG 2 1369.613770 1946-09-08 1946-09-14
3 None Man City ENG 2 1383.733887 1946-09-15 1946-09-18
4 None Man City ENG 2 1385.578369 1946-09-19 1946-09-21

Awesome, we have a dataframe that we can analyse and visualise! One more thing that we need to format is the date columns. By default, they are strings of text and we need to reformat them to utilise the date functionality in our analysis.

Pandas makes this easy with the .to_datetime() function. Let’s reassign the from and to columns with this:

In [6]:
df.From = pd.to_datetime(df['From'])
df.To = pd.to_datetime(df['To'])

Visualising the data

The most obvious visualisation of this data is the journey that a team’s elo rating has taken.

As we have created our date columns, we can use matplotlib’s plot_date to easily create a time series chart. Let’s fire one off with our data that we’ve already set up:

In [7]:
#Set the visual style of the chart with Seaborn and the size of our chart with matplotlib
sns.set_style("dark")
plt.figure(num=None, figsize=(10, 4), dpi=80)

#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("Manchester City - Elo Rating", loc="left", fontsize=15)

#Display the chart
plt.show()
Manchester City Elo Chart

And let’s change a couple of the style options with matplotlib to tidy this up a bit. Hopefully you can figure out how we have changed the background colour, text style and size by reading through the two pieces of code.

In [8]:
#Set the visual style of the chart with Seaborn and the size of our chart with matplotlib
sns.set_style("dark")
fig = plt.figure(num=None, figsize=(15, 5), dpi=600)
axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
fig.patch.set_facecolor('#edeeef')


#Plot the elo column along the from dates, as a line ("-"), and in City's colour
plt.plot_date(df.From, df.Elo,'-', color="deepskyblue")

#Set a title, write it on the left hand side
plt.title("   Manchester City - Elo Rating", loc="left", fontsize=18, fontname="Arial Rounded MT Bold")

#Display the chart
plt.show()
Manchester City Elo Evolution

Now this is a lot of code to piece through and run every time we want a chart, so let’s create a function to do it all in one go. Try and read through it carefully, matching it to the steps above.

In [9]:
def plotClub(team, colour = "dimgray"):
    r = requests.get('http://api.clubelo.com/' + str(team))
    data = StringIO(r.text)
    df = pd.read_csv(data, sep=",")
    
    df.From = pd.to_datetime(df['From'])
    df.To = pd.to_datetime(df['To'])
    
    sns.set_style("dark")
    fig = plt.figure(num=None, figsize=(12, 4), dpi=600)
    axes = fig.add_subplot(1, 1, 1, facecolor='#edeeef')
    fig.patch.set_facecolor('#edeeef')    
    plt.plot_date(df.From, df.Elo,'-', color = colour)
    plt.title("    " + str(team) + " - Elo Rating", loc="left",  fontsize=18, fontname="Arial Rounded MT Bold")
    plt.show()

And let’s give it a go…

In [10]:
plotClub("RBLeipzig", "red")
 RB Leipzig Elo evolution

So we’re now calling, tidying and plotting our request in one go! Great work! Can you create a plot that compares two teams? Take a look through the matplotlib documentation to learn more about customising these plots too!
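As a nudge towards that comparison, here is a minimal sketch (reusing the modules imported at the top) that calls two clubs and overlays them on one chart – the club names and colours are just examples:

#Compare two clubs on a single chart
for team, colour in [("ManCity", "deepskyblue"), ("Liverpool", "crimson")]:
    r = requests.get('http://api.clubelo.com/' + team)
    teamDf = pd.read_csv(StringIO(r.text), sep=",")
    teamDf.From = pd.to_datetime(teamDf['From'])
    plt.plot_date(teamDf.From, teamDf.Elo, '-', color=colour, label=team)

plt.title("Elo Rating Comparison", loc="left", fontsize=15)
plt.legend()
plt.show()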

Of course, repeatedly calling an API is bad practice, so perhaps work on calling the data and storing it locally instead of making the same request over and over.
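One simple approach is to treat a local CSV as the cache – a sketch along these lines, with a made-up filename:

import os

#Only call the API if we haven't already saved the data locally
cacheFile = "ManCityElo.csv"

if os.path.exists(cacheFile):
    df = pd.read_csv(cacheFile)
else:
    r = requests.get('http://api.clubelo.com/ManCity')
    df = pd.read_csv(StringIO(r.text), sep=",")
    df.to_csv(cacheFile, index=False)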

Summary

Being able to call data and structure it for analysis is a crucial skill to pick up and develop. This article has introduced the topic with the readily available and easy-to-utilise API available from clubelo. We owe them a thank you for their help and permission in putting this piece together!

To develop here, you should work on calling from APIs and storing the data for later use in a user-friendly format. Take a look at other sport and non-sport APIs and get practicing!

If you would like to learn more about formatting our charts like we have done above, take a look through some rules and code for better visualisations in Python.

Posted by FCPythonADMIN in Blog

Scraping Lists Through Transfermarkt and Saving Images

In this tutorial, we’ll be looking to develop our scraping knowledge beyond just lifting text from a single page. Following through the article, you’ll learn how to scrape links from a page and iterate through them to take information from each link, speeding up the process of creating new datasets. We will also run through how to identify and download images, creating a database of every player in the Premier League’s picture. This should save 10 minutes a week for anyone searching in Google Images to decorate their pre-match presentations!

This tutorial builds on the first article in our scraping series, so it is strongly recommended that you understand the concepts there before starting here.

Let’s import our modules and get started. Requests and BeautifulSoup will be recognised from last time, but os.path might be new. The os.path module allows us to manipulate and utilise the operating system’s file structure, while basename gives us the ability to change and add file names – we’ll need this to give our pictures a proper name.

In [1]:
import requests
from bs4 import BeautifulSoup
from os.path  import basename

Our aim is to extract a picture of every player in the Premier League. We have identified Transfermarkt as our target, given that each player page should have a picture. Our secondary aim is to run this in one piece of code and not to run a new command for each player or team individually. To do this, we need to follow this process:

1) Locate a list of teams in the league with links to a squad list – then save these links

2) Run through each squad list link and save the link to each player’s page

3) Locate the player’s image and save it to our local computer

What seems to be a massive job distils down to three main tasks. Below, we’ll break each one down.

Firstly, however, we need to set our headers to give the appearance of a human user when we call for data from Transfermarkt.

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

The Premier League page is the obvious place to start. As you can see, each team name is a link through to the squad page.

All that we need to do is process the page with BeautifulSoup (check the first article for more details) and identify the team links with ‘soup.select()’, using the links’ CSS selector. These links should be added to a list for later.

Finally, we append these links to the transfermarkt domain so that we can call them on their own.

Check out the annotated code below for detailed instructions:

In [3]:
#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    

So we now have 20 team links, with each looking like this:

In [4]:
teamLinks[14]
Out[4]:
'https://www.transfermarkt.co.uk/leicester-city/startseite/verein/1003/saison_id/2017'

We will now iterate through each of these team links and do the same thing, only this time we are taking player links and not squad links. Take a look through the code below, but you’ll notice that it is very similar to the last chunk of instructions – the key difference being that we will run it within a loop to go through all 20 teams in one go.

In [5]:
#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    #Download and process the team page
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all links
    links = soup.select("a.spielprofil_tooltip")
    
    #For each link, extract the location that it is pointing to and add the
    #transfermarkt domain so that it is ready to scrape
    for j in range(len(links)):
        playerLinks.append("https://www.transfermarkt.co.uk" + links[j].get("href"))

#The page lists each player more than once - let's use list(set(XXX)) to remove the duplicates
playerLinks = list(set(playerLinks))

Locate and save each player’s image

We now have a lot of links for players…

In [6]:
len(playerLinks)
Out[6]:
526

526 links, in fact! We now need to iterate through each of these links and save the player’s picture.

Hopefully you should now be comfortable with the process to download and process a webpage, but the second part of this step will need some unpacking – locating the image and saving it.

Once again, we are locating elements in the page. When we try to identify the correct image on the page, it seems that the best way to do this is through the ‘title’ attribute – which is the player’s name. It is ridiculous for us to manually enter the name for each one, so we need to find this elsewhere on the page. Fortunately, it is easy to find as it is the only ‘h1’ element on the page.

Subsequently, we assign this name to the name variable, then use it to call the correct image.

When we call the image, we actually need to call the location where the image is saved on the website’s server. We do this by calling for the image’s source. The source contains some extra information that we don’t need, so we use .split() to isolate the information that we do need and save that to our ‘src’ variable.

The final thing to do is to save the image from this source location. We do this by opening a new file named after the player, then saving the content from the source into the new file. Incredibly, Python does this in just two lines. All images will be saved in the folder where your Python notebook or script is located.

Try and follow through the code below with these instructions:

In [7]:
for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)

This will take a couple of minutes to run, as we have 526 images to find and save. However, this short investment of time will save you 10 minutes each week in finding these pictures. Additionally, just change the link from the Premier League table to apply the code to any other league (assuming Transfermarkt is laid out in the same way!).

Your folder should now look something like this:

Images scraped from Transfermarkt

Summary

The aim of this article is to demonstrate two things. Firstly, how to collect links from a page and loop through them to further automate scraping. We have seen two examples of this – collecting team and player links. It is clear to see how, by taking a bigger-picture approach to scraping and understanding a website’s structure, we can collect information en masse, saving lots of time in the future.

Secondly, how to collect and save images. This article explains that images are saved on the website’s server, and we must locate where they are and save them from this location. Python makes this idea simple in execution as we can save from a location in just two lines. Also, by combining this with our iterations through players and teams, we can save 526 pictures in a matter of minutes!

For further development, you may want to expand the data that you collect from each player, apply this logic to different sites, or even learn about navigating through your files to save players in team folders.
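For the last of those ideas, here is a short sketch of saving into per-team folders with os.makedirs – the names below are placeholders for values you would collect during the scrape:

import os
import requests

#Placeholders - in practice these come from the scraping loop above
teamName = "Example FC"
playerName = "Example Player"
src = "https://img.example.com/player.jpg"

#Create the team folder if it doesn't already exist, then save the image inside it
folder = os.path.join("images", teamName)
os.makedirs(folder, exist_ok=True)

with open(os.path.join(folder, playerName + ".jpg"), "wb") as f:
    f.write(requests.get(src).content)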

For your next FC Python course, why not take a look at our visualisation tutorials?

Posted by FCPythonADMIN in Blog, Scraping