
Scraping Lists Through Transfermarkt and Saving Images

In this tutorial, we’ll be looking to develop our scraping knowledge beyond just lifting text from a single page. Following through the article, you’ll learn how to scrape links from a page and iterate through them to take information from each link, speeding up the process of creating new datasets. We will also run through how to identify and download images, building a collection of pictures of every player in the Premier League. This should save 10 minutes a week for anyone searching in Google Images to decorate their pre-match presentations!

This tutorial builds on the first article in our scraping series, so it is strongly recommended that you understand the concepts there before starting here.

Let’s import our modules and get started. Requests and BeautifulSoup will be recognised from last time, but os.path might be new. The os.path module lets us work with file paths on the operating system, and its basename function returns the final part of a path, which is usually the file name. Tools like this are useful when we want to give our pictures a proper name.
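To make basename a little more concrete, here is a minimal sketch. The path below is made up purely for illustration and is not a real Transfermarkt address.

```python
from os.path import basename

# Hypothetical path to a saved picture, used only for illustration
picture_path = "images/portraits/example-player.jpg"

# basename keeps only the final part of the path: the file name
file_name = basename(picture_path)
print(file_name)
```

The same call works on a web address, as it simply returns whatever follows the last forward slash.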

In [1]:
import requests
from bs4 import BeautifulSoup
from os.path import basename

Our aim is to extract a picture of every player in the Premier League. We have identified Transfermarkt as our target, given that each player page should have a picture. Our secondary aim is to run this in one piece of code and not to run a new command for each player or team individually. To do this, we need to follow this process:

1) Locate a list of teams in the league with links to a squad list – then save these links

2) Run through each squad list link and save the link to each player’s page

3) Locate the player’s image and save it to our local computer

What seems to be a massive task can be distilled down to three main steps. Below, we’ll break each one down.

Firstly, however, we need to set our headers to give the appearance of a human user when we call for data from Transfermarkt.

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

The Premier League page is the obvious place to start. As you can see, each team name is a link through to the squad page.

All that we need to do is process the page with BeautifulSoup (check the first article for more details) and identify the team links using ‘soup.select()’ with the links’ CSS selector. These links should be added to a list for later.

Finally, we add the Transfermarkt domain to the front of each link so that we can call them on their own.

Check out the annotated code below for detailed instructions:

In [3]:
#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    

So we now have 20 team links, with each looking like this:

In [4]:
teamLinks[14]
Out[4]:
'https://www.transfermarkt.co.uk/leicester-city/startseite/verein/1003/saison_id/2017'
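As an aside, the standard library’s urljoin can do the same job as gluing the domain onto each link by hand. This is just a sketch of an alternative, not what the code above uses, and the relative link is copied from the output shown above.

```python
from urllib.parse import urljoin

base = "https://www.transfermarkt.co.uk"

# A relative link in the same shape as the team links we scraped
relative = "/leicester-city/startseite/verein/1003/saison_id/2017"

# urljoin combines the domain and the relative link into a full address
full_link = urljoin(base, relative)
print(full_link)

# A link that is already a full address is left unchanged
already_full = urljoin(base, full_link)
```

The benefit is that a link can never end up with the domain attached twice, even if the same list is processed more than once.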

We will now iterate through each of these team links and do the same thing, only this time we are taking player links and not squad links. Take a look through the code below, but you’ll notice that it is very similar to the last chunk of instructions – the key difference being that we will run it within a loop to go through all 20 teams in one go.

In [5]:
#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for teamLink in teamLinks:

    #Download and process the team page
    tree = requests.get(teamLink, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all links
    links = soup.select("a.spielprofil_tooltip")

    #For each link, take the location that it is pointing to and put the transfermarkt domain
    #in front of it straight away - this way, no link can have the domain added twice
    for link in links:
        playerLinks.append("https://www.transfermarkt.co.uk"+link.get("href"))

#The pages list the players more than once - let's use list(set(XXX)) to remove the duplicates
playerLinks = list(set(playerLinks))
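One thing to be aware of is that list(set(XXX)) removes duplicates but does not promise to keep the links in their original order. If the order matters to you, a possible alternative is dict.fromkeys. The links in this sketch are made up for illustration.

```python
# Made-up links with one duplicate, for illustration only
links = ["/player-a/profil/spieler/1",
         "/player-b/profil/spieler/2",
         "/player-a/profil/spieler/1"]

# dict.fromkeys keeps the first occurrence of each link, in order
unique_links = list(dict.fromkeys(links))
print(unique_links)
```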

Locate and save each player’s image

We now have a lot of links for players…

In [6]:
len(playerLinks)
Out[6]:
526

526 links, in fact! We now need to iterate through each of these links and save the player’s picture.

Hopefully you should now be comfortable with the process to download and process a webpage, but the second part of this step will need some unpacking – locating the image and saving it.

Once again, we are locating elements in the page. When we try to identify the correct image on the page, it seems that the best way to do this is through the ‘title’ attribute – which is the player’s name. It would be impractical to enter the name manually for each player, so we need to find it elsewhere on the page. Fortunately, this is easy, as the name sits in the only ‘h1’ element.

Subsequently, we assign this name to the name variable, then use it to call the correct image.

When we call the image, we actually need to call the location where the image is saved on the website’s server. We do this by calling for the image’s source. The source contains some extra information that we don’t need, so we use .split() to isolate the information that we do need and save that to our ‘src’ variable.
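Here is a small sketch of what .split() does to a source like this. The address is hypothetical and only imitates the shape described above, with extra information following ‘?lm’.

```python
# Hypothetical image source, shaped like the one described above
raw_src = "https://example.com/portraits/header/12345-1500000000.jpg?lm=1500000000"

# .split("?lm") cuts the text at "?lm" and returns the pieces as a list;
# item 0 is everything that came before it
src = raw_src.split("?lm")[0]
print(src)
```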

The final thing to do is to save the image from this source location. We do this by opening a new file named after the player, then writing the content from the source to the new file. Incredibly, Python does this in just two lines. All images will be saved into the folder where your Python notebook or file is saved.

Try and follow through the code below with these instructions:

In [7]:
for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)
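Scraped text can carry stray spaces, line breaks or characters that are not allowed in file names. If saving ever fails for that reason, a small clean-up step along these lines may help. The function name safe_file_name is our own invention for this sketch and is not part of any library.

```python
import re

def safe_file_name(raw_name, extension=".jpg"):
    """Turn scraped heading text into a tidy file name (hypothetical helper)."""
    # Collapse runs of spaces and line breaks into single spaces
    name = " ".join(raw_name.split())
    # Swap characters that file names cannot contain for underscores
    name = re.sub(r'[\\/:*?"<>|]', "_", name)
    return name + extension

print(safe_file_name("\n  Example   Player \n"))
```

In the loop above, this would be used in place of name[0].text+".jpg" when opening the file.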

This will take a couple of minutes to run, as we have 526 images to find and save. However, this short investment of time will save you 10 minutes each week in finding these pictures. Additionally, just change the link from the Premier League table to apply the code to any other league (assuming Transfermarkt is laid out in the same way!).
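If the run is interrupted part-way through, it would be a shame to download everything again. One possible approach, sketched below, is to check whether a picture already exists before fetching it. The helper name needs_download is hypothetical.

```python
import os

def needs_download(file_name, folder="."):
    """Return True if the picture has not been saved yet (hypothetical helper)."""
    return not os.path.exists(os.path.join(folder, file_name))

# Example check for a made-up file name in the current folder
print(needs_download("Example Player.jpg"))
```

Inside the loop, the image request and the file writing would then only run when needs_download returns True.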

Your folder should now look something like this:

[Image: Images scraped from Transfermarkt]

Summary

The aim of this article is to demonstrate two things. Firstly, how to collect links from a page and loop through them to further automate scraping. We have seen two examples of this – collecting team links and collecting player links. By taking a broader approach to scraping and understanding a website’s structure, we can collect information en masse, saving lots of time in the future.

Secondly, how to collect and save images. This article explains that images are saved on the website’s server, and we must locate where they are and save them from this location. Python makes this idea simple in execution as we can save from a location in just two lines. Also, by combining this with our iterations through teams and players, we can save 526 pictures in a matter of minutes!

For further development, you may want to expand the data that you collect from each player, apply this logic to different sites, or even learn about navigating through your files to save players in team folders.
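As a starting point for the team folders idea, here is a rough sketch using pathlib. The function save_image, the ‘images’ root folder and the team and player names are all assumptions made for this example. In practice, you would also need to keep track of which team each player link came from.

```python
from pathlib import Path

def save_image(image_bytes, team, player, root="images"):
    """Write image bytes to root/team/player.jpg (hypothetical layout)."""
    folder = Path(root) / team
    # Create the team folder, and any parent folders, if they are missing
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / (player + ".jpg")
    target.write_bytes(image_bytes)
    return target

# Placeholder bytes stand in for the content of a downloaded image
saved = save_image(b"not a real image", "Example Team", "Example Player")
print(saved)
```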

For your next FC Python course, why not take a look at our visualisation tutorials?

Posted by FCPythonADMIN in Blog, Scraping

Introduction to Scraping Data from Transfermarkt

Before starting the article, I’m obliged to mention that web scraping is a grey area legally and ethically in lots of circumstances. Please consider the positive and negative effects of what you scrape before doing so!

Warning over. Web scraping is a hugely powerful tool that, when done properly, can give you access to huge, clean data sources to power your analysis. The applications are just about endless for anyone interested in data. As a professional analyst, you can scrape fixtures and line-up data from around the world every day to plan scouting assignments or alert you to youth players breaking through. As an amateur analyst, it is quite likely to be your only source of data for analysis.

This tutorial is just an introduction to scraping with Python. It will take you through the basic process of loading a page, locating information and retrieving it. Combine the knowledge on this page with for loops to cycle through a site and HTML knowledge to understand a web page, and you’ll be able to collect just about any data you can find.

Let’s fire up our modules & get started. We’ll need requests (to access and process web pages with Python) and beautifulsoup (to make sense of the code that makes up the pages) so make sure you have these installed.

In [1]:
import requests
from bs4 import BeautifulSoup

import pandas as pd

Our process for extracting data is going to go something like this:

  1. Load the webpage containing the data.
  2. Locate the data within the page and extract it.
  3. Organise the data into a dataframe

For this example, we are going to take the player names and values for the most expensive players in a particular year. The page that we’ll use is Transfermarkt’s transfer records list, and its full address appears in the code in the next section.

The following sections will run through each of these steps individually.

Load the webpage containing the data

The very first thing that we are going to do is create a variable called ‘headers’ and assign it a dictionary containing a ‘User-Agent’ string. This tells the website that we are a browser, and not a scraping tool. In short, we may be blocked if we are thought to be scraping!

Next, we have three lines. The first one assigns the address that we want to scrape to a variable called ‘page’.

The second uses the requests library to grab the code of the page and assign it to ‘pageTree’. We pass our headers variable here so that the request looks like it comes from a regular browser.

Finally, the BeautifulSoup module parses the page’s HTML into a structure that we can search. We will then be able to look through this for the data that we want to extract. This is saved to ‘pageSoup’, and you can find all three lines here:

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Locate the data within a page & extract it

To fully appreciate what we are doing here, you probably need a basic grasp of HTML – the language that structures a webpage. As simply as I can put it for this article, HTML is made up of elements, like a paragraph or a link, that tell the browser what to render. For scraping, we will use this information to tell our program what information to take.

Take another look at the page we are scraping. We want two things – the player name and the transfer value.

The player name is a link. This is denoted as an ‘a’ tag in HTML, so we will use the ‘find_all’ function to look for all of the a tags in the page. However, there are obviously lots of links! Fortunately, we can use the class given to the players’ names specifically on this page to only take these ones – the class name is passed to the ‘find_all’ function as a dictionary.

This function will return a list with all elements that match our criteria.

If you’re curious, classes are usually used to apply styles (such as colour or border) to elements in HTML.
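To see the class filter working in isolation, here is a small sketch that runs on a few lines of hand-written HTML rather than the real page. The markup and names are invented for the example. It also shows the CSS selector form, ‘soup.select()’, which should pick out the same elements.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, not real Transfermarkt markup
html = """
<p>
  <a class="spielprofil_tooltip" href="/p/1">Player One</a>
  <a class="other" href="/x">Not a player</a>
  <a class="spielprofil_tooltip" href="/p/2">Player Two</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter 'a' tags by class using a dictionary, as described above
by_dict = soup.find_all("a", {"class": "spielprofil_tooltip"})

# The same filter written as a CSS selector
by_css = soup.select("a.spielprofil_tooltip")

names = [link.text for link in by_dict]
print(names)
```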

The code to extract the players’ names is here:

In [3]:
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})

#Let's look at the first name in the Players list.
Players[0].text
Out[3]:
'Luís Figo'

Looks like that works! Now let’s take the values.

As you can see on the page, the values are not a link, so we need to find a new feature to identify them by.

They are in a table cell, denoted by ‘td’ in HTML, so let’s look for that. The class to highlight these cells specifically is ‘rechts hauptlink’, as you’ll see below.

Let’s assign this to Values and check Figo’s transfer value:

In [4]:
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})

Values[0].text
Out[4]:
'£54.00m'

That’s a lot of money! Even in today’s market! But according to the page, our data is correct. Now all we need to do is process the data into a dataframe for further analysis or to save for use elsewhere.

Organise the data into a dataframe

This is pretty simple. We know that there are 25 players in the list, so let’s use a for loop to add the first 25 players and values to new lists (to ensure that no stragglers elsewhere in the page jump on). With these new lists, we’ll create a new dataframe:

In [5]:
PlayersList = []
ValuesList = []

for i in range(0,25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
    
df = pd.DataFrame({"Players":PlayersList,"Values":ValuesList})

df.head()
Out[5]:
             Players   Values
0          Luís Figo  £54.00m
1      Hernán Crespo  £51.13m
2      Marc Overmars  £36.00m
3  Gabriel Batistuta  £32.54m
4     Nicolas Anelka  £31.05m

And now we have a dataframe with our scraped data, pretty much ready for analysis!
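The values are still text at this point, so one likely next step is turning them into numbers. The sketch below is one way to do that for strings shaped like ‘£54.00m’. The function name value_to_millions is our own, and the handling of a ‘k’ suffix for thousands is an assumption about how smaller fees might be written.

```python
def value_to_millions(text):
    """Convert a string such as '£54.00m' to a number in millions of pounds."""
    cleaned = text.strip().replace("£", "")
    if cleaned.endswith("m"):
        return float(cleaned[:-1])
    if cleaned.endswith("k"):
        # Assumption: smaller fees use a 'k' suffix for thousands
        return float(cleaned[:-1]) / 1000
    raise ValueError("Unrecognised value format: " + text)

print(value_to_millions("£54.00m"))
```

With the dataframe above, something like df["Values"].apply(value_to_millions) would apply this to every row.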

Summary

This article has gone through the absolute basics of scraping. We can now load a page, identify elements that we want to scrape and then process them into a dataframe.

There is more that we need to do to scrape efficiently though. Firstly, we can apply a for loop to the whole program above, changing the initial webpage name slightly to scrape the next year – I’ll let you figure out how!
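If you would like a hint, here is one possible way to build the list of pages. It assumes that the ‘saison_id’ number at the end of the address selects the season, as the 2000 in our address suggests. The range of years is an arbitrary choice for the example.

```python
# The address used in this article, with the season left as a placeholder
base = ("https://www.transfermarkt.co.uk/transfers/transferrekorde/"
        "statistik/top/plus/0/galerie/0?saison_id={}")

# Assumption: changing saison_id changes the season that is shown
pages = [base.format(year) for year in range(2000, 2005)]

for page in pages:
    print(page)
```

Each address in pages could then be passed through the same load, locate and organise steps from above.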

You will also need to understand more about HTML, particularly class and ID selectors, to get the most out of scraping. Regardless, if you’ve followed along and understand what we’ve achieved and how, then you’re in a good place to apply this to other pages.

The techniques in this article produced the data for our joyplots tutorial. Why not take a read of that next?
