Scraping Lists Through Transfermarkt and Saving Images

In this tutorial, we’ll develop our scraping knowledge beyond lifting text from a single page. Following through the article, you’ll learn how to scrape the links from a page and iterate through them to take information from each one, speeding up the creation of new datasets. We will also run through how to identify and download images, creating a database of pictures of every player in the Premier League. This should save 10 minutes a week for anyone searching Google Images to decorate their pre-match presentations!

This tutorial builds on the first article in our scraping series, so it is strongly recommended that you understand the concepts there before starting here.

Let’s import our modules and get started. Requests and BeautifulSoup will be recognised from last time, but os.path might be new. os.path allows us to work with the operating system’s file structure, while basename pulls a file name out of a full path – useful when we want to give our pictures proper names. A quick example follows the imports.

In [1]:
import requests
from bs4 import BeautifulSoup
from os.path  import basename
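
As a quick illustration, basename strips the directory part of a path and leaves only the file name (the path below is just a made-up example):

#basename keeps only the file name at the end of a path
print(basename("/home/user/images/player-picture.jpg"))   #prints 'player-picture.jpg'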

Our aim is to extract a picture of every player in the Premier League. We have identified Transfermarkt as our target, given that each player page should have a picture. Our secondary aim is to run this in one piece of code and not to run a new command for each player or team individually. To do this, we need to follow this process:

1) Locate a list of teams in the league with links to a squad list – then save these links

2) Run through each squad list link and save the link to each player’s page

3) Locate the player’s image and save it to our local computer

What seems like a massive task distils down to three main steps. Below, we’ll break each one down.

Firstly, however, we need to set our headers to give the appearance of a human user when we call for data from Transfermarkt – we’ll also check just after the next cell that these are accepted.

In [2]:
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
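
Before going further, it is worth confirming that Transfermarkt accepts these headers. A quick request to the league page that we scrape below should come back with a status code of 200:

#A status code of 200 means that the request was accepted
response = requests.get('https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1',
                        headers=headers)
print(response.status_code)    #Expect 200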

The Premier League page is the obvious place to start: each team name on it links through to that club’s squad page.

All that we need to do is process the page with BeautifulSoup (check the first article for more details) and identify the team links with ‘soup.select()’, using the links’ CSS selector. These links are then added to a list for later.

Finally, we append these links to the transfermarkt domain so that we can call them on their own.

Check out the annotated code below for detailed instructions:

In [3]:
#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that each link points to. The page lists each team's
#link twice, so we only take the links in positions 1,3,5,etc. of our list
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]
    

So we now have 20 team links, with each looking like this:

In [4]:
teamLinks[14]
Out[4]:
'https://www.transfermarkt.co.uk/leicester-city/startseite/verein/1003/saison_id/2017'

We will now iterate through each of these team links and do the same thing, only this time we are taking player links and not squad links. Take a look through the code below, but you’ll notice that it is very similar to the last chunk of instructions – the key difference being that we will run it within a loop to go through all 20 teams in one go.

In [5]:
#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    #Download and process the team page
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all player links
    links = soup.select("a.spielprofil_tooltip")

    #For each link, take the location that it points to and add the
    #Transfermarkt domain in front, making it ready to call later
    for j in range(len(links)):
        playerLinks.append("https://www.transfermarkt.co.uk"+links[j].get("href"))

#The pages list each player more than once - list(set(XXX)) removes the
#duplicates (note that this does not preserve the order of the list)
playerLinks = list(set(playerLinks))

Locate and save each player’s image

We now have a lot of links for players…

In [6]:
len(playerLinks)
Out[6]:
526

526 links, in fact! We now need to iterate through each of these links and save the player’s picture.

Hopefully you are now comfortable with the process of downloading and processing a webpage, but the second part of this step needs some unpacking – locating the image and saving it.

Once again, we are locating elements in the page. The best way to identify the correct image turns out to be its ‘title’ attribute, which is the player’s name. Entering each name manually would be ridiculous, so we need to find it elsewhere on the page. Fortunately, this is easy, as the name is the only ‘h1’ element on the page.

Subsequently, we assign this name to the name variable, then use it to call the correct image.

When we call the image, we actually need the location where it is saved on the website’s server, which we find through the image’s source. The source contains some extra information that we don’t need, so we use .split() to isolate the part that we do need and save it to our ‘src’ variable.
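
To see exactly what that ‘.split()’ step does, here it is on a made-up address of the same shape as Transfermarkt’s image sources:

#Everything from '?lm' onwards is extra information that we don't need
example = 'https://img.transfermarkt.co.uk/portrait/header/12345-1234567890.jpg?lm=1'
print(example.split("?lm")[0])
#prints 'https://img.transfermarkt.co.uk/portrait/header/12345-1234567890.jpg'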

The final thing to do is save the image from this source location. We do this by opening a new file named after the player, then writing the content from the source into it. Incredibly, Python does this in just two lines. All images will be saved into the folder where your Python notebook or script is saved.

Try and follow through the code below with these instructions:

In [7]:
for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src, headers=headers).content)

This will take a couple of minutes to run, as we have 526 images to find and save. However, this short investment of time will save you 10 minutes each week of hunting for these pictures. Additionally, just change the link from the Premier League page to apply the code to any other league (assuming Transfermarkt lays that league out in the same way!).
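
As a minimal sketch of that idea, we can wrap the team-link step in a function that takes any league page. The Bundesliga address below is given as an example and assumes that its page follows the same pattern as the Premier League’s:

#Reuse the team-link logic for any league page with the same layout
def getTeamLinks(leaguePage):
    tree = requests.get(leaguePage, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')
    links = soup.select("a.vereinprofil_tooltip")
    #Each team's link appears twice, so take every other one - without
    #assuming that the league has exactly 20 teams
    return ["https://www.transfermarkt.co.uk"+links[i].get("href")
            for i in range(1, len(links), 2)]

teamLinks = getTeamLinks('https://www.transfermarkt.co.uk/bundesliga/startseite/wettbewerb/L1')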

Your folder should now look something like this:

Images scraped from Transfermarkt

Summary

The aim of this article is to demonstrate two things. Firstly, how to collect links from a page and loop through them to further automate scraping. We have seen two examples of this – collecting team links and collecting player links. By taking this broader approach to scraping and understanding a website’s structure, we can collect information en masse, saving lots of time in the future.

Secondly, how to collect and save images. Images live on the website’s server, so we must locate where each one is stored and save it from that location. Python makes this simple in execution, as we can save a file from a source location in just two lines. By combining this with our iterations through teams and players, we can save 526 pictures in a matter of minutes!

For further development, you may want to expand the data that you collect from each player, apply this logic to different sites, or even navigate through your file system to save players into team folders – a starting point for that last idea is sketched below.
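
Here is a minimal sketch of that. It would slot into the image loop above in place of the final save step (so name and src are already defined), and the team name is illustrative – in practice you would carry it over from the squad page:

import os

#Illustrative team name - in practice, carry this over from the squad page
team = "Leicester City"

#Create the team folder if it doesn't already exist
os.makedirs(team, exist_ok=True)

#Build a path like 'Leicester City/Jamie Vardy.jpg' and save the image there
filePath = os.path.join(team, name[0].text+".jpg")
with open(filePath, "wb") as f:
    f.write(requests.get(src, headers=headers).content)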

For your next FC Python course, why not take a look at our visualisation tutorials?