Before starting the article, I’m obliged to mention that web scraping is legally and ethically a grey area in lots of circumstances. Please consider the positive and negative effects of what you scrape before doing so!
Warning over. Web scraping is a hugely powerful tool that, when done properly, can give you access to huge, clean data sources to power your analysis. The applications are just about endless for anyone interested in data. As a professional analyst, you can scrape fixtures and line-up data from around the world every day to plan scouting assignments or alert you to youth players breaking through. As an amateur analyst, it is quite likely to be your only source of data for analysis.
This tutorial is just an introduction to web scraping with Python. It will take you through the basic process of loading a page, locating information and retrieving it. Combine the knowledge on this page with for loops to cycle through a site, and enough HTML knowledge to understand a web page, and you’ll be able to grab just about any data you can find.
Let’s fire up our modules & get started. We’ll need requests (to access and process web pages with Python) and BeautifulSoup (to make sense of the code that makes up the pages), so make sure you have these installed.
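If you don’t have them already, all three can usually be installed from the command line with pip (note that the BeautifulSoup package is published as beautifulsoup4):

pip install requests beautifulsoup4 pandas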
import requests
from bs4 import BeautifulSoup
import pandas as pd
Our process for extracting data is going to go something like this:
- Load the webpage containing the data.
- Locate the data within the page and extract it.
- Organise the data into a dataframe.
For this example, we are going to take the player names and values for the most expensive players in a particular year. You can find the page that we’ll use here.
The following sections will run through each of these steps individually.
Load the webpage containing the data
The very first thing that we are going to do is create a variable called ‘headers’ and give it a user-agent string that tells the website we are a browser, and not a scraping tool. In short, we’ll be blocked if we are thought to be scraping!
Next, we have three lines. The first one assigns the address that we want to scrape to a variable called ‘page’.
The second uses the requests library to grab the code of the page and assign it to ‘pageTree’. We pass our headers variable here so that the site sees us as a regular browser.
Finally, the BeautifulSoup module parses the page’s HTML into a structure that we can search through for the data we want to extract. This is saved to ‘pageSoup’, and you can find all three lines here:
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id=2000"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
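Before moving on, it is worth a quick check that the request actually succeeded – this isn’t part of the original three lines, but it can save some head-scratching if the site has blocked or redirected us:

#A status code of 200 means the page came back successfully
pageTree.status_code
#Or raise an error immediately if anything went wrong
pageTree.raise_for_status()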
Locate the data within a page & extract it
To fully appreciate what we are doing here, you probably need a basic grasp of HTML – the language that structures a webpage. As simply as I can put it for this article, HTML is made up of elements, like a paragraph or a link, that tell the browser what to render. For scraping, we will use this information to tell our program what information to take.
Take another look at the page we are scraping. We want two things – the player name and the transfer value.
The player name is a link. This is denoted as an ‘a’ tag in HTML, so we will use the ‘find_all’ function to look for all of the a tags in the page. However, there are obviously lots of links! Fortunately, we can use the class given to the players’ names specifically on this page to only take these ones – the class name is passed to the ‘find_all’ function as a dictionary.
This function will return a list with all elements that match our criteria.
If you’re curious, classes are usually used to apply styles (such as colour or border) to elements in HTML.
The code to extract the players’ names is here:
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
#Let's look at the first name in the Players list.
Players[0].text
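If you want to check more than just the first entry, a couple of optional extras (not part of the walkthrough above) will count the matches and preview a few names:

#How many player links did we find?
len(Players)
#Preview the first five names
[p.text for p in Players[:5]]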
Looks like that works! Now let’s take the values.
As you can see on the page, the values are not a link, so we need to find a new feature to identify them by.
They are in a table cell, denoted by ‘td’ in HTML, so let’s look for that. The class to highlight these cells specifically is ‘rechts hauptlink’, as you’ll see below.
Let’s assign this to Values and check Figo’s transfer value:
Values = pageSoup.find_all("td", {"class": "rechts hauptlink"})
Values[0].text
That’s a lot of money! Even in today’s market! But according to the page, our data is correct. Now all we need to do is process the data into a dataframe for further analysis or to save for use elsewhere.
Organise the data into a dataframe
This is pretty simple. We know that there are 25 players in the list, so let’s use a for loop to add the first 25 names and values to new lists (to ensure that no stragglers elsewhere in the page jump on). We’ll then create a new dataframe from these lists:
PlayersList = []
ValuesList = []
for i in range(0, 25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)

df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})
df.head()
And now we have a dataframe with our scraped data, pretty much ready for analysis!
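One optional extra before analysing anything: the values are currently strings rather than numbers. The sketch below assumes they come through as a pound sign followed by the amount in millions (check your own output, as the format may differ), and the filename is just an example:

#Strip the pound sign and the 'm' suffix, then convert to a float
df["ValueMillions"] = (df["Values"]
                       .str.replace("£", "", regex=False)
                       .str.replace("m", "", regex=False)
                       .astype(float))

#Save a copy for use elsewhere
df.to_csv("transfers_2000.csv", index=False)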
Summary
This article has gone through the absolute basics of scraping: we can now load a page, identify the elements that we want to scrape, and process them into a dataframe.
There is more that we need to do to scrape efficiently though. Firstly, we can apply a for loop to the whole program above, changing the initial webpage name slightly to scrape the next year – I’ll let you figure out how!
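If you would like a head start on that loop, one possible skeleton is below – it is not a full solution, just the idea that only the ‘saison_id’ value at the end of the URL changes, with a polite pause between requests:

import time

base_url = "https://www.transfermarkt.co.uk/transfers/transferrekorde/statistik/top/plus/0/galerie/0?saison_id="

for year in range(2000, 2005):
    pageTree = requests.get(base_url + str(year), headers=headers)
    pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
    #...repeat the find_all and dataframe steps from above for each year...
    time.sleep(2)  #Give the site a rest between requests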
You will also need to understand more about HTML, particularly class and ID selectors, to get the most out of scraping. Regardless, if you’ve followed along and understand what we’ve achieved and how, then you’re in a good place to apply this to other pages.
The techniques in this article provided the data for our joyplots tutorial – why not give that a read next?