# Opta

## Convex Hulls for Football in Python

Building on what you can do with event data from the Opta (or any other) event feed, we’re going to look at one way of visualising a team’s defensive actions. Convex hulls, popularised in the football analytics community by Thom Lawrence (please let us know if we should add anyone else!), display the smallest convex shape that contains a set of points, which makes them a handy outline of the area that a player’s actions cover.

In this tutorial, we’re going to go through selecting and preparing our data to create these, before plotting the hull. We’ll then repeat this in a for loop to chart every player together and see where a team is being forced to defend.

For this article, we’ll be making use of the ConvexHull class within the SciPy library. The wider library is a phenomenal resource for more complex maths needs in Python, so give it a look if you’re interested.

Outside of ConvexHull, we’ll need pandas and numpy for importing and manipulating data, while Matplotlib will plot our data. Let’s import them and get started:

In [1]:
```from scipy.spatial import ConvexHull

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.patches import Arc

%matplotlib inline
```

With the modules ready, we’re going to import our data. For this example, our data contains all defensive actions in one match, split by player and team.

Let’s take a look at how it is structured with .head():

In [2]:
```defdata = pd.read_csv("def_table.csv")

#Show the first five rows
defdata.head()
```
Out[2]:
player team minute x y outcome
0 50471 Team A 1 38.9 31.8 1
1 19197 Team A 6 52.6 68.4 1
2 42593 Team B 6 39.8 83.9 1
3 19188 Team A 7 3.5 37.9 1
4 82403 Team A 8 17.9 98.5 1

So each row is a defensive action, and we can see the x/y coordinates and who did it.

We just want one player’s actions, so we’ll create a new dataframe for the first player ID – 50471:

In [3]:
```player50471 = defdata.loc[(defdata['player'] == 50471)]

#Show the first five rows for this player
player50471.head()
```
Out[3]:
player team minute x y outcome
0 50471 Team A 1 38.9 31.8 1
12 50471 Team A 22 30.0 33.2 1
13 50471 Team A 25 64.7 94.9 1
51 50471 Team A 65 31.2 32.2 1
56 50471 Team A 72 46.5 22.6 1

To create a convex hull, we need to build it from a list of coordinates. We have our coordinates in the dataframe already, but need them to look something close to the below:

(38.9, 31.8), (30.0, 33.2), (64.7, 94.9) and so on…

Thanks to the pandas module, this is made easy by adding .values to the end of the data that we want to see in arrays, rather than columns:

In [4]:
```defpoints = player50471[['x', 'y']].values

defpoints
```
Out[4]:
```array([[38.9, 31.8],
[30. , 33.2],
[64.7, 94.9],
[31.2, 32.2],
[46.5, 22.6],
[30.3, 49.8],
[22.9, 92.5]])```
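As a side note, recent pandas documentation recommends `.to_numpy()` over `.values` for this conversion; both give the same array here. A minimal sketch, using a small hand-typed dataframe that copies three of the rows shown above (the names `sample` and `coords` are just for illustration):

```python
import pandas as pd

# Hand-typed copy of three rows from the output above, for illustration only
sample = pd.DataFrame({
    "player": [50471, 50471, 50471],
    "x": [38.9, 30.0, 64.7],
    "y": [31.8, 33.2, 94.9],
})

# Select the two coordinate columns and convert them to a 2D NumPy array
coords = sample[["x", "y"]].to_numpy()

print(coords.shape)  # one row per action, one column per coordinate
print(coords[0])     # the first action's x/y pair
```

Either spelling can be passed straight to ConvexHull, so use whichever you find clearer.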

Our data is now ready to be used to create our convex hull. On its own, the hull is actually pretty boring: creating it simply returns an object that stores the result without displaying anything. Let’s see how this is done below:

In [5]:
```#Create a convex hull object and assign it to the variable hull
hull = ConvexHull(player50471[['x','y']])

#Display hull
hull
```
Out[5]:
`<scipy.spatial.qhull.ConvexHull at 0x1faa0c96dd8>`

See, that is pretty boring. But we can make it so much cooler when we plot the hull onto a chart.

Let’s start by plotting all 7 event locations as dots on a scatter chart:

In [6]:
```#Plot the X & Y location with dots
plt.plot(player50471.x,player50471.y, 'o')
```
Out[6]:
`[<matplotlib.lines.Line2D at 0x1faa2d10908>]`

Next up, we’re going to add lines around the most extreme parts of the plot. The edges that join these outermost points are stored in an attribute of the hull object called simplices: each simplex is a pair of row numbers from our array of points, marking the two ends of one edge. We can just use a for loop to iterate through the simplices and draw a line for each:

In [7]:
```#Plot the X & Y location with dots
plt.plot(player50471.x,player50471.y, 'o')

#Loop through each of the hull's simplices
for simplex in hull.simplices:
    #Draw a black line between each
    plt.plot(defpoints[simplex, 0], defpoints[simplex, 1], 'k-')
```

Looks kind of abstract, but a lot more interesting than the hull object on its own!
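If you are curious about what the loop above is iterating over, the hull object can be inspected directly. The sketch below rebuilds a hull from the seven x/y pairs shown in the defpoints output (typed in by hand, and named `pts` and `demo_hull` here so that nothing above is overwritten) and prints its main attributes:

```python
import numpy as np
from scipy.spatial import ConvexHull

# The seven x/y pairs from the defpoints array shown earlier
pts = np.array([
    [38.9, 31.8], [30.0, 33.2], [64.7, 94.9], [31.2, 32.2],
    [46.5, 22.6], [30.3, 49.8], [22.9, 92.5],
])

demo_hull = ConvexHull(pts)

# Row numbers of the points that sit on the boundary (counterclockwise for 2D input)
print(demo_hull.vertices)

# Each simplex is a pair of row numbers: the two ends of one edge
print(demo_hull.simplices)

# For 2D input, .volume holds the enclosed area and .area holds the perimeter
print(round(demo_hull.volume, 1), round(demo_hull.area, 1))
```

Rows 0 and 5 should be missing from the vertices, as those two actions fall inside the shape drawn by the other five. The slightly confusing naming of `.volume` and `.area` comes from SciPy treating every hull as an N-dimensional object.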

Let’s just add in some shading to make our area even clearer. We’ll also make it mostly see-through by setting the alpha argument to 0.3, which means 30% opacity:

In [8]:
```#Plot the X & Y location with dots
plt.plot(player50471.x,player50471.y, 'o')

#Loop through each of the hull's simplices
for simplex in hull.simplices:
    #Draw a black line between each
    plt.plot(defpoints[simplex, 0], defpoints[simplex, 1], 'k-')

#Fill the area within the lines that we have drawn
plt.fill(defpoints[hull.vertices,0], defpoints[hull.vertices,1], 'k', alpha=0.3)
```
Out[8]:
`[<matplotlib.patches.Polygon at 0x1faa2f1bb70>]`

Perfect, we have one player’s zone of defensive actions plotted. We don’t have a pitch or any other players on there yet, but this is great work!

Let’s work on a bigger project now – let’s do all of this over and over for a whole team. We’ll take a single team out of our dataset, then use for loops to create the plot for each player (exactly as above) before plotting them together.

First up, let’s extract Team B into one dataframe:

In [9]:
```TeamB = defdata.loc[(defdata.team == "Team B")]

#Show the first five rows for Team B
TeamB.head()
```
Out[9]:
player team minute x y outcome
2 42593 Team B 6 39.8 83.9 1
5 42593 Team B 8 44.7 91.5 1
6 17476 Team B 12 23.1 1.3 1
8 57112 Team B 17 4.4 57.7 1
9 42593 Team B 17 5.8 58.9 1

Perfect, just as before, but with different players on a single team.

We’ll now need to go through each player and do exactly what we did to plot just a single player. First up, we need to find out who we are dealing with. We can use .unique() to pool each individual into the variable ‘players’:

In [10]:
```players = TeamB["player"].unique()
players
```
Out[10]:
```array([42593, 17476, 57112, 27789, 14664, 61366, 37748, 57001, 28554,
17740], dtype=int64)```

Every player now just needs to go into a for loop, where we’ll do exactly what we did before to get a plot. We’ll create a temporary dataframe for each player, create a hull from the x/y coordinates, then plot the lines and fill in the shape with a transparent colour. Let’s take a look with the help of some comments:

In [11]:
```#For each player in our players variable
for player in players:

    #Create a new dataframe for the player
    df = TeamB[(TeamB.player == player)]

    #Create an array of the x/y coordinate groups
    points = df[['x', 'y']].values

    #A 2D hull needs at least 3 points, so skip any player with fewer
    if len(points) < 3:
        continue

    #Create the hull. If the points can't form one (e.g. they all sit on one line), skip the player
    #Skipping matters: otherwise the previous player's hull would be reused with this player's points
    try:
        hull = ConvexHull(points)
    except Exception:
        continue

    #Draw the lines and fill the shape with red at 5% opacity
    for simplex in hull.simplices:
        plt.plot(points[simplex, 0], points[simplex, 1], 'k-')
    plt.fill(points[hull.vertices,0], points[hull.vertices,1], 'red', alpha=0.05)

#Once all of the individual hulls have been created, plot them together
plt.show()
```

Fantastic work! We now have all of the players with enough data points on the chart. The transparency is a nice touch, as we can see any hidden players and where any crossover happens.

Our plot leaves out any players with fewer than 3 defensive actions in the data, because a hull needs at least three points that do not all sit on one line. You may want to plot these players as lines or dots instead. If so, you should be able to figure out how to do this from the code already, or from our other visualisation tutorials.
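As a starting point, here is one way the left-out players could still be drawn. This is only a sketch: `plot_actions` is a hypothetical helper written for this example, and the coordinates passed to it are made up.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull


def plot_actions(ax, points):
    """Draw a hull for 3+ points, a line for 2 and a dot for 1. Returns the shape drawn."""
    points = np.asarray(points, dtype=float)
    if len(points) >= 3:
        try:
            hull = ConvexHull(points)
        except Exception:
            # Qhull raises an error when every point sits on one line
            ax.plot(points[:, 0], points[:, 1], 'k-')
            return "line"
        ax.fill(points[hull.vertices, 0], points[hull.vertices, 1], 'red', alpha=0.05)
        for simplex in hull.simplices:
            ax.plot(points[simplex, 0], points[simplex, 1], 'k-')
        return "hull"
    if len(points) == 2:
        ax.plot(points[:, 0], points[:, 1], 'k-')
        return "line"
    if len(points) == 1:
        ax.plot(points[:, 0], points[:, 1], 'ko')
        return "dot"
    return "nothing"


fig, ax = plt.subplots()

# Made-up coordinates, purely for illustration
print(plot_actions(ax, [[10, 10], [30, 15], [20, 40], [18, 20]]))
print(plot_actions(ax, [[60, 60], [70, 80]]))
print(plot_actions(ax, [[50, 20]]))
```

Inside the team loop, the same helper could be called with each player’s points array in place of the hull code.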

As for next steps, you might want to plot this on a pitch (pitch drawing tutorial here):

So now we can see where our team are performing their defensive actions – although remember a few players are missing. In terms of analysis, does this suggest that this team defends better on the left? Or is it more likely that they faced a team that largely attacked on that side? Visualisation is just one small piece of any analysis!

### Summary

In this tutorial, we have practiced filtering a dataframe by player or team, then using SciPy’s convex hull tool to create the data for plotting the smallest area that contains our datapoints.

Some nice extensions to this that you may want to play with include adding some annotations for player names, or changing colours for each player. Of course, these charts aren’t limited to defensive metrics – why not take a look at penalty area entry pass zones, or compare goalkeeper distributions? However you build on this work, show us what you’re achieving on Twitter @FC_Python!
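To give a flavour of the first two extensions, the sketch below gives each player a different colour and writes a label in the middle of each hull. The player names and coordinates are invented, and placing the label at the average of the hull’s corner points is just one simple choice among many.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

# Made-up player names and coordinates, purely for illustration
actions = {
    "Player 1": np.array([[10, 20], [25, 15], [30, 40], [12, 45], [20, 30]]),
    "Player 2": np.array([[40, 60], [55, 65], [60, 90], [45, 85]]),
}

fig, ax = plt.subplots()
colours = plt.cm.tab10.colors  # ten distinct colours that ship with Matplotlib

for i, (name, points) in enumerate(actions.items()):
    hull = ConvexHull(points)
    outline = points[hull.vertices]
    colour = colours[i % len(colours)]

    # Fill the hull in this player's colour
    ax.fill(outline[:, 0], outline[:, 1], color=colour, alpha=0.3)

    # Label the hull at the average position of its corner points
    label_x, label_y = outline.mean(axis=0)
    ax.annotate(name, (label_x, label_y), ha="center", va="center")

# In a notebook the figure appears at the end of the cell; in a script, call plt.show()
```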

Find further visualisation tutorials here!

## Parsing Opta F24 Files: An Introduction to XML in Python

If you want to move past aggregate data on players and teams, you probably want to start looking at match event data. Obtaining this data can be difficult, and even when you get there, it is often in an XML file, rather than a table that you might be more comfortable with. This article will take you through parsing an Opta F24 XML file into a table containing the passes within a game.

Firstly, it is probably worth understanding what an XML file is. It is simply a text file structured to hold data. It uses things called tags to explain what bits of data it holds and organises them in a structure that makes it easy to understand and use elsewhere. If you have any experience with HTML, it is very similar. Here is a simple example showing how an XML file might look to send data about managerial changes:

```<team name = "Manchester United">
<manager name = "Jose Mourinho" start = "2016/05/27" end = "2018/12/18" />
<manager name = "Ole Gunnar Solskjær" start = "2018/12/19" />
</team>
```

So here we have a hierarchy of team -> manager, with attributes within each data point such as team name or start date. Hopefully it is clear to see how this standardised method of sharing information might make it so much easier to use data in our work or share data with others – especially when records carry different numbers of attributes, as with the missing end date above.
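To make the hierarchy concrete, here is a minimal sketch that reads the managerial example above with Python’s built-in ElementTree module. The XML is held in a string (`xml_text`, a name chosen for this example) rather than a file so that the snippet runs on its own.

```python
import xml.etree.ElementTree as et

# The managerial-changes example from above, held in a string
xml_text = """
<team name = "Manchester United">
    <manager name = "Jose Mourinho" start = "2016/05/27" end = "2018/12/18" />
    <manager name = "Ole Gunnar Solskjær" start = "2018/12/19" />
</team>
""".strip()

team = et.fromstring(xml_text)

# The outer tag and its attributes
print(team.tag, team.attrib["name"])

# Looping over an element visits its children, here the two managers
for manager in team:
    # .get() returns None when an attribute is absent, such as the missing end date
    print(manager.attrib["name"], manager.get("start"), manager.get("end"))
```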

However, if we want to use lots of Python’s plotting and analysis capabilities, we will likely need this data in a table – and this is what we will work towards. We’ll take the following steps:

1. Import modules
2. Import and parse the XML file
3. Explore the Opta F24 XML structure
4. Iterate through the match events and save pass data into lists
5. Merge these lists together into a table

Once we get through all of these steps, we’ll have a nice table with which we can do loads of plotting and analysis!

As a quick note, this Opta feed isn’t publicly available and is largely only found in clubs, analytics companies and media organisations. Manchester City and Opta briefly released a season’s worth of data from 2011-12, although FC Python unfortunately do not have this available.

Follow along and import the modules below to get started!

In [1]:
```import csv
import xml.etree.ElementTree as et
import numpy as np
import pandas as pd
from datetime import datetime as dt
```

Now that we’re ready to go, the first thing that we need to do is import our XML file. The XML module that we added above makes this really simple. The two lines below will take an XML file and parse it into a tree of elements that we can navigate with square brackets, just like we would with a list:

In [2]:
```tree = et.ElementTree(file = "yourf24XMLfile.xml")
gameFile = tree.getroot()
```

Opening up the XML file in a text editor, we can see that the F24 file is structured similarly to the below:

```<Container>
<Game>
<Event>
<Event Qualifiers>
</Event Qualifiers>
</Event>
...
<Event>
<Event Qualifiers>
</Event Qualifiers>
</Event>
</Game>
</Container>
```

So we have a container for all the data, then a game that holds each event. Within each event, there are attributes telling us about the event as well as qualifiers giving even more information about each event.
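Since the real feed isn’t publicly available, the sketch below builds a tiny stand-in with the same container -> game -> event -> qualifier nesting and steps down through it with square brackets. The tag names (`Games`, `Game`, `Event`, `Q`) and the handful of attributes are placeholders for illustration, so check them against your own file.

```python
import xml.etree.ElementTree as et

# A tiny stand-in for an F24 file. Tag names and values are placeholders for illustration.
sample_xml = """
<Games>
    <Game home_team_name="Crystal Palace" away_team_name="Manchester City">
        <Event type_id="1" x="50.0" y="50.0">
            <Q qualifier_id="140" value="52.2" />
            <Q qualifier_id="141" value="50.7" />
        </Event>
        <Event type_id="1" x="52.2" y="50.7" />
    </Game>
</Games>
""".strip()

root = et.fromstring(sample_xml)

game = root[0]            # first child of the container: the game
first_event = game[0]     # first child of the game: an event
first_q = first_event[0]  # first child of the event: a qualifier

print(game.attrib["home_team_name"])
print(first_event.attrib["x"], first_event.attrib["y"])
print(first_q.attrib["qualifier_id"], first_q.attrib["value"])
print(len(game), "events,", len(first_event), "qualifiers on the first one")
```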

We will get around to making sense of the events, but let’s take a look at what information we are given about the game itself. We can do this with the ‘.attrib’ property of an element, which holds that element’s attributes as a dictionary. Let’s check out the attributes of the first entry in our XML file (the one containing the match events). We can do this by using square brackets to select a specific entry:

In [3]:
```gameFile[0].attrib
```
Out[3]:
```{'away_team_id': '43',
'away_team_name': 'Manchester City',
'competition_id': '8',
'competition_name': 'English Barclays Premier League',
'game_date': '2015-09-12T15:00:00',
'home_team_id': '31',
'home_team_name': 'Crystal Palace',
'id': '803206',
'matchday': '5',
'period_1_start': '2015-09-12T15:00:12',
'period_2_start': '2015-09-12T16:05:00',
'season_id': '2015',
'season_name': 'Season 2015/2016'}```

Awesome, we get loads of information about the match in a dictionary, where the data is laid out with a key, then the value. For example, we can see the team names and also their Opta IDs.

These match details could be useful if we were processing lots of events at the same time and needed to differentiate them. Let’s look at a quick example of formatting a string with the details found here:

In [4]:
```#Print a string with the two teams, using .format() and the attrib dictionary to fill in the names
print("{} vs {}".format(gameFile[0].attrib["home_team_name"], gameFile[0].attrib["away_team_name"]))
```
Out[4]:
```Crystal Palace vs Manchester City
```

Moving onto match events, we saw in the structure of the file that the match events lie within the game details tags. Let’s use another square bracket to navigate to the first event:

In [5]:
```gameFile[0][0].attrib
```
Out[5]:
```{'event_id': '1',
'id': '1467084299',
'last_modified': '2015-09-16T16:50:12',
'min': '0',
'outcome': '1',
'period_id': '16',
'sec': '0',
'team_id': '43',
'timestamp': '2015-09-12T14:00:09.141',
'type_id': '34',
'version': '1442418612592',
'x': '0.0',
'y': '0.0'}```

Looking through this, there is a lot to try and get our heads around. We can see that there are event keys like min, sec, x and y – these are quite easy to understand. But the values, like outcome: 1 and type_id: 34, don’t really make much sense by themselves. This is particularly important when it comes to teams, as we only have their ID and not their name. We’ll tidy that up soon.

This is because the Opta XML uses lots of IDs rather than names. You’ll need to find documentation from Opta (although versions of this can be Googled) to find out what all of them are. But first one’s free: a type_id of 1 is a pass, and that is the ID we will filter on when we pull out the passes. Note that event_id is a separate field, and that the event above has a type_id of 34, so it is not a pass itself.
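A quick way to get a feel for which IDs appear in a file is to count them. The sketch below tallies the type_id of every event in a placeholder XML string (a real file holds far more events, and the tag names here are assumptions). Only type 1 is given a name, because pass is the only type this article relies on; the rest are left for the Opta documentation.

```python
import xml.etree.ElementTree as et
from collections import Counter

# Placeholder events for illustration only
sample_xml = """
<Games>
    <Game>
        <Event type_id="34" />
        <Event type_id="1" />
        <Event type_id="1" />
        <Event type_id="5" />
        <Event type_id="1" />
    </Game>
</Games>
""".strip()

game = et.fromstring(sample_xml)[0]

# Tally how often each type_id appears
type_counts = Counter(event.attrib.get("type_id") for event in game)

# Only ID 1 (pass) is named here; look the others up in the Opta documentation
known_types = {"1": "pass"}

for type_id, count in type_counts.most_common():
    print(type_id, known_types.get(type_id, "unknown"), count)
```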

You might also remember that our events contained qualifiers. Let’s again use square brackets to pull the first one out for the event above:

In [6]:
```gameFile[0][0][0].attrib
```
Out[6]:
```{'id': '1784607081',
'qualifier_id': '44',
'value': '1, 2, 2, 3, 2, 2, 3, 3, 4, 4, 3, 5, 5, 5, 5, 5, 5, 5'}```

We don’t need to know what this means, but it is useful to understand the structure of the file as we will be going on to iterate through each event and qualifier to turn it into a table for further analysis.

At its simplest, all we are going to do is loop through each of the events that we have identified above, identify the passes and take the details that we want from each. These details will go into different lists for different data categories (player, team, success, etc.). We will then put these lists into a table which is then ready for analysis, plotting or exporting.

Firstly though, I’d like to have my team names come through to make this a bit more readable, rather than the team ID. The events only carry the team ID, so let’s create a dictionary that will allow us to later swap the ID for the team name:

In [7]:
```team_dict = {gameFile[0].attrib["home_team_id"]: gameFile[0].attrib["home_team_name"],
gameFile[0].attrib["away_team_id"]: gameFile[0].attrib["away_team_name"]}

print(team_dict)
```
```{'31': 'Crystal Palace', '43': 'Manchester City'}
```

For this tutorial, we’re simply going to take the x/y locations of the pass origin and destination, the time of the pass, the team and whether or not it was successful.

There’s so much more that we could take, such as the players, the pass length or any other details that you spot in the XML. If you’d like to pull those out too, doing so will make a great extension to this tutorial!
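As an example of one such extension, pass length can be worked out from the origin and destination coordinates. The sketch below assumes that the coordinates run from 0 to 100 along each side of the pitch and that the pitch measures 105 m by 68 m. Neither figure comes from the feed itself, and `pass_length` is a hypothetical helper written for this example.

```python
import math

# Assumed pitch size in metres; the feed does not supply this, so adjust to suit
PITCH_LENGTH = 105.0
PITCH_WIDTH = 68.0


def pass_length(x_start, y_start, x_end, y_end):
    """Return the straight-line pass length in metres from 0-100 scaled coordinates."""
    dx = (float(x_end) - float(x_start)) / 100 * PITCH_LENGTH
    dy = (float(y_end) - float(y_start)) / 100 * PITCH_WIDTH
    return math.hypot(dx, dy)


# Attribute values arrive from the XML as strings, so the function converts them first
print(round(pass_length("50.0", "50.0", "52.2", "50.7"), 2))
```

Scaling x and y separately matters here: one unit along the length of the pitch covers more ground than one unit across its width.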

We’re going to start by creating the empty lists for our data:

In [8]:
```#Create empty lists for the 8 columns we're collecting data for
x_origin = []
y_origin = []
x_destination = []
y_destination = []
outcome = []
minute = []
half = []
team = []
```

The main part of the tutorial sees us going event-by-event and adding our desired details only when the event is a pass. To do this, we will use a for loop on each event, and when the event is a pass (type_id = 1), we will append the correct attribute to our lists created above. Some of these details are hidden in the qualifiers, so we’ll also iterate over those to get the information needed there.

This tutorial probably isn’t the best place to go through the intricacies of the feed, so take a look at the docs if you’re interested.

Follow the code below with comments on each of the above steps:

In [9]:
```#Iterate through each game in our file - we only have one
for game in gameFile:

    #Iterate through each event
    for event in game:

        #If the event is a pass (ID = 1)
        if event.attrib.get("type_id") == '1':

            #To the correct list, append the correct attribute using attrib.get()
            x_origin.append(event.attrib.get("x"))
            y_origin.append(event.attrib.get("y"))
            outcome.append(event.attrib.get("outcome"))
            minute.append(event.attrib.get("min"))
            half.append(event.attrib.get("period_id"))
            team.append(team_dict[event.attrib.get("team_id")])

            #Iterate through each qualifier
            for qualifier in event:

                #If the qualifier is relevant, append the information to the x or y destination lists
                if qualifier.attrib.get("qualifier_id") == "140":
                    x_destination.append(qualifier.attrib.get("value"))
                if qualifier.attrib.get("qualifier_id") == "141":
                    y_destination.append(qualifier.attrib.get("value"))
```

If this has worked correctly, we should have 8 lists populated. Let’s check out the minutes list:

In [10]:
```print("The list is " + str(len(minute)) + " long and the 43rd entry is " + minute[42])
```
```The list is 956 long and the 43rd entry is 2
```

You can check out each list in more detail, but they should work just fine.
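One caveat with separate lists is that they only line up if every pass carries both destination qualifiers (140 and 141). If one were ever missing, the destination lists would end up shorter than the rest. A sketch of an alternative that avoids this builds one dictionary per pass instead. It runs on a placeholder XML string with assumed tag names, and the second pass deliberately has no qualifiers.

```python
import xml.etree.ElementTree as et

# Placeholder XML for illustration; the second pass deliberately lacks its qualifiers
sample_xml = """
<Games>
    <Game>
        <Event type_id="1" team_id="43" min="0" x="50.0" y="50.0" outcome="1">
            <Q qualifier_id="140" value="52.2" />
            <Q qualifier_id="141" value="50.7" />
        </Event>
        <Event type_id="1" team_id="31" min="1" x="30.0" y="40.0" outcome="0" />
        <Event type_id="5" team_id="31" min="1" x="0.0" y="0.0" outcome="1" />
    </Game>
</Games>
""".strip()

rows = []
for game in et.fromstring(sample_xml):
    for event in game:
        if event.attrib.get("type_id") != "1":
            continue  # not a pass

        # Map each qualifier ID on this event to its value
        qualifiers = {q.attrib.get("qualifier_id"): q.attrib.get("value") for q in event}

        # One dictionary per pass keeps every field of that pass together
        rows.append({
            "min": event.attrib.get("min"),
            "x_origin": event.attrib.get("x"),
            "y_origin": event.attrib.get("y"),
            "x_destination": qualifiers.get("140"),  # None if the qualifier is absent
            "y_destination": qualifiers.get("141"),
            "outcome": event.attrib.get("outcome"),
        })

for row in rows:
    print(row)
```

A list of dictionaries like this can be handed straight to pd.DataFrame, with the dictionary keys becoming the column titles.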

Our final task is to create a table for our data from our lists. To do this, we just need to create a list of our column headers, then assign the list to each one. We’ll then flip our table to make it long, rather than wide – just like you would want to see in a spreadsheet. Let’s take a look:

In [11]:
```#Create a list of our 8 columns/lists
column_titles = ["team", "half", "min", "x_origin", "y_origin", "x_destination", "y_destination", "outcome"]

#Use pd.DataFrame to create our table, assign the data in the order of our columns and give it the column titles above
final_table = pd.DataFrame(data=[team, half, minute, x_origin, y_origin, x_destination, y_destination, outcome], index=column_titles)

#Transpose, or flip, the table. Otherwise, our table will run from left to right, rather than top to bottom
final_table = final_table.T

#Show us the top 5 rows of the table
final_table.head()
```
Out[11]:
team half min x_origin y_origin x_destination y_destination outcome
0 Manchester City 1 0 50.0 50.0 52.2 50.7 1
1 Manchester City 1 0 52.2 50.7 46.7 50.4 1
2 Manchester City 1 0 46.8 51.2 27.1 68.2 1
3 Manchester City 1 0 29.2 71.2 28.3 92.9 1
4 Manchester City 1 0 29.5 94.2 56.9 95.3 1

So this is great for passes, and the same logic would apply for shots, fouls or even all events at the same time – just expand on the above with the relevant IDs from the Opta docs. And analysts, if you’re still struggling to get it done, the emergency loan window is always open!
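One thing to be aware of before analysing the table: every value read from the XML arrives as a string, so the coordinate columns need converting before any maths will work on them. A minimal sketch of that conversion, using two hand-typed rows in the same shape as the table above (the dataframe name `passes` is just for this example):

```python
import pandas as pd

# Two hand-typed rows in the same shape as the table above, with values as strings
passes = pd.DataFrame({
    "team": ["Manchester City", "Manchester City"],
    "min": ["0", "0"],
    "x_origin": ["50.0", "52.2"],
    "y_origin": ["50.0", "50.7"],
    "outcome": ["1", "1"],
})

# Convert every column except the team name to a numeric type
numeric_columns = [col for col in passes.columns if col != "team"]
passes[numeric_columns] = passes[numeric_columns].apply(pd.to_numeric)

print(passes.dtypes)
print(passes["x_origin"].mean())
```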

Now that we’ve taken a complex XML and parsed the passes into a table, there are a number of things that we can do. We could put the table into a wider dataset, do some analysis of these passes, visualise straight away or just export our new table to a csv:

In [12]:
```final_table.to_csv("pass_data.csv", index=False)

```

Heatmap – code taken from the FC Python tutorial

Passmap – full tutorial here.

### Summary

In this tutorial, we’ve learned a bit about XML structures and the Opta F24 XML specifically. We have seen how to import them into Python and parse their contents into lists. With these now-full lists, we have gone on to combine them into a single table. From here, it is much easier to run our analysis, plot data or do whatever else we like. The further beauty of this comes in automating your analysis for future games and saving yourself hours of time each week.

Huge credit belongs to a number of sources that helped with this piece, including Imran Khan, FC R Stats and plenty of other posts that take a look at the feed.

With your newfound data from the Opta F24 XML, why not practice your visualisation skills with the data? Check out our collection of visualisation tutorials here.