Let’s take a look at what else pandas can do with our datasets, using a few examples of old and new operations.
#Import packages, load csv of data and show the top rows with '.head()'
import pandas as pd
import numpy as np
df = pd.read_csv('data/Results.csv')
df.head()
Cool, we have some football results! Let’s look at which teams are in the set with .unique(), which returns the unique values of a column.
print(df['HomeTeam'].unique())
print(str(len(df['HomeTeam'].unique())) + " unique teams!")
20 teams – each Premier League team from that season. A full dataset will have 380 games (20 teams play 19 home games) – let’s test this:
len(df) == (20*19)
Cool – a whole season of results!
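The length check confirms the total number of games, but not that every team hosted the same number of them. As a rough sketch of a stricter check, the snippet below uses .nunique() and .value_counts() on a hypothetical three-team mini-league (the team list and the `fixtures` frame are made up for illustration and are not taken from Results.csv).

```python
import itertools

import pandas as pd

# Hypothetical mini-league: every team hosts every other team once
teams = ['Arsenal', 'Chelsea', 'Everton']
fixtures = pd.DataFrame(
    list(itertools.permutations(teams, 2)),
    columns=['HomeTeam', 'AwayTeam'],
)

# nunique() counts the distinct values in a column directly
n_teams = fixtures['HomeTeam'].nunique()

# value_counts() shows how many home games each team has
home_games = fixtures['HomeTeam'].value_counts()

# A complete season has n * (n - 1) games and (n - 1) home games per team
season_complete = bool(
    len(fixtures) == n_teams * (n_teams - 1)
    and (home_games == n_teams - 1).all()
)
print(season_complete)
```

On the real dataset, the same idea would mean checking that every entry of df['HomeTeam'].value_counts() equals 19.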
What data do we have in the columns?
df.columns
Date, teams, full-time home and away goals (FTHG and FTAG) and the referee.
Let’s use a new operation, del, to permanently delete the Referee column from our DataFrame (the CSV file itself is left untouched). This article isn’t interested in trying to find criticism of the refs!
del df['Referee']
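Because del changes the DataFrame in place, the column is gone for the rest of the session. If you would rather keep the original around, one possible alternative is .drop(), which returns a new DataFrame. The sketch below uses a made-up single-row frame called `toy` rather than the real results.

```python
import pandas as pd

# Hypothetical one-row frame standing in for the real results
toy = pd.DataFrame({
    'HomeTeam': ['Arsenal'],
    'AwayTeam': ['Chelsea'],
    'Referee': ['A. Referee'],
})

# drop() returns a copy without the column; toy itself is unchanged
trimmed = toy.drop(columns=['Referee'])

print(list(trimmed.columns))
print(list(toy.columns))
```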
Functions with DataFrames
Pandas also allows us to easily apply functions and sums to DataFrames and the Series that they are made of.
Below, we will create two new columns:
1) Result – Home score minus away score.
2) ResultText – Strings saying whether home or away won, or if the match was tied
#Series can do lots of sums for us very quickly
df['Result'] = df['FTHG'] - df['FTAG']
#Define a new function that calculates the winner from the above number
def findWinner(value):
    #1 or higher means the home team won
    if value > 0:
        return "Home Win"
    #0 means a tie
    elif value == 0:
        return "Draw"
    #Otherwise, the away team must have won
    else:
        return "Away Win"
df['ResultText'] = df['Result'].apply(findWinner)
df.head(3)
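.apply() calls our function once per row, which is fine for 380 games. For larger datasets, a vectorised approach is often faster. The sketch below is one possible alternative using NumPy's np.select on a small made-up frame (`toy` and its scores are hypothetical); it is not a replacement for the function above, just another way to get the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: a home win, a draw and an away win
toy = pd.DataFrame({'FTHG': [2, 1, 0], 'FTAG': [0, 1, 3]})
toy['Result'] = toy['FTHG'] - toy['FTAG']

# Each condition is paired with the label at the same position;
# rows matching no condition get the default label
conditions = [toy['Result'] > 0, toy['Result'] == 0]
labels = ['Home Win', 'Draw']
toy['ResultText'] = np.select(conditions, labels, default='Away Win')

print(toy)
```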
Another application would be to see if more goals are scored by the home or away team. I’m sure you know the answer, but let’s check the averages:
print(df['FTHG'].mean())
print(df['FTAG'].mean())
As a broad rule for the season, the home team should expect a 0.4 goal advantage – great analysis!
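The advantage is simply the difference between the two means. As a small sketch with made-up scores (the numbers in `toy` are hypothetical, not the real season), it can be calculated directly rather than read off by eye:

```python
import pandas as pd

# Hypothetical scores for four games
toy = pd.DataFrame({'FTHG': [2, 1, 3, 0], 'FTAG': [1, 1, 0, 2]})

# Difference between average home goals and average away goals
home_advantage = toy['FTHG'].mean() - toy['FTAG'].mean()

# The same figure is the mean of the per-game goal difference
mean_goal_difference = (toy['FTHG'] - toy['FTAG']).mean()

print(round(home_advantage, 2))
```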
Let’s do some more. What is the average for home and away goals during home and away wins?
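#Select the goal columns first; recent pandas versions raise an error when averaging text columns such as team names
df.groupby('ResultText')[['FTHG', 'FTAG']].mean()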
Even if you win away, the home team will still score more often than not, so you’ll need to score at least 2 most of the time. The same goes for home wins, although teams winning at home score slightly fewer goals on average than teams winning away.
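A mean on its own does not say how many games sit behind it. As a sketch, .agg() can return several statistics per group at once; the frame below is made-up data (`toy` is hypothetical) used only to show the shape of the result.

```python
import pandas as pd

# Hypothetical results: two home wins, one draw, two away wins
toy = pd.DataFrame({
    'FTHG': [2, 3, 1, 0, 1],
    'FTAG': [0, 1, 1, 2, 3],
    'ResultText': ['Home Win', 'Home Win', 'Draw', 'Away Win', 'Away Win'],
})

# Mean goals and number of games for each type of result
summary = toy.groupby('ResultText')[['FTHG', 'FTAG']].agg(['mean', 'count'])

print(summary)
```

The result has two column levels (the goal column, then the statistic), so a single value is read with a tuple such as summary.loc['Home Win', ('FTHG', 'mean')].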
Another question – I’m a fan who loves to see goals, which team should I check out?
#Create a total goals field by adding up home & away.
df['TotalGoals'] = df['FTHG'] + df['FTAG']
#Group dataFrame by home team and look at the mean total goals.
#Then sort in descending order
df.groupby('HomeTeam')['TotalGoals'].mean().sort_values(ascending=False)
Looks like we should have watched Chelsea at home this season – nearly 4 goals a game. Massive yawns at Old Trafford, however!
Let’s check out the teams when they play away.
df.groupby('AwayTeam')['TotalGoals'].mean().sort_values(ascending=False)
Now Chelsea look pretty boring away from home. Arsenal, City or Palace will get our TV time if they’re playing away.
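To compare the two views without scrolling between outputs, the home and away averages can be placed side by side. This is a rough sketch on made-up fixtures; `toy`, `comparison` and the column names HomeAvg, AwayAvg and Difference are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical fixtures with total goals already calculated
toy = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Chelsea', 'Everton', 'Chelsea'],
    'AwayTeam': ['Chelsea', 'Arsenal', 'Arsenal', 'Everton'],
    'TotalGoals': [4, 2, 1, 5],
})

# Average total goals in each team's home and away games
home_avg = toy.groupby('HomeTeam')['TotalGoals'].mean()
away_avg = toy.groupby('AwayTeam')['TotalGoals'].mean()

# Building a DataFrame from two Series lines them up by team name
comparison = pd.DataFrame({'HomeAvg': home_avg, 'AwayAvg': away_avg})
comparison['Difference'] = comparison['HomeAvg'] - comparison['AwayAvg']

print(comparison.sort_values('Difference', ascending=False))
```

A large positive Difference would point to a team whose games have more goals at home than away.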
Your analysis is helping our fan to solve a real-life problem and they will hopefully make better decisions on what to watch – impressive stuff!
Summary
This section took you through a few new pandas operations, but they really are just the tip of the iceberg. You’ll learn many more as you read on here and elsewhere.
We covered deleting unneeded columns, applying our own functions and sums to create new columns, and then using these to solve a problem for a real-life fan.
Continue to learn more DataFrame operations and your analysis toolkit will grow quickly. Now that you have some comfort with DataFrames, you can dive deeper into more complex mathematical applications, or move on to visualising and communicating your insights.