Let’s take a look at what else pandas can do with our datasets, working through a few examples of old and new operations.
#Import packages, load csv of data and show the top rows with '.head()'
import pandas as pd
import numpy as np

df = pd.read_csv('data/Results.csv')
df.head()
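| | Date | HomeTeam | AwayTeam | FTHG | FTAG | Referee |
|---|---|---|---|---|---|---|
| 1 | 13/08/2016 | Crystal Palace | West Brom | 0 | 1 | C Pawson |
| 4 | 13/08/2016 | Man City | Sunderland | 2 | 1 | R Madley |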
Cool, we have some football results! Let’s look at which teams are in the set with .unique(), which gives us the unique values of a column.
print(df['HomeTeam'].unique())
print(str(len(df['HomeTeam'].unique())) + " unique teams!")
['Burnley' 'Crystal Palace' 'Everton' 'Hull' 'Man City' 'Middlesbrough' 'Southampton' 'Arsenal' 'Bournemouth' 'Chelsea' 'Man United' 'Leicester' 'Stoke' 'Swansea' 'Tottenham' 'Watford' 'West Brom' 'Sunderland' 'West Ham' 'Liverpool']
20 unique teams!
20 teams – each Premier League team from that season. A full dataset will have 380 games (20 teams play 19 home games) – let’s test this:
len(df) == (20*19)
Cool – a whole season of results!
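If you want to convince yourself of the 20 × 19 arithmetic, here is a minimal sketch that builds a stand-in fixture list and runs the same kind of check. The team names and the `fixtures` frame are placeholders invented for illustration, not the article's data.

```python
from itertools import permutations

import pandas as pd

# Placeholder team names, not the real clubs
teams = [f"Team {i}" for i in range(1, 21)]

# Every ordered (home, away) pairing of two different teams
fixtures = pd.DataFrame(list(permutations(teams, 2)),
                        columns=["HomeTeam", "AwayTeam"])

total_games = len(fixtures)
home_counts = fixtures["HomeTeam"].value_counts()

# A full season should have 380 rows, with 19 home games per team
print(total_games == 20 * 19)
print((home_counts == 19).all())
```

Running `value_counts()` on the real `HomeTeam` column is a slightly stricter test than `len(df)` alone, since it would also flag a team with too many or too few home games.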
What data do we have in the columns? The dataFrame’s .columns attribute lists them:
Index(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'Referee'], dtype='object')
Date, teams, full time home and away goals and the referee.
Let’s use a new operation, del, to permanently delete the Referee column. This article isn’t interested in trying to find criticism of the refs!
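As a rough sketch of what that deletion looks like, the snippet below rebuilds a two-row frame from the rows shown earlier and removes the column. The frame is rebuilt here only so the example runs on its own; in the article you would apply `del` to the `df` loaded from the csv.

```python
import pandas as pd

# Two rows copied from the table above, rebuilt so this runs standalone
df = pd.DataFrame({
    "Date": ["13/08/2016", "13/08/2016"],
    "HomeTeam": ["Crystal Palace", "Man City"],
    "AwayTeam": ["West Brom", "Sunderland"],
    "FTHG": [0, 2],
    "FTAG": [1, 1],
    "Referee": ["C Pawson", "R Madley"],
})

# del removes the column from df in place, so there is no undo
del df["Referee"]

remaining = list(df.columns)
print(remaining)
```

If you would rather keep the original frame untouched, `df.drop(columns=["Referee"])` returns a new dataFrame without the column instead of changing `df` itself.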
Functions with dataFrames
Pandas also allows us to easily apply functions and sums to dataFrames and the series that they are made of.
Below, we will create two new columns:
1) Result – Home score minus away score.
2) ResultText – Strings saying whether home or away won, or if the match was tied
#Series can do lots of sums for us very quickly
df['Result'] = df['FTHG'] - df['FTAG']

#Define a new function that calculates the winner from the above number
def findWinner(value):
    #1 or higher means the home team won
    if value > 0:
        return "Home Win"
    #0 means a tie
    elif value == 0:
        return "Draw"
    #Otherwise, the away team must have won
    else:
        return "Away Win"

df['ResultText'] = df['Result'].apply(findWinner)
df.head(3)
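| | Date | HomeTeam | AwayTeam | FTHG | FTAG | Result | ResultText |
|---|---|---|---|---|---|---|---|
| 1 | 13/08/2016 | Crystal Palace | West Brom | 0 | 1 | -1 | Away Win |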
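For larger datasets, one possible alternative to `.apply()` is to label the results in a single vectorised step with NumPy's `np.select`. This is only a sketch of a different technique, not the approach the article uses, and the third row of scores is made up so that all three outcomes appear.

```python
import numpy as np
import pandas as pd

# First two score lines come from the table above; the third is made up
df = pd.DataFrame({"FTHG": [0, 2, 1], "FTAG": [1, 1, 1]})
df["Result"] = df["FTHG"] - df["FTAG"]

# Conditions are checked in order; rows matching none get the default
df["ResultText"] = np.select(
    [df["Result"] > 0, df["Result"] == 0],
    ["Home Win", "Draw"],
    default="Away Win",
)
print(df)
```

Both versions should give the same labels. The `.apply()` route is easier to read and extend, while `np.select` avoids calling a Python function once per row.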
Another application would be to see if more goals are scored by the home or away team. I’m sure you know the answer, but let’s check the averages:
As a broad rule for the season, the home team should expect a 0.4 goal advantage – great analysis!
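The comparison behind that figure can be sketched with `.mean()` on each goals column. The scores below are made up purely for illustration, so the numbers will not match the real season's 0.4 gap.

```python
import pandas as pd

# Made-up scores for illustration only, not the real season data
df = pd.DataFrame({"FTHG": [0, 2, 3, 1], "FTAG": [1, 1, 0, 1]})

# Average goals per game for each side
home_avg = df["FTHG"].mean()
away_avg = df["FTAG"].mean()
home_advantage = home_avg - away_avg

print(f"Home: {home_avg:.2f}, Away: {away_avg:.2f}, Advantage: {home_advantage:.2f}")
```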
Let’s do some more. What is the average for home and away goals during home and away wins?
Even when the away side wins, the home team still scores more often than not, so an away winner will usually need at least two goals. The same holds for home wins, although teams winning at home score slightly fewer goals than teams winning away.
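One way to get averages split by outcome is to group on the ResultText column and take the mean of the two goals columns. Here is a minimal sketch of that pattern, again on made-up scores rather than the real results.

```python
import pandas as pd

# Made-up scores for illustration only
df = pd.DataFrame({"FTHG": [0, 2, 3, 1, 1, 2], "FTAG": [1, 1, 0, 1, 3, 2]})

def findWinner(value):
    # Positive difference: home win; zero: draw; negative: away win
    if value > 0:
        return "Home Win"
    elif value == 0:
        return "Draw"
    else:
        return "Away Win"

df["ResultText"] = (df["FTHG"] - df["FTAG"]).apply(findWinner)

# Mean home and away goals within each type of result
avg_by_result = df.groupby("ResultText")[["FTHG", "FTAG"]].mean()
print(avg_by_result)
```

Selecting `[["FTHG", "FTAG"]]` before `.mean()` keeps the output to the two columns we care about and avoids trying to average any text columns.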
Another question – I’m a fan who loves to see goals, which team should I check out?
#Create a total goals field by adding up home & away.
df['TotalGoals'] = df['FTHG'] + df['FTAG']

#Group dataFrame by home team and look at the mean total goals.
#Select the column before taking the mean so text columns are not averaged.
#Then sort in descending order
df.groupby('HomeTeam')['TotalGoals'].mean().sort_values(ascending=False)
HomeTeam
Chelsea           3.789474
Bournemouth       3.368421
Liverpool         3.315789
Hull              3.315789
Swansea           3.210526
Everton           3.052632
Tottenham         2.947368
Leicester         2.947368
Arsenal           2.894737
Man City          2.842105
Watford           2.842105
Sunderland        2.631579
West Ham          2.631579
West Brom         2.578947
Crystal Palace    2.578947
Stoke             2.526316
Burnley           2.421053
Middlesbrough     2.105263
Man United        2.000000
Southampton       2.000000
Name: TotalGoals, dtype: float64
Looks like we should have watched Chelsea at home this season – nearly 4 goals a game. Massive yawns at Old Trafford, however!
Let’s check out the teams when they play away.
AwayTeam
Arsenal           3.473684
Man City          3.421053
Crystal Palace    3.368421
West Ham          3.210526
Bournemouth       3.052632
Liverpool         3.000000
Tottenham         2.947368
Leicester         2.894737
Hull              2.842105
Swansea           2.842105
Watford           2.842105
Southampton       2.684211
Stoke             2.578947
Everton           2.526316
Sunderland        2.526316
Burnley           2.526316
Chelsea           2.421053
Man United        2.368421
West Brom         2.368421
Middlesbrough     2.105263
Name: TotalGoals, dtype: float64
Now Chelsea look pretty boring away from home. Arsenal, City or Palace will get our TV time if they’re playing away.
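The away ranking uses the same groupby pattern as the home one, just keyed on AwayTeam. A minimal sketch, with placeholder team names and made-up scores rather than the real results:

```python
import pandas as pd

# Placeholder teams and made-up scores, for illustration only
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C", "Team A"],
    "AwayTeam": ["Team B", "Team C", "Team A", "Team C"],
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 0, 4, 1],
})
df["TotalGoals"] = df["FTHG"] + df["FTAG"]

# Mean total goals in each team's away games, highest first
away_goals = (
    df.groupby("AwayTeam")["TotalGoals"]
      .mean()
      .sort_values(ascending=False)
)
print(away_goals)
```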
Your analysis is helping our fan to solve a real-life problem and they will hopefully make better decisions on what to watch – impressive stuff!
This section takes you through a few new Pandas operations, but they really are the tip of the iceberg. You’ll learn so many more as you read on here and elsewhere.
This page took you through deleting unneeded columns, applying our own functions and sums to create new columns, and then using these to solve a problem for a real-life fan.
Continue to learn more dataFrame operations and your analysis toolkit will keep growing. Now that you have some comfort with dataFrames, you can dive deeper into more complex mathematical applications, or move on to visualising and communicating your insights.