Pandas Operations

Let’s take a look at what else pandas can do with our datasets with a few examples of old and new operations.

In [1]:
#Import packages, load csv of data and show the top rows with '.head()'
import pandas as pd
import numpy as np

df = pd.read_csv('data/Results.csv')
df.head()
Out[1]:
Date HomeTeam AwayTeam FTHG FTAG Referee
0 13/08/2016 Burnley Swansea 0 1 J Moss
1 13/08/2016 Crystal Palace West Brom 0 1 C Pawson
2 13/08/2016 Everton Tottenham 1 1 M Atkinson
3 13/08/2016 Hull Leicester 2 1 M Dean
4 13/08/2016 Man City Sunderland 2 1 R Madley

Cool, we have some football results! Let’s look what teams are in the set with .unique() – this gives us the unique values of a column.

In [2]:
print(df['HomeTeam'].unique())

print(str(len(df['HomeTeam'].unique())) + " unique teams!")
['Burnley' 'Crystal Palace' 'Everton' 'Hull' 'Man City' 'Middlesbrough'
 'Southampton' 'Arsenal' 'Bournemouth' 'Chelsea' 'Man United' 'Leicester'
 'Stoke' 'Swansea' 'Tottenham' 'Watford' 'West Brom' 'Sunderland'
 'West Ham' 'Liverpool']
20 unique teams!

20 teams – each Premier League team from that season. A full dataset will have 380 games (20 teams play 19 home games) – let’s test this:

In [3]:
len(df) == (20*19)
Out[3]:
True

Cool – a whole season of results!

What data do we have in the columns?

In [4]:
df.columns
Out[4]:
Index(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'Referee'], dtype='object')

Date, teams, full time home and away goals and the referee.

Let’s use a new operation – del – to permanantly delete the referees list. This article isn’t interested in trying to find criticism of the refs!

In [5]:
del df['Referee']

Functions with dataFrames

Pandas also allows us to easily apply functions and sums to dataFrames and the series that they are made of.

Below, we will create two new columns:

1) Result – Home score minus away score.

2) ResultText – Strings saying whether home or away won, or if the match was tied

In [6]:
#Series can do lots of sums for us very quickly

df['Result'] = df['FTHG'] - df['FTAG']

#Define a new function that calculates the winner from the above number
def findWinner(value):
    #1 or higher means the home team won
    if (value>0):
        return("Home Win")
    #0 means a tie
    elif (value==0):
        return("Draw")
    #Otherwise, the away team must have won
    else:
        return("Away Win")
    
df['ResultText'] = df['Result'].apply(findWinner)
df.head(3)
Out[6]:
Date HomeTeam AwayTeam FTHG FTAG Result ResultText
0 13/08/2016 Burnley Swansea 0 1 -1 Away Win
1 13/08/2016 Crystal Palace West Brom 0 1 -1 Away Win
2 13/08/2016 Everton Tottenham 1 1 0 Draw

Another application would be to see if more goals are scored by the home or away team. I’m sure you know the answer, but let’s check the averages:

In [7]:
print(df['FTHG'].mean())
print(df['FTAG'].mean())
1.59736842105
1.20263157895

As a broad rule for the season, the home team should expect a 0.4 goal advantage – great analysis!

Let’s do some more. What is the average for home and away goals during home and away wins?

In [8]:
df.groupby('ResultText').mean()
Out[8]:
FTHG FTAG Result
ResultText
Away Win 0.568807 2.504587 -1.935780
Draw 0.928571 0.928571 0.000000
Home Win 2.497326 0.566845 1.930481

Even if you win away, the home team will still score more often than not. So you’ll need to score at least 2 most of the time. Same for home wins too, although teams winning at home score slightly less than teams winning away.

Another question – I’m a fan who loves to see goals, which team should I check out?

In [9]:
#Create a total goals field by adding up home & away.
df['TotalGoals'] = df['FTHG'] + df['FTAG']

#Group dataFrame by home team and look at the mean total goals.
#Then sort in descending order
df.groupby('HomeTeam').mean()['TotalGoals'].sort_values(ascending=False)
Out[9]:
HomeTeam
Chelsea           3.789474
Bournemouth       3.368421
Liverpool         3.315789
Hull              3.315789
Swansea           3.210526
Everton           3.052632
Tottenham         2.947368
Leicester         2.947368
Arsenal           2.894737
Man City          2.842105
Watford           2.842105
Sunderland        2.631579
West Ham          2.631579
West Brom         2.578947
Crystal Palace    2.578947
Stoke             2.526316
Burnley           2.421053
Middlesbrough     2.105263
Man United        2.000000
Southampton       2.000000
Name: TotalGoals, dtype: float64

Looks like we should have watched Chelsea at home this season – nearly 4 goals a game. Massive yawns at Old Trafford, however!

Let’s check out the teams when they play away.

In [10]:
df.groupby('AwayTeam').mean()['TotalGoals'].sort_values(ascending=False)
Out[10]:
AwayTeam
Arsenal           3.473684
Man City          3.421053
Crystal Palace    3.368421
West Ham          3.210526
Bournemouth       3.052632
Liverpool         3.000000
Tottenham         2.947368
Leicester         2.894737
Hull              2.842105
Swansea           2.842105
Watford           2.842105
Southampton       2.684211
Stoke             2.578947
Everton           2.526316
Sunderland        2.526316
Burnley           2.526316
Chelsea           2.421053
Man United        2.368421
West Brom         2.368421
Middlesbrough     2.105263
Name: TotalGoals, dtype: float64

Now Chelsea look pretty boring away from home. Arsenal, City or Palace will get our TV time if they’re playing away.

Your analysis is helping our fan to solve a real-life problem and they will hopefully make better decisions on what to watch – impressive stuff!

Summary

This section takes you through a few new Pandas operations, but they really are the tip of the iceberg. You’ll learn so many more as you read on here and elsewhere.

This page took you through deleting unneeded columns, applying our own functions and sums to create new columns and then used these to solve a problem for a real life fan.

Continue to learn more dataFrame operations and your analysis toolkit will grow exponentially. Now that we have some comfort with dataFrames, you can dive deeper into more complex mathematical applications or even onto visualising and communicating our insights.