Pandas

Connecting to SQLite Databases and Creating Dataframes in Python

At some point, you will come across some really useful data in a format other than a csv. One format might be an SQLite database, containing one or more related data tables. This 5-minute Extra Time article will run through extracting data from SQLite tables into a dataframe for further analysis. Firstly, we will connect to the databse, then look at queries to pull the information out. Finally, we will look at joining information from two tables into one dataframe.

This article will look at connecting to the European Football Database on Kaggle, taking inspiration from many of the kernels on there.

Let’s import the sqlite3 and pandas modules and get started.

In [1]:
import sqlite3
import pandas as pd

Connecting to the database

Firstly, we need to establish a connection to the sqlite file. We do this with the ‘connect()’ function from the sqlite3 module, passing it the location of the file on your machine.

Our database variable holds the location of the file, and the conn variable will be assigned to holding the connection to the database. Using variables like this will make our code easier to refer to later in the article and easier to change in the future.

In [2]:
database = "soccer.sqlite"
conn = sqlite3.connect(database)

Well done, you’ve connected to an sqlite database! We are now set to call for data from it and store it into a dataframe.

Saving data from sqlite3 to a dataframe

This time, we will use pandas to read from our connection with the ‘read_sql’ function – passing the connected database variable and some SQL code that tells us what data to pull. The end result will be a dataframe containing the data that the SQL code calls for.

SQL is a very accessible language that you should definitely spend some time getting a basic handle on. You can find a glossary of the language here.

For readability, let’s assign our SQL code to a variable called query. We will also use a very simple command to call for all of the data from a table called ‘Player’ in the database:

In [3]:
#The * refers to all available data
query = "SELECT * FROM Player"

players = pd.read_sql(query, conn)

players.head()
Out[3]:
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 2 155782 Aaron Cresswell 189615 1989-12-15 00:00:00 170.18 146
2 3 162549 Aaron Doran 186170 1991-05-13 00:00:00 170.18 163
3 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
4 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154

There you have it, a dataframe ready for your analysis from an SQLite database – great work!

As a next step, let’s use another example – this time developing our query a bit. We now, for whatever reason, want to only find players taller than Aaron Cresswell (170.18 cm). Use the ‘WHERE’ keyword to add a clause:

In [4]:
query2 = "SELECT * FROM Player WHERE height > 170.18"

players2 = pd.read_sql(query2, conn)

players2.head()
Out[4]:
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
2 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154
3 6 27316 Aaron Hunt 158138 1986-09-04 00:00:00 182.88 161
4 7 564793 Aaron Kuhl 221280 1996-01-30 00:00:00 172.72 146

Joining multiple tables in sqlite3 to one dataframe

Ramping up the complexity, we can also make calls to different tables in the dataframe and join them together into one dataframe.

In short, our logic will be to match the player_api_id seen above, with the same number in a different table. Where these IDs match, we then join the columns from each table together into one dataframe.

Therefore, our query will selectfrom one table – which we note as ‘a’. We will then join this with an inner join (check out other different join types here), and state which columns we are joining on.

Take your time to see how we express this in the query:

In [6]:
query3 = """SELECT * FROM Player_Attributes a
 INNER JOIN (SELECT player_name, birthday, player_api_id AS p_id, height, weight FROM Player)
 b ON a.player_api_id = b.p_id;"""

players3 = pd.read_sql(query3, conn)

players3.head()
Out[6]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_name birthday p_id height weight
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187

5 rows × 47 columns

Summary

As with lots of our other tutorial topics, Python takes a complex task and makes it achievable with both brief and accessible language. We have looked at how we initially connect to a sqlite database, how we define SQL queries to pull data from it and even how we can join data from multiple tables into one dataframe.

For next steps, look at how you can dive into the data with our analysis crash course, or learn how to pull Kaggle datasets like the one used above through the Kaggle API.

Posted by FCPythonADMIN in Extra Time

How much does it cost to fill the Panini World Cup album? Simulations in Python

With the World Cup just 3 months away, the best bit of the tournament build up is upon us – the Panini sticker album.

For those looking to invest in a completed album to pass onto grandchildren, just how much will you have to spend to complete it on your own? Assuming that each sticker has an equal chance of being found, this is a simple random number problem that we can recreate in Python.

This article will show you how to create a function that allows you to estimate how much you will need to spend, before you throw wads of cash at sticker boxes to end with a half-finished album. Load up pandas and numpy and let’s kick on.

In [1]:
import pandas as pd
import numpy as np

To solve this, we are going to recreate our sticker album. It will be an empty list that will take on the new stickers that we find in each pack.

We will also need a few variables to act as counters alongside this list:

  • Stickers needed
  • How many packets have we bought?
  • How many swaps do we have?

Let’s define these:

In [1]:
stickersNeeded = 682
packetsBought = 0
stickersGot = []
swapStickers = 0

Now, we need to run a simulation that will open packs, check each sticker and either add it to our album or to our swaps pile.

We will do this by running a while loop that completes once the album is full.

This loop will open a pack of 5 stickers and check whether or not it is featured in the album already. To simulate the sticker, we will simply assign it a random number within the album. If this number is already present, we add it to the swap pile. If it is a new sticker, we append it to our album list.

We will also need to update our counters for packets bought, stickers needed and swaps throughout.

Pretty simple process overall! Let’s take a look at how we implement this loop:

In [2]:
while stickersNeeded > 0:
    
        #Buy a new packet
        packetsBought += 1

        #For each sticker, do some things 
        for i in range(0,5):
            
            #Assign the sticker a random number
            stickerNumber = np.random.randint(0,681)
    
            #Check if we have the sticker
            if stickerNumber not in stickersGot:
                
                #Add it to the album, then reduce our stickers needed count
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            #Throw it into the swaps pile
            else:
                swapStickers += 1

Each time you run that, you are simulating the entire album completion process! Let’s check out the results:

In [3]:
{"Packets":packetsBought,"Swaps":swapStickers}
Out[3]:
{'Packets': 939, 'Swaps': 4013}

939 packets?! 4013 swaps?! Surely these must be outliers… let’s add all of this into one function and run it loads of times over.

As the number of stickers in a pack and the sticker total may change, let’s define these as arguments that we can change with future uses of the function:

In [4]:
def calculateAlbum(stickersInPack = 5, costOfPackp = 80, stickerTotal=682):
    stickersNeeded = stickerTotal
    packetsBought = 0
    stickersGot = []
    swapStickers = 0


    while stickersNeeded > 0:
        packetsBought += 1

        for i in range(0,stickersInPack):
            stickerNumber = np.random.randint(0,stickerTotal)

            if stickerNumber not in stickersGot:
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            else:
                swapStickers += 1

    return{"Packets":packetsBought,"Swaps":swapStickers,
           "Total Cost":(packetsBought*costOfPackp)/100}
In [5]:
calculateAlbum()
Out[5]:
{'Packets': 1017, 'Swaps': 4403, 'Total Cost': 813.6}

So our calculateAlbum function does exactly the same as our instructions before, we have just added a total cost.

Let’s run this 1000 times over and see what we can truly expect if we want to complete the album:

In [6]:
a=0
b=0
c=0

for i in range(0, 1000):
    a += calculateAlbum()["Packets"]
    b += calculateAlbum()["Swaps"]
    c += calculateAlbum()["Total Cost"]

{"Packets":a/1000,"Swaps":b/1000,"Total Cost":c/1000}
Out[6]:
{'Packets': 969.582, 'Swaps': 4197.515, 'Total Cost': 773.4824}

970 packets, over 4000 swaps and the best part of £800 on the album. I think we’re going to need some people to swap with!

Of course, as you run these arguments, you will have different answers throughout. Hopefully here, however, our numbers are quite close together.

Summary

In this article, we have seen a basic example of running simulations with random numbers to answer a question.

We followed the process of replicating the album experience and running it once, then 1000 times to get an average expectation. As with any process involving random numbers, you will get different answers each time, so through running it loads of times over, we get an average that should remove the effect of any outliers.

We also designed our simulations to take on different parameters such as number of stickers needed, stickers in a pack, etc. This allows us to use the same functions when World Cup 2022 has twice the number of stickers!

For more examples of random numbers and simulations, check out our expected goals tutorial.

Posted by FCPythonADMIN in Blog

Calculating ‘per 90’ with Python and Fantasy Football

When we are comparing data between players, it is very important that we standardise their data to ensure that each player has the same ‘opportunity’ to show their worth. The simplest way for us to do this, is to ensure that all players have the same amount of time within which to play. One popular way of doing this in football is to create ‘per 90’ values. This means that we will change our total amounts of goals, shots, etc. to show how many a player will do every 90 minutes of football that they play. This article will run through creating per 90 figures in Python by applying them to fantasy football points and data.

Follow the examples along below and feel free to use them where you are. Let’s get started by importing our modules and taking a look at our data set.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("../Data/Fantasy_Football.csv")
data.head()
Out[1]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_saved penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0 0.0 1 1
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 0 1 0 53 4 419 42.7 1 1
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0 0.0 1 1
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 0 3 0 0 14 421 62.5 2 1
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 0 2 77 15.7 2 1

5 rows × 26 columns

Our data has a host of data on our players’ fantasy football performance. We have their names, of course, and also their points and contributing factors (goals, clean sheets, etc.). Crucially, we have the players’ minutes played – allowing us to calculate their per 90 figures for the other variables.

Calculating our per 90 numbers is reasonably simple, we just need to find out how many 90 minute periods our player has played, then divide the variable by this value. The function below will show this step-by-step and show Kane’s goals p90 in the Premier League at the time of writing (goals = 20, minutes = 1868):

In [2]:
def p90_Calculator(variable_value, minutes_played):
    
    ninety_minute_periods = minutes_played/90
    
    p90_value = variable_value/ninety_minute_periods
    
    return p90_value

p90_Calculator(20, 1868)
Out[2]:
0.9635974304068522

There we go, Kane scores 0.96 goals per 90 in the Premier League! Our code, while explanatory is three lines long, when it can all be in one line. Let’s try again, and check that we get the same value:

In [3]:
def p90_Calculator(value, minutes):
    return value/(minutes/90)

p90_Calculator(20, 1868)
Out[3]:
0.9635974304068522

Great job! The code has the same result, in a third of the lines, and I still think it is fairly easy to understand.

Next up, we need to apply this to our dataset. Pandas makes this easy, as we can simply call a new column, and run our command with existing columns as arguments:

In [4]:
data["total_points_p90"] = p90_Calculator(data.total_points,data.minutes)
data.total_points_p90.fillna(0, inplace=True)
data.head()
Out[4]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 1 0 53 4 419 42.7 1 1 3.652174
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 3 0 0 14 421 62.5 2 1 4.288401
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 2 77 15.7 2 1 3.846154

5 rows × 27 columns

And there we have a total points per 90 column, which will hopefully offer some more insight than a simple points total. Let’s sort our values and view the top 5 players:

In [5]:
data.sort_values(by='total_points_p90', ascending =False).head()
Out[5]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
271 Tuanzebe 1 Axel Tuanzebe 38 39 0 1.7 1 1.0 0 0 0 0 0 3 0.0 2 12 90.0
322 Sims 20 Joshua Sims 39 43 0 0.1 1 1.0 0 0 0 0 0 3 0.0 3 14 90.0
394 Janssen 6 Vincent Janssen 9 74 0 0.1 1 1.0 0 0 0 0 0 2 0.0 4 17 90.0
166 Hefele 38 Michael Hefele 44 42 0 0.1 1 1.0 0 0 0 0 0 4 0.4 2 8 90.0
585 Silva 13 Adrien Sebastian Perruchet Silva 14 60 0 0.0 1 1.0 0 0 0 0 0 5 0.3 3 9 22.5

5 rows × 27 columns

Huh, probably not what we expected here… players with 1 point, and some surprisng names too. Upon further examination, these players suffer from their sample size. They’ve played very few minutes, so their numbers get overly inflated… there’s obviously no way a player gets that many points per 90!

Let’s set a minimum time played to our data to eliminate players without a big enough sample:

In [6]:
data.sort_values(by='total_points_p90', ascending =False)[data.minutes>400].head(10)[["web_name","total_points_p90"]]
Out[6]:
web_name total_points_p90
233 Salah 9.629408
279 Martial 8.927126
246 Sterling 8.378721
225 Coutinho 8.358882
325 Austin 8.003356
278 Lingard 7.951807
544 Niasse 7.460317
256 Agüero 7.346939
389 Son 7.288503
255 Bernardo Silva 7.119403

That seems a bit more like it! We’ve got some of the highest scoring players here, like Salah and Sterling, but if Austin, Lingard and Bernardo Silva can nail down long-term starting spots, we should certainly keep an eye on adding them in!

Let’s go back over this by creating a new column for goals per 90 and finding the top 10:

In [7]:
data["goals_p90"] = p90_Calculator(data.goals_scored,data.minutes)
data.goals_p90.fillna(0, inplace=True)
data.sort_values(by='goals_p90', ascending =False)[data.minutes>400].head(10)[["web_name","goals_p90"]]
Out[7]:
web_name goals_p90
233 Salah 0.968320
393 Kane 0.967222
325 Austin 0.906040
256 Agüero 0.823364
246 Sterling 0.797973
544 Niasse 0.793651
279 Martial 0.728745
258 Jesus 0.714995
278 Lingard 0.632530
160 Rooney 0.630252

Great job! Hopefully you can see that this is a much fairer way to rate our player data – whether for performance, fantasy football or media reporting purposes.

Summary

p90 data is a fundamental concept of football analytics. It is one of the first steps of cleaning our data and making it fit for comparisons. This article has shown how we can apply the concept quickly and easily to our data. For next steps, you might want to take a look at visualising this data, or looking at further analysis techniques.

Posted by FCPythonADMIN in Blog

Dataframes with Pandas

DataFrames power data analysis in Python – they allow us to use grids just like we would conventional spreadsheets. They give us labelled columns and rows, functions, filtering and many more tools to get the most insight and ease of use from our data.

This page will introduce the creation of data frames, a few functions and tools to make selections within them. Let’s import our packages to get started (I’m sure you’ve installed them by now!).

In [1]:
import pandas as pd
import numpy as np

Our table below has a scout report on 4 different players’ shooting, passing and defending skills.

In [2]:
PlayerList = ["Pagbo","Grazemen","Cantay","Ravane"]
SkillList=["Shooting","Passing","Defending"]

#For this example, we have a random number generator for our scout
#I wouldn't recommend this for an actual team
ScoresArray = np.random.randint(1,10,(4,3))

df = pd.DataFrame(data=ScoresArray, index=PlayerList, columns=SkillList)
df
Out[2]:
Shooting Passing Defending
Pagbo 6 7 3
Grazemen 7 3 3
Cantay 2 9 2
Ravane 2 8 5

In this example, dataFrame needs 3 arguments for a fully labelled dataframe – data is the values that make up the body, index goes along the y axis and is the name of each row. Finally, columns runs along the x axis to name the columns.

There are other ways to create dataFrames, but this will serve us perfectly for now.

You’ve come a long way from lists and individual values!

Selecting and indexing

You can probably guess how we select individual values and groups

In [3]:
#Square brackets for columns
#If the result looks familiar, that's because DataFrame columns are series!
df['Shooting']
Out[3]:
Pagbo       6
Grazemen    7
Cantay      2
Ravane      2
Name: Shooting, dtype: int32
In [4]:
#For rows, we use .loc if we use a name
#Turns out that DataFrame rows are also series!
df.loc['Pagbo']
Out[4]:
Shooting     6
Passing      7
Defending    3
Name: Pagbo, dtype: int32
In [5]:
#Or if we use a index number, .iloc
df.iloc[1:3]
Out[5]:
Shooting Passing Defending
Grazemen 7 3 3
Cantay 2 9 2

Creating and removing columns/rows

DataFrames make it really easy for us to be flexible with our datasets. Let’s ask our scout for their thoughts on more players and skills.

In [6]:
#Scout, what about their communication?

df['Communication'] = np.random.randint(1,10,4)
df
Out[6]:
Shooting Passing Defending Communication
Pagbo 6 7 3 5
Grazemen 7 3 3 2
Cantay 2 9 2 3
Ravane 2 8 5 5

To add a new column, we refer to a new column with square brackets, give it a new name then fill it with a series. Remember, our scout uses random numbers as scouting scores.

Our new manager doesn’t care about defending – they want these scores removed from report. The ‘.drop’ method makes this easy:

In [7]:
#axis=1 refers to columns
df = df.drop('Defending',axis=1)
df
Out[7]:
Shooting Passing Communication
Pagbo 6 7 5
Grazemen 7 3 2
Cantay 2 9 3
Ravane 2 8 5

Great job adding and removing columns!

Turns out, though, that our scout didn’t do their homework on the players’ transfer fees. Grazemen is far too expensive and we need to swap him for another player – Mogez.

For rows, we use ‘.drop’ with the axis argument set to 0:

In [8]:
df = df.drop('Grazemen',axis=0)
df
Out[8]:
Shooting Passing Communication
Pagbo 6 7 5
Cantay 2 9 3
Ravane 2 8 5

And to add a new one, we refer to our new row with ‘.loc’ (just like we did to refer to an existing one earlier). We then give this new row the list or series of values. Once again, we just use random numbers to fill the set here.

In [9]:
df.loc['Gomez'] = np.random.randint(1,10,3)
df
Out[9]:
Shooting Passing Communication
Pagbo 6 7 5
Cantay 2 9 3
Ravane 2 8 5
Gomez 6 2 3

Conditonal Selection

In our series, we used a true or false condition to select the data that we wanted to see. We use the exact same logic here. Let’s see where players’ attributes are above 5:

In [10]:
df>5
Out[10]:
Shooting Passing Communication
Pagbo True True False
Cantay False True False
Ravane False True False
Gomez True False False

Cool, we can see which players meet our criteria. You’ll notice that this returns a DataFrame of true or false values. Just like with series, we can use these booleans to return a DataFrame according to our criteria.

Let’s apply this to a column (which we already know is just a series):

In [11]:
df['Shooting']>5
Out[11]:
Pagbo      True
Cantay    False
Ravane    False
Gomez      True
Name: Shooting, dtype: bool

As expected, we have a series of boolean values. If we use square brackets to select our dataframe using these, we will just get the filtered DataFrame.

Therefore, if the coach is asking for players with great shooting skills, we can easily filter our DataFrame.

In [12]:
df[df['Shooting']>5]
Out[12]:
Shooting Passing Communication
Pagbo 6 7 5
Gomez 6 2 3

Summary

Great job getting to understand DataFrames – the tool that will underpin our analysis moving forward.

You have created them, added new rows and columns, then filtered them according to your criteria.

There is lots more to learn about DataFrames, so read around more articles on the topic and learn what you can from other articles that use them.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Dealing with Missing Data

At some point in your dealings with data, you will have to deal with missing values.

Depending on the situation, you might need to fill these gaps, or delete rows/columns entirely. Luckily Pandas makes this easy for us.

Let’s get our modules and dataset prepared, before we look to delete series or fill gaps.

In [1]:
import numpy as np
import pandas as pd

#DataFrame is the contract details for our transfer targets, where known.
#'np.nan' is a numpy value that shows that there is not a number.
# In this case, it demonstrates missing data.

df = pd.DataFrame({'Wage':[150000,123000,np.nan],
'GoalBonus':[4000,np.nan,np.nan],
'ImageRights':[50000,70000,100000]}, 
index=['Konda','Makho','Grey'],
columns=['Wage','GoalBonus','ImageRights'])

df
Out[1]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 NaN 70000
Grey NaN NaN 100000

So we know everything about Konda, 2/3 values for Makho and just Grey’s image rights. Let’s look at how we can tidy this up.

Removing rows & columns with missing data

If you decide to bin the players with missing data, it is simple with the ‘.dropna()’ method:

In [2]:
df.dropna()
Out[2]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000

So we are only left with the player with all values. What if you want to do the same for columns?

In [3]:
df.dropna(axis=1)
Out[3]:
ImageRights
Konda 50000
Makho 70000
Grey 100000

And now we have only the column with complete data – ImageRights. You’ll see that we pass the argument ‘axis=1’ to do look at columns, not rows.

‘.dropna()’ can also take the argument ‘thresh’ to change the amount of missing values you’re happy to deal with. Makho has only 1 missing value, whereas Grey has 2. Below, we’ll allow Makho into our dataset, but continue to exclude Grey:

In [4]:
df.dropna(thresh=2)
Out[4]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 NaN 70000

Fill data

Sometimes, deleting rows and columns is a bit drastic. You may instead want to simply fill in the gaps instead. Rather than ‘.dropna()’, we can instead ‘.fillna()’, passing the desired value as the argument.

In [5]:
df.fillna(value=0)
Out[5]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 0.0 70000
Grey 0.0 0.0 100000

That’s a lot cleaner if we want to perform other opearations, great work!

You might want to be a bit smarter than filling with 0s. As an example, you might want to take a column and use the average to fill the rest of the gaps:

In [6]:
df['Wage'].fillna(value=df['Wage'].mean())
Out[6]:
Konda 150000.0
Makho 123000.0
Grey 136500.0
Name: Wage, dtype: float64

Summary

Dealing with datasets can often be unwieldy and messy. Quite commonly, this is caused by missing values. Pandas helps us greatly with ‘.fillna()’ and ‘.dropna()’ and their ability to remove or change missing values with ease.

If you are having issues with mismatching or datasets split across different files, you might want to check out the joining & merging data article.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Grouping Data

With DataFrames giving us the opportunity to store huge grids of data, we will sometimes want to group particular parts of our data set for analysis. Pandas’ ‘groupby’ method gives us an easy way to do so.

Let’s put together a DataFrame of a team’s recent results with a dictionary:

In [1]:
import pandas as pd

#data is in a dictionary, each entry will be a column
#The first part of the entry is the column name, the second the values
data = {'Opponent':
        ["Atletico Jave","Newtown FC", 
         "Bunton Town", "Fentborough Dynamo"],
       'Location':
        ["Home","Away","Away","Home"],
        'GoalsFor':
        [2,4,3,0],
        'GoalsAgainst':
        [4,0,2,2]}

Matches = pd.DataFrame(data)
Matches
Out[1]:
GoalsAgainst GoalsFor Location Opponent
0 4 2 Home Atletico Jave
1 0 4 Away Newtown FC
2 2 3 Away Bunton Town
3 2 0 Home Fentborough Dynamo

An obvious way to group this data is by home and away matches.

Let’s use the ‘.groupby()’ method to do so. We just have to provide the column that we want to group by. In this case, location.

We’ll assign that to a variable, then call ‘.mean()’ to find the average.

In [2]:
HAMatches = Matches.groupby('Location')
HAMatches.mean()
Out[2]:
GoalsAgainst GoalsFor
Location
Away 1.0 3.5
Home 3.0 1.0

Or cut out the variable and chain the ‘.mean()’ onto the end. Or chain another method:

In [3]:
Matches.groupby('Location').mean()
Out[3]:
GoalsAgainst GoalsFor
Location
Away 1.0 3.5
Home 3.0 1.0
In [4]:
#Describes the dataset for each variable within - this is awesome!
Matches.groupby('Location').describe()
Out[4]:
GoalsAgainst GoalsFor
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Location
Away 2.0 1.0 1.414214 0.0 0.5 1.0 1.5 2.0 2.0 3.5 0.707107 3.0 3.25 3.5 3.75 4.0
Home 2.0 3.0 1.414214 2.0 2.5 3.0 3.5 4.0 2.0 1.0 1.414214 0.0 0.50 1.0 1.50 2.0
In [5]:
#Let's step up the chaining...
#'Groupby' location, then describe it to me...
#Then 'transpose' it (flip it onto its side)...
#Finally, just give me 'Away' data
Matches.groupby('Location').describe().transpose()['Away']
Out[5]:
GoalsAgainst  count    2.000000
              mean     1.000000
              std      1.414214
              min      0.000000
              25%      0.500000
              50%      1.000000
              75%      1.500000
              max      2.000000
GoalsFor      count    2.000000
              mean     3.500000
              std      0.707107
              min      3.000000
              25%      3.250000
              50%      3.500000
              75%      3.750000
              max      4.000000
Name: Away, dtype: float64
In [6]:
print("All that work done in just " + 
      str(len("Matches.groupby('Location').describe().transpose()['Away']"))
     + " characters!")
All that work done in just 58 characters!

Summary

It is staggering how easily we can not only group data, but to also use Pandas to get some insight into our data. Really good job following this far.

We learned how to use ‘groupby()’ to group by location – home or away. We then used methods to describe our data, find averages and even to change the shape of our data frames. Really impressive stuff!

Next up, you might want to take a look at how we can join dataFrames together, how to deal with missing values or how to use even more operations.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Joining Data

Incomplete data sets, whether they are missing individual values or full rows and columns, are a common problem in data analysis. Luckily for us, Pandas has lots of tools to help us make these data sets easier to handle.

In this article, we will run through 2 of Pandas’ main methods for bundling dataFrames together – concatenating and merging.

Let’s set ourselves up with three one-row dataFrames, with stats from our recent matches.

In [1]:
import numpy as np
import pandas as pd

match1 = pd.DataFrame({'Opponent':['Selche FC'],
                      'GoalsFor':[1],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match2 = pd.DataFrame({'Opponent':['Sudaton FC'],
                      'GoalsFor':[3],
                       'GoalsAgainst':[0],
                       'Attendance':[53256]})

match3 = pd.DataFrame({'Opponent':['Ouestjambon United'],
                      'GoalsFor':[4],
                       'GoalsAgainst':[1],
                       'Attendance':[53225]})

match3
Out[1]:
Attendance GoalsAgainst GoalsFor Opponent
0 53225 1 4 Ouestjambon United

Concatenation

Our simplest method of joining data is to simply stick one on the ned of the other – as the concatenate method allows us to do. ‘pd.concat()’ with a list of our data frames will do the trick:

In [2]:
AllMatches = pd.concat([match1,match2,match3])
AllMatches
Out[2]:
Attendance GoalsAgainst GoalsFor Opponent
0 53225 1 1 Selche FC
0 53256 0 3 Sudaton FC
0 53225 1 4 Ouestjambon United

Merging

‘pd.merge()’ will allow us to stick data together left-to-right. Let’s first create more details for our matches above that we can then merge then:

In [3]:
match1scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Billy'],
                            'Opponent':['Selche FC']})
                      
match2scorers = pd.DataFrame({'First':['Sally'],
                              'Last':['Pip'],
                             'Opponent':['Sudaton FC']})

match3scorers = pd.DataFrame({'First':['Sally'],
                             'Last':['Sally'],
                             'Opponent':['Ouestjambon United']})

AllScorers = pd.concat([match1scorers,match2scorers,match3scorers]) 
AllScorers
Out[3]:
First Last Opponent
0 Sally Billy Selche FC
0 Sally Pip Sudaton FC
0 Sally Sally Ouestjambon United
In [4]:
pd.merge(AllMatches, AllScorers, how='inner',on='Opponent')
Out[4]:
Attendance GoalsAgainst GoalsFor Opponent First Last
0 53225 1 1 Selche FC Sally Billy
1 53256 0 3 Sudaton FC Sally Pip
2 53225 1 4 Ouestjambon United Sally Sally

Let’s break down ‘pd.merge()’. Anyone with any SQL experience will have a head start here, as this essentially mimicks merging in SQL.

‘pd.merge()’ takes four arguments. The first two are the dataFrames that we want to merge.

Next, we have the ‘how’ argument, which dictates the type of join that we need. ‘Inner’ is the simplest, and we use that in this example but you should read up on the other types.

Finally is ‘on’, which is the column that we build the dataFrame around. Pandas looks for this value in all merged dataFrames, then adds the other values from the matching rows.

Summary

With merge and concatenate, we have two examples of Pandas tools used to join datasets.

Concatenation allows us to stitch together two datasets with the same columns. Useful if we are adding to an already-existing dataset or glueing together two iterations of data collection. It is much quicker than opening Excel and pasting the data manually!

We have also seen an example of merge, which connects two datasets across a column that is common between the two dataFrames.

These examples are relatively simple and only serve as an introduction to joining data. For more complex cases, check out resources on pandas.join() and more in-depth examples of pandas.merge()

Posted by FCPythonADMIN in Data Analysis, 0 comments

Describing Datasets

Pandas is not only a fantastic module and community around manipulating our datasets, it also gives tools for analysing and describing our data.

This article will take you through just a few of the methods that we have to describe our dataset. Let’s get started by firing up a season-long dataset of referees and their cards given in each game last season.

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("../data/Refs.csv")
In [2]:
#Use '.head()' to see the top rows and check out the structure

df.head()
Out[2]:
Date HomeTeam AwayTeam Referee HF AF TF HY AY TY HR AR TR
0 13/08/2016 Burnley Swansea J Moss 10 14 24 3 2 5 0 0 0
1 13/08/2016 Crystal Palace West Brom C Pawson 12 15 27 2 2 4 0 0 0
2 13/08/2016 Everton Tottenham M Atkinson 10 14 24 0 0 0 0 0 0
3 13/08/2016 Hull Leicester M Dean 8 17 25 2 2 4 0 0 0
4 13/08/2016 Man City Sunderland R Madley 11 14 25 1 2 3 0 0 0
In [3]:
#Let's change those shorthand column titles to something more intuitive

df.columns = ['Date','HomeTeam','AwayTeam',
                 'Referee','HomeFouls','AwayFouls',
                 'TotalFouls','HomeYellows','AwayYellows',
                'TotalYellows', 'HomeReds','AwayReds','TotalReds']

df.head(2)
Out[3]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
0 13/08/2016 Burnley Swansea J Moss 10 14 24 3 2 5 0 0 0
1 13/08/2016 Crystal Palace West Brom C Pawson 12 15 27 2 2 4 0 0 0
In [4]:
#And do we have a complete set of 380 matches? len() will tell us.

len(df)
Out[4]:
380

Descriptive statistics

The easiest way to produce an en-masse summary of our dataset is with the ‘.describe()’ method.

This will give us a whole new table of statistics for each numerical column:

  • Count – how many values are there?
  • Mean – what is the mean average? (Sum of values/count of values)
  • STD – what is the standard deviation? This number describes how widely the group differs around the average. If we have a normal distribution, 68% of our values will be within one STD either side of the average.
  • Min – the smallest value in our array
  • 25%/50%/75% – what value accounts for 25%/50%/75% of the data?
  • Max – the highest value in our array
In [5]:
df.describe()
Out[5]:
HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
count 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000 380.000000
mean 11.063158 11.657895 22.721053 1.744737 1.886842 3.631579 0.055263 0.052632 0.107895
std 3.468853 3.556378 5.176166 1.256371 1.301647 1.969705 0.240050 0.223591 0.319037
min 2.000000 3.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 9.000000 9.000000 19.000000 1.000000 1.000000 2.000000 0.000000 0.000000 0.000000
50% 11.000000 12.000000 23.000000 2.000000 2.000000 4.000000 0.000000 0.000000 0.000000
75% 13.000000 14.000000 25.000000 3.000000 3.000000 5.000000 0.000000 0.000000 0.000000
max 24.000000 24.000000 38.000000 6.000000 6.000000 11.000000 2.000000 1.000000 2.000000

So what do we learn? On average, we have between 3 and 4 yellows in a game and that the away team are only slightly more likely to get more cards. Fouls and red cards are also very close between both teams.

In 68% of games, we expect between 1.5 and 5.5 yellow cards.

At least one game had 38 fouls. That’s roughly one every two and a half minutes!

Describing with groups

Our describe table above is great for a broad brushstroke, but it would be helpful to look at our referees individually. Let’s use .groupby() to create a dataset grouped by the ‘Referee’ column

In [6]:
groupedRefs = df.groupby("Referee")

We can now apply some operations to check out our data by referee:

In [7]:
#All averavges

groupedRefs.describe()
Out[7]:
AwayFouls AwayReds TotalReds TotalYellows
count mean std min 25% 50% 75% max count mean 75% max count mean std min 25% 50% 75% max
Referee
A Marriner 28.0 9.857143 2.977873 5.0 8.00 10.0 12.00 15.0 28.0 0.071429 0.00 1.0 28.0 3.464286 2.081348 0.0 1.75 4.0 5.00 7.0
A Taylor 30.0 12.333333 3.066317 8.0 10.00 12.0 14.75 20.0 30.0 0.000000 0.00 2.0 30.0 3.966667 1.973677 0.0 3.00 4.0 5.00 8.0
C Kavanagh 1.0 12.000000 NaN 12.0 12.00 12.0 12.00 12.0 1.0 0.000000 0.00 0.0 1.0 4.000000 NaN 4.0 4.00 4.0 4.00 4.0
C Pawson 24.0 12.833333 4.380457 3.0 9.75 12.0 15.00 21.0 24.0 0.125000 0.00 1.0 24.0 3.916667 2.339020 0.0 2.75 3.5 5.00 11.0
G Scott 8.0 11.750000 3.011881 8.0 8.75 12.5 14.25 15.0 8.0 0.250000 0.25 1.0 8.0 2.625000 1.187735 1.0 2.00 2.5 3.00 5.0
J Moss 30.0 11.733333 3.805018 5.0 9.00 11.0 14.00 23.0 30.0 0.000000 0.00 1.0 30.0 3.966667 2.008316 1.0 2.25 4.0 5.00 9.0
K Friend 20.0 12.750000 4.050666 7.0 10.75 11.5 14.25 24.0 20.0 0.050000 0.00 1.0 20.0 4.700000 2.154555 1.0 3.00 4.5 6.00 9.0
L Mason 19.0 10.842105 3.041622 5.0 8.50 11.0 13.00 16.0 19.0 0.052632 0.00 1.0 19.0 3.526316 2.412201 0.0 2.00 3.0 5.50 8.0
L Probert 4.0 8.000000 3.559026 3.0 6.75 9.0 10.25 11.0 4.0 0.000000 0.00 0.0 4.0 3.000000 2.449490 1.0 1.00 2.5 4.50 6.0
M Atkinson 27.0 11.851852 3.404790 4.0 9.00 13.0 14.00 18.0 27.0 0.111111 0.00 1.0 27.0 3.222222 2.241794 0.0 1.50 3.0 4.50 8.0
M Clattenburg 22.0 9.954545 3.415333 4.0 8.00 9.5 11.75 17.0 22.0 0.045455 0.00 1.0 22.0 3.272727 1.608958 0.0 2.00 3.0 4.00 6.0
M Dean 30.0 12.100000 2.733572 7.0 10.00 12.0 13.00 17.0 30.0 0.100000 0.00 1.0 30.0 3.666667 1.768173 1.0 2.00 3.0 5.00 7.0
M Jones 18.0 12.333333 3.447079 4.0 9.75 13.0 14.75 17.0 18.0 0.000000 0.00 1.0 18.0 3.388889 2.173067 0.0 2.00 4.0 4.75 8.0
M Oliver 33.0 10.787879 3.434727 5.0 8.00 10.0 14.00 19.0 33.0 0.030303 0.00 1.0 33.0 3.333333 1.947220 0.0 2.00 3.0 5.00 7.0
N Swarbrick 24.0 11.708333 3.263489 6.0 9.75 11.0 14.00 18.0 24.0 0.041667 0.00 1.0 24.0 3.708333 1.398109 1.0 3.00 4.0 5.00 6.0
P Tierney 10.0 12.200000 4.211096 3.0 11.00 14.0 14.75 16.0 10.0 0.100000 0.00 1.0 10.0 2.900000 1.286684 1.0 2.00 3.0 3.00 5.0
R East 11.0 12.090909 2.879394 10.0 10.00 11.0 13.00 19.0 11.0 0.090909 0.00 1.0 11.0 3.272727 1.902152 0.0 2.00 3.0 5.00 6.0
R Madley 30.0 13.433333 3.720246 4.0 11.25 14.0 15.75 20.0 30.0 0.000000 0.00 1.0 30.0 4.133333 1.814374 1.0 3.00 4.0 5.75 8.0
S Attwell 10.0 9.600000 3.835507 5.0 6.25 9.5 12.75 16.0 10.0 0.000000 0.00 0.0 10.0 3.000000 1.943651 0.0 2.00 2.5 3.75 7.0
l Mason 1.0 10.000000 NaN 10.0 10.00 10.0 10.00 10.0 1.0 0.000000 0.00 0.0 1.0 5.000000 NaN 5.0 5.00 5.0 5.00 5.0

20 rows × 72 columns

There is plenty going on here, so while you may want to look through everything yourself, you can also select particular columns:

In [8]:
#Let's analyse yellow cards

groupedRefs.describe()['TotalYellows']
Out[8]:
count mean std min 25% 50% 75% max
Referee
A Marriner 28.0 3.464286 2.081348 0.0 1.75 4.0 5.00 7.0
A Taylor 30.0 3.966667 1.973677 0.0 3.00 4.0 5.00 8.0
C Kavanagh 1.0 4.000000 NaN 4.0 4.00 4.0 4.00 4.0
C Pawson 24.0 3.916667 2.339020 0.0 2.75 3.5 5.00 11.0
G Scott 8.0 2.625000 1.187735 1.0 2.00 2.5 3.00 5.0
J Moss 30.0 3.966667 2.008316 1.0 2.25 4.0 5.00 9.0
K Friend 20.0 4.700000 2.154555 1.0 3.00 4.5 6.00 9.0
L Mason 19.0 3.526316 2.412201 0.0 2.00 3.0 5.50 8.0
L Probert 4.0 3.000000 2.449490 1.0 1.00 2.5 4.50 6.0
M Atkinson 27.0 3.222222 2.241794 0.0 1.50 3.0 4.50 8.0
M Clattenburg 22.0 3.272727 1.608958 0.0 2.00 3.0 4.00 6.0
M Dean 30.0 3.666667 1.768173 1.0 2.00 3.0 5.00 7.0
M Jones 18.0 3.388889 2.173067 0.0 2.00 4.0 4.75 8.0
M Oliver 33.0 3.333333 1.947220 0.0 2.00 3.0 5.00 7.0
N Swarbrick 24.0 3.708333 1.398109 1.0 3.00 4.0 5.00 6.0
P Tierney 10.0 2.900000 1.286684 1.0 2.00 3.0 3.00 5.0
R East 11.0 3.272727 1.902152 0.0 2.00 3.0 5.00 6.0
R Madley 30.0 4.133333 1.814374 1.0 3.00 4.0 5.75 8.0
S Attwell 10.0 3.000000 1.943651 0.0 2.00 2.5 3.75 7.0
l Mason 1.0 5.000000 NaN 5.0 5.00 5.0 5.00 5.0

So Mason gives the highest on average, but only officiates one game. Of our most utilised officials, Friend is the most likely to have given a booking, with 4.7 per game. Pawson, however, had the busiest game, with 11 yellows in a single match. If you’re interested in which game it was, check below.

In [9]:
df[df['TotalYellows']==11]
Out[9]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
171 26/12/2016 Burnley Middlesbrough C Pawson 17 18 35 6 5 11 0 0 0

Who would have thought Burnley v Middlesbrough would’ve been a dirty game?! No festive spirit in this Boxing Day scrap, either way.

This game wasn’t even the one with the most fouls…

In [10]:
df[df['TotalFouls']==38]
Out[10]:
Date HomeTeam AwayTeam Referee HomeFouls AwayFouls TotalFouls HomeYellows AwayYellows TotalYellows HomeReds AwayReds TotalReds
129 27/11/2016 Watford Stoke R Madley 24 14 38 3 0 3 1 0 1

38 fouls, 3 yellows and a red between Mazzarri’s Watford and Stoke. Probably not a classic match! (It finished 1-0 to Stoke if you’re curious).

Summary

In this section, we have seen how using the ‘.describe()’ function makes getting summary statistics for a dataset really easy.

We were able to get results about our data in general, but then get more detailed insights by using ‘.groupby()’ to group our data by referee.

You might want to take a look at our visualisation topics to see how we can put data into charts, or see even more Pandas methods in this section.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Pandas Operations

Let’s take a look at what else pandas can do with our datasets with a few examples of old and new operations.

In [1]:
#Import packages, load csv of data and show the top rows with '.head()'
import pandas as pd
import numpy as np

df = pd.read_csv('data/Results.csv')
df.head()
Out[1]:
Date HomeTeam AwayTeam FTHG FTAG Referee
0 13/08/2016 Burnley Swansea 0 1 J Moss
1 13/08/2016 Crystal Palace West Brom 0 1 C Pawson
2 13/08/2016 Everton Tottenham 1 1 M Atkinson
3 13/08/2016 Hull Leicester 2 1 M Dean
4 13/08/2016 Man City Sunderland 2 1 R Madley

Cool, we have some football results! Let’s look what teams are in the set with .unique() – this gives us the unique values of a column.

In [2]:
print(df['HomeTeam'].unique())

print(str(len(df['HomeTeam'].unique())) + " unique teams!")
['Burnley' 'Crystal Palace' 'Everton' 'Hull' 'Man City' 'Middlesbrough'
 'Southampton' 'Arsenal' 'Bournemouth' 'Chelsea' 'Man United' 'Leicester'
 'Stoke' 'Swansea' 'Tottenham' 'Watford' 'West Brom' 'Sunderland'
 'West Ham' 'Liverpool']
20 unique teams!

20 teams – each Premier League team from that season. A full dataset will have 380 games (20 teams play 19 home games) – let’s test this:

In [3]:
len(df) == (20*19)
Out[3]:
True

Cool – a whole season of results!

What data do we have in the columns?

In [4]:
df.columns
Out[4]:
Index(['Date', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'Referee'], dtype='object')

Date, teams, full time home and away goals and the referee.

Let’s use a new operation – del – to permanantly delete the referees list. This article isn’t interested in trying to find criticism of the refs!

In [5]:
del df['Referee']

Functions with dataFrames

Pandas also allows us to easily apply functions and sums to dataFrames and the series that they are made of.

Below, we will create two new columns:

1) Result – Home score minus away score.

2) ResultText – Strings saying whether home or away won, or if the match was tied

In [6]:
#Series can do lots of sums for us very quickly

df['Result'] = df['FTHG'] - df['FTAG']

#Define a new function that calculates the winner from the above number
def findWinner(value):
    #1 or higher means the home team won
    if (value>0):
        return("Home Win")
    #0 means a tie
    elif (value==0):
        return("Draw")
    #Otherwise, the away team must have won
    else:
        return("Away Win")
    
df['ResultText'] = df['Result'].apply(findWinner)
df.head(3)
Out[6]:
Date HomeTeam AwayTeam FTHG FTAG Result ResultText
0 13/08/2016 Burnley Swansea 0 1 -1 Away Win
1 13/08/2016 Crystal Palace West Brom 0 1 -1 Away Win
2 13/08/2016 Everton Tottenham 1 1 0 Draw

Another application would be to see if more goals are scored by the home or away team. I’m sure you know the answer, but let’s check the averages:

In [7]:
print(df['FTHG'].mean())
print(df['FTAG'].mean())
1.59736842105
1.20263157895

As a broad rule for the season, the home team should expect a 0.4 goal advantage – great analysis!

Let’s do some more. What is the average for home and away goals during home and away wins?

In [8]:
df.groupby('ResultText').mean()
Out[8]:
FTHG FTAG Result
ResultText
Away Win 0.568807 2.504587 -1.935780
Draw 0.928571 0.928571 0.000000
Home Win 2.497326 0.566845 1.930481

Even if you win away, the home team will still score more often than not. So you’ll need to score at least 2 most of the time. Same for home wins too, although teams winning at home score slightly less than teams winning away.

Another question – I’m a fan who loves to see goals, which team should I check out?

In [9]:
#Create a total goals field by adding up home & away.
df['TotalGoals'] = df['FTHG'] + df['FTAG']

#Group dataFrame by home team and look at the mean total goals.
#Then sort in descending order
df.groupby('HomeTeam').mean()['TotalGoals'].sort_values(ascending=False)
Out[9]:
HomeTeam
Chelsea           3.789474
Bournemouth       3.368421
Liverpool         3.315789
Hull              3.315789
Swansea           3.210526
Everton           3.052632
Tottenham         2.947368
Leicester         2.947368
Arsenal           2.894737
Man City          2.842105
Watford           2.842105
Sunderland        2.631579
West Ham          2.631579
West Brom         2.578947
Crystal Palace    2.578947
Stoke             2.526316
Burnley           2.421053
Middlesbrough     2.105263
Man United        2.000000
Southampton       2.000000
Name: TotalGoals, dtype: float64

Looks like we should have watched Chelsea at home this season – nearly 4 goals a game. Massive yawns at Old Trafford, however!

Let’s check out the teams when they play away.

In [10]:
df.groupby('AwayTeam').mean()['TotalGoals'].sort_values(ascending=False)
Out[10]:
AwayTeam
Arsenal           3.473684
Man City          3.421053
Crystal Palace    3.368421
West Ham          3.210526
Bournemouth       3.052632
Liverpool         3.000000
Tottenham         2.947368
Leicester         2.894737
Hull              2.842105
Swansea           2.842105
Watford           2.842105
Southampton       2.684211
Stoke             2.578947
Everton           2.526316
Sunderland        2.526316
Burnley           2.526316
Chelsea           2.421053
Man United        2.368421
West Brom         2.368421
Middlesbrough     2.105263
Name: TotalGoals, dtype: float64

Now Chelsea look pretty boring away from home. Arsenal, City or Palace will get our TV time if they’re playing away.

Your analysis is helping our fan to solve a real-life problem and they will hopefully make better decisions on what to watch – impressive stuff!

Summary

This section takes you through a few new Pandas operations, but they really are the tip of the iceberg. You’ll learn so many more as you read on here and elsewhere.

This page took you through deleting unneeded columns, applying our own functions and sums to create new columns and then used these to solve a problem for a real life fan.

Continue to learn more dataFrame operations and your analysis toolkit will grow exponentially. Now that we have some comfort with dataFrames, you can dive deeper into more complex mathematical applications or even onto visualising and communicating our insights.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Series

If you have just taken a look at NumPy’s arrays, then Pandas’ series will be really easy to pick up.

The key difference between these two data types is that series allow us to label our axes, making our grids a lot easier to read, index and utilise.

Let’s fire up NumPy and Pandas and create some series. Remember to install these modules if you haven’t already.

In [1]:
import numpy as np
import pandas as pd
In [2]:
Capacity = pd.Series(data=[60432,55097,39460])
Capacity
Out[2]:
0    60432
1    55097
2    39460
dtype: int64

So there we have our first series, created from a list of [100,200,300]. You’ll notice that this looks quite different from our previous lists and arrays because we have an index running alongside it.

What is really cool about series, is that they allow us to change these index labels:

In [3]:
Capacity = pd.Series(data=[60432,55097,39460],
                     index=["Emirates Stadium","Etihad Stadium","Elland Road"])
Capacity
Out[3]:
Emirates Stadium    60432
Etihad Stadium      55097
Elland Road         39460
dtype: int64

Passing an index argument changes the index labels – our data is now so much easier to read when we need to. Easier to select, too:

In [4]:
Capacity["Elland Road"]
Out[4]:
39460

In this example, our stadium capacities and labels were in two separate lists. We can do the same thing with a dictionary:

In [5]:
CapacityDict = {'Ewood Park':31367,
                'Liberty Stadium':20937,
                'Portman Road':30311}

Capacity = pd.Series(CapacityDict)
Capacity
Out[5]:
Ewood Park         31367
Liberty Stadium    20937
Portman Road       30311
dtype: int64

Summary

Told you series would be easy to understand. A simple concept, but one that makes our data a bit more comfortable to use – we can now understand data by labels, not just index numbers.

Pandas’ data frame builds on this further to create labelled grids. Once we understand these we can really get started with data analysis in Python.

Posted by FCPythonADMIN in Data Analysis, 0 comments