Numpy

How much does it cost to fill the Panini World Cup album? Simulations in Python

With the World Cup just 3 months away, the best bit of the tournament build up is upon us – the Panini sticker album.

For those looking to invest in a completed album to pass onto grandchildren, just how much will you have to spend to complete it on your own? Assuming that each sticker has an equal chance of being found, this is a simple random number problem that we can recreate in Python.

This article will show you how to create a function that allows you to estimate how much you will need to spend, before you throw wads of cash at sticker boxes to end with a half-finished album. Load up pandas and numpy and let’s kick on.

In [1]:
import pandas as pd
import numpy as np

To solve this, we are going to recreate our sticker album. It will be an empty list that will take on the new stickers that we find in each pack.

We will also need a few variables to act as counters alongside this list:

  • Stickers needed
  • How many packets have we bought?
  • How many swaps do we have?

Let’s define these:

In [1]:
stickersNeeded = 682
packetsBought = 0
stickersGot = []
swapStickers = 0

Now, we need to run a simulation that will open packs, check each sticker and either add it to our album or to our swaps pile.

We will do this by running a while loop that completes once the album is full.

This loop will open a pack of 5 stickers and check whether or not it is featured in the album already. To simulate the sticker, we will simply assign it a random number within the album. If this number is already present, we add it to the swap pile. If it is a new sticker, we append it to our album list.

We will also need to update our counters for packets bought, stickers needed and swaps throughout.

Pretty simple process overall! Let’s take a look at how we implement this loop:

In [2]:
while stickersNeeded > 0:
    
        #Buy a new packet
        packetsBought += 1

        #For each sticker, do some things 
        for i in range(0,5):
            
            #Assign the sticker a random number
            stickerNumber = np.random.randint(0,681)
    
            #Check if we have the sticker
            if stickerNumber not in stickersGot:
                
                #Add it to the album, then reduce our stickers needed count
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            #Throw it into the swaps pile
            else:
                swapStickers += 1

Each time you run that, you are simulating the entire album completion process! Let’s check out the results:

In [3]:
{"Packets":packetsBought,"Swaps":swapStickers}
Out[3]:
{'Packets': 939, 'Swaps': 4013}

939 packets?! 4013 swaps?! Surely these must be outliers… let’s add all of this into one function and run it loads of times over.

As the number of stickers in a pack and the sticker total may change, let’s define these as arguments that we can change with future uses of the function:

In [4]:
def calculateAlbum(stickersInPack = 5, costOfPackp = 80, stickerTotal=682):
    stickersNeeded = stickerTotal
    packetsBought = 0
    stickersGot = []
    swapStickers = 0


    while stickersNeeded > 0:
        packetsBought += 1

        for i in range(0,stickersInPack):
            stickerNumber = np.random.randint(0,stickerTotal)

            if stickerNumber not in stickersGot:
                stickersGot.append(stickerNumber)
                stickersNeeded -= 1

            else:
                swapStickers += 1

    return{"Packets":packetsBought,"Swaps":swapStickers,
           "Total Cost":(packetsBought*costOfPackp)/100}
In [5]:
calculateAlbum()
Out[5]:
{'Packets': 1017, 'Swaps': 4403, 'Total Cost': 813.6}

So our calculateAlbum function does exactly the same as our instructions before, we have just added a total cost.

Let’s run this 1000 times over and see what we can truly expect if we want to complete the album:

In [6]:
a=0
b=0
c=0

for i in range(0, 1000):
    a += calculateAlbum()["Packets"]
    b += calculateAlbum()["Swaps"]
    c += calculateAlbum()["Total Cost"]

{"Packets":a/1000,"Swaps":b/1000,"Total Cost":c/1000}
Out[6]:
{'Packets': 969.582, 'Swaps': 4197.515, 'Total Cost': 773.4824}

970 packets, over 4000 swaps and the best part of £800 on the album. I think we’re going to need some people to swap with!

Of course, as you run these arguments, you will have different answers throughout. Hopefully here, however, our numbers are quite close together.

Summary

In this article, we have seen a basic example of running simulations with random numbers to answer a question.

We followed the process of replicating the album experience and running it once, then 1000 times to get an average expectation. As with any process involving random numbers, you will get different answers each time, so through running it loads of times over, we get an average that should remove the effect of any outliers.

We also designed our simulations to take on different parameters such as number of stickers needed, stickers in a pack, etc. This allows us to use the same functions when World Cup 2022 has twice the number of stickers!

For more examples of random numbers and simulations, check out our expected goals tutorial.

Posted by FCPythonADMIN in Blog

Calculating ‘per 90’ with Python and Fantasy Football

When we are comparing data between players, it is very important that we standardise their data to ensure that each player has the same ‘opportunity’ to show their worth. The simplest way for us to do this, is to ensure that all players have the same amount of time within which to play. One popular way of doing this in football is to create ‘per 90’ values. This means that we will change our total amounts of goals, shots, etc. to show how many a player will do every 90 minutes of football that they play. This article will run through creating per 90 figures in Python by applying them to fantasy football points and data.

Follow the examples along below and feel free to use them where you are. Let’s get started by importing our modules and taking a look at our data set.

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("../Data/Fantasy_Football.csv")
data.head()
Out[1]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_saved penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0 0.0 1 1
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 0 1 0 53 4 419 42.7 1 1
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0 0.0 1 1
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 0 3 0 0 14 421 62.5 2 1
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 0 2 77 15.7 2 1

5 rows × 26 columns

Our data has a host of data on our players’ fantasy football performance. We have their names, of course, and also their points and contributing factors (goals, clean sheets, etc.). Crucially, we have the players’ minutes played – allowing us to calculate their per 90 figures for the other variables.

Calculating our per 90 numbers is reasonably simple, we just need to find out how many 90 minute periods our player has played, then divide the variable by this value. The function below will show this step-by-step and show Kane’s goals p90 in the Premier League at the time of writing (goals = 20, minutes = 1868):

In [2]:
def p90_Calculator(variable_value, minutes_played):
    
    ninety_minute_periods = minutes_played/90
    
    p90_value = variable_value/ninety_minute_periods
    
    return p90_value

p90_Calculator(20, 1868)
Out[2]:
0.9635974304068522

There we go, Kane scores 0.96 goals per 90 in the Premier League! Our code, while explanatory is three lines long, when it can all be in one line. Let’s try again, and check that we get the same value:

In [3]:
def p90_Calculator(value, minutes):
    return value/(minutes/90)

p90_Calculator(20, 1868)
Out[3]:
0.9635974304068522

Great job! The code has the same result, in a third of the lines, and I still think it is fairly easy to understand.

Next up, we need to apply this to our dataset. Pandas makes this easy, as we can simply call a new column, and run our command with existing columns as arguments:

In [4]:
data["total_points_p90"] = p90_Calculator(data.total_points,data.minutes)
data.total_points_p90.fillna(0, inplace=True)
data.head()
Out[4]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
0 Ospina 3 David Ospina 13 48 0 0.2 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
1 Cech 3 Petr Cech 33 54 0 4.9 84 3.7 0 1 0 53 4 419 42.7 1 1 3.652174
2 Martinez 3 Damian Emiliano Martinez 26 40 0 0.6 0 0.0 0 0 0 0 0 0 0.0 1 1 0.000000
3 Koscielny 3 Laurent Koscielny 6 60 2 1.6 76 4.2 0 3 0 0 14 421 62.5 2 1 4.288401
4 Mertesacker 3 Per Mertesacker 4 48 1 0.5 15 3.0 0 0 0 0 2 77 15.7 2 1 3.846154

5 rows × 27 columns

And there we have a total points per 90 column, which will hopefully offer some more insight than a simple points total. Let’s sort our values and view the top 5 players:

In [5]:
data.sort_values(by='total_points_p90', ascending =False).head()
Out[5]:
web_name team_code first_name second_name squad_number now_cost dreamteam_count selected_by_percent total_points points_per_game penalties_missed yellow_cards red_cards saves bonus bps ict_index element_type team total_points_p90
271 Tuanzebe 1 Axel Tuanzebe 38 39 0 1.7 1 1.0 0 0 0 0 0 3 0.0 2 12 90.0
322 Sims 20 Joshua Sims 39 43 0 0.1 1 1.0 0 0 0 0 0 3 0.0 3 14 90.0
394 Janssen 6 Vincent Janssen 9 74 0 0.1 1 1.0 0 0 0 0 0 2 0.0 4 17 90.0
166 Hefele 38 Michael Hefele 44 42 0 0.1 1 1.0 0 0 0 0 0 4 0.4 2 8 90.0
585 Silva 13 Adrien Sebastian Perruchet Silva 14 60 0 0.0 1 1.0 0 0 0 0 0 5 0.3 3 9 22.5

5 rows × 27 columns

Huh, probably not what we expected here… players with 1 point, and some surprisng names too. Upon further examination, these players suffer from their sample size. They’ve played very few minutes, so their numbers get overly inflated… there’s obviously no way a player gets that many points per 90!

Let’s set a minimum time played to our data to eliminate players without a big enough sample:

In [6]:
data.sort_values(by='total_points_p90', ascending =False)[data.minutes>400].head(10)[["web_name","total_points_p90"]]
Out[6]:
web_name total_points_p90
233 Salah 9.629408
279 Martial 8.927126
246 Sterling 8.378721
225 Coutinho 8.358882
325 Austin 8.003356
278 Lingard 7.951807
544 Niasse 7.460317
256 Agüero 7.346939
389 Son 7.288503
255 Bernardo Silva 7.119403

That seems a bit more like it! We’ve got some of the highest scoring players here, like Salah and Sterling, but if Austin, Lingard and Bernardo Silva can nail down long-term starting spots, we should certainly keep an eye on adding them in!

Let’s go back over this by creating a new column for goals per 90 and finding the top 10:

In [7]:
data["goals_p90"] = p90_Calculator(data.goals_scored,data.minutes)
data.goals_p90.fillna(0, inplace=True)
data.sort_values(by='goals_p90', ascending =False)[data.minutes>400].head(10)[["web_name","goals_p90"]]
Out[7]:
web_name goals_p90
233 Salah 0.968320
393 Kane 0.967222
325 Austin 0.906040
256 Agüero 0.823364
246 Sterling 0.797973
544 Niasse 0.793651
279 Martial 0.728745
258 Jesus 0.714995
278 Lingard 0.632530
160 Rooney 0.630252

Great job! Hopefully you can see that this is a much fairer way to rate our player data – whether for performance, fantasy football or media reporting purposes.

Summary

p90 data is a fundamental concept of football analytics. It is one of the first steps of cleaning our data and making it fit for comparisons. This article has shown how we can apply the concept quickly and easily to our data. For next steps, you might want to take a look at visualising this data, or looking at further analysis techniques.

Posted by FCPythonADMIN in Blog

Indexing NumPy Arrays

In the Arrays intro, you probably noticed an example where we used square brackets after an array to select a specific part of the array. In this article, we will see how we can identify and select parts of our arrays, whether 1d or 2d.

Let’s get started by importing our NumPy module and setting up an array of World Cup years. We’ll do this by calling ‘arange’ for every 4 years, then using ‘np.delete’ – a numpy function to remove parts of an array – to remove 1942 & 1946 (these are in locations 3 & 4).

In [1]:
import numpy as np

#Every 4 years since 1930
WCYears = np.arange(1930,2018,4)

#No World Cup in 1942 or 1946
WCYears = np.delete(WCYears,(3,4))

WCYears
Out[1]:
array([1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974, 1978,
       1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014])

Bracket selection

Following an array with square brackets is the easiest way to select an individual value or range.

Two important things to remember that will be second nature to you soon:

1) Any range includes the first number, but not the final one.
2) Indexes begin at 0 in Python.

In [2]:
#What year was the third World Cup held?
WCYears[2]
Out[2]:
1938
In [3]:
#Show me the 4 World Cup years following WW2
WCYears[3:7]
Out[3]:
array([1950, 1954, 1958, 1962])

Bracket selection allows you to make changes to any of these figures, just like you would do with a variable. Be careful, though, as you cannot undo this and will have to go several steps back!

Selections in a 2d array (grid)

Bracket selection is also used to make selections on a grid. We have two options to do so:

grid[row][column] OR grid[row,column]

Both are essentially the same, so use whatever works for you and be aware that you may see it differently elsewhere!

In [4]:
#Create our 2d array
WCYears = [2002,2006,2010,2014]
WCHosts = ["Japan/Korea","Germany","South Africa","Brazil"]
WCWinners = ["Brazil","Italy","Spain","Germany"]

WCArray = np.array((WCYears,WCHosts,WCWinners))
WCArray
Out[4]:
array([['2002', '2006', '2010', '2014'],
       ['Japan/Korea', 'Germany', 'South Africa', 'Brazil'],
       ['Brazil', 'Italy', 'Spain', 'Germany']],
      dtype='<U12')
In [5]:
#2010 is the third year, find the host in the second row
WCArray[1,2]
Out[5]:
'South Africa'
In [6]:
#Find the winner of the last World Cup
#Negative selection!

WCArray[2,-1]
Out[6]:
'Germany'

Selecting parts of an array with criteria

So far, we have only selected values when we know their location. Quite often, we won’t know where things are, or will want to find something completely new.

NumPy allows us to select based on criteria that we give it. We will give it a test and if numbers return as ‘True’, then it will give them to us.

In [7]:
WCYears = np.array([1966,1970,1974,1978])
WCTopScorers = np.array(["Eusebio","Muller","Lato","Kempes"])
WCGoals = np.array([9,10,7,6])
In [8]:
#Where does the top scorer score more than 8 goals?
WCGoals > 8
Out[8]:
array([ True,  True, False, False], dtype=bool)
In [9]:
#Not particularly useful, but we can then use bracket selection with this!

WCTopScorers[(WCGoals>8)]
Out[9]:
array(['Eusebio', 'Muller'],
      dtype='<U7')

As you can see, the first query (‘WCGoals>8′) returns an array of True or False. We then plug this into another array, which will return only the locations that are True. This allows us to get the scorers’ names, not just a True or False. This is useful with small arrays, but will be a massive help when we deal with bigger datasets.

Summary

Selecting values in either a 1d or 2d array is really easy. If we know where we want to look, we have square brackets containing index numbers – one number for a 1d array, or 2 numbers for row/column in a grid.

However, we do not just have to pass in the index, we can pass in an array of True or False values that allow us to filter based on a criteria.

Posted by FCPythonADMIN in Data Analysis, 0 comments

Arrays in NumPy

NumPy is a fundamental package for data analysis in Python as the majority of other packages in the Python data eco-system build on it. Subsequently, it makes sense for us to have an understanding of what NumPy can help us with and its general principles.

In the following article, we’ll take a look at arrays in Python – which essentially take the ‘lists’ data type to a new level. We’ll have powerful new methods, random number generation and a way of storing data in grid-like structures, not just lists like we have seen.

Let’s get things started and import the numpy library. Take a read here if you need to install it!

In [1]:
import numpy as np

Creating a NumPy array

Firstly, we need to create our array. We have a number of different ways to do this.

One way is to convert a pre-existing list into an array. Below, we do this to create a 1d array (one line) and a 2d array (a grid, or matrix).

In [2]:
#Three lists, one for GK heights, one for GK weights, one for names

GKNames = ["Kaller","Fradeel","Hayward","Honeyman"]
GKHeights = [184,188,191,193]
GKWeights = [81,85,103,99]

#Create an array of names

print(np.array(GKNames))

#Create a matrix of all three lists, start with a list of lists

GKMatrix = [GKNames,GKHeights,GKWeights]
print(np.array(GKMatrix))
['Kaller' 'Fradeel' 'Hayward' 'Honeyman']
[['Kaller' 'Fradeel' 'Hayward' 'Honeyman']
 ['184' '188' '191' '193']
 ['81' '85' '103' '99']]

There we have two examples of creating arrays from a list. Our second one is particularly cool – is just like a spreadsheet and will make our data much easier to deal with.

Aside from creating our own arrays from lists we already have, numpy can create them with its own methods:

In [3]:
#With 'arange', we can create arrays just like we created lists with 'range'
#This gives us an array ranging from the numbers in the arguments

np.arange(0,12)
Out[3]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
In [4]:
#Want a blank array? Create it full of zeros with 'zeros'
#The argument within it create the shape of a 2d or 3d array

np.zeros((3,11))
Out[4]:
array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [5]:
#Hate zeros? Why not use 'ones'?!

np.ones((3,11))
Out[5]:
array([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])
In [6]:
#Creating dummy data or need a random number?
#randint and randn are useful here

#Creates random numbers around a standard distribution from 0
#The argument gives us the array's shape
print(np.random.randn(3,3))

#Creates random numbers between two numbers that we give it
#The third argument gives us the shape of the array
print(np.random.randint(1,100,(3,3)))
[[ 1.1403024  -1.76082025 -0.71738168]
 [-0.44740344 -0.16392845  1.04022957]
 [ 1.97068835  0.50075891 -0.33750378]]
[[70 28 67]
 [19 54 11]
 [ 9 34 67]]

Looking for more ways to create arrays? Take a look in the documentation for ‘rand’, ‘linspace’, ‘eye’ and others!

Array Methods

Not only does NumPy give us a good way to store our data, it also gives us some great tools to simplify working with it.

Let’s find the tallest goalkeeper from our earlier examples with array methods.

In [7]:
#Three lists, one for GK heights, one for GK weights, one for names
#Create an array with each list

GKNames = ["Kaller","Fradeel","Hayward","Honeyman"]
GKHeights = [184,188,191,193]
GKWeights = [81,85,103,99]

np.array(GKNames)
GKHeights = np.array(GKHeights)
np.array(GKWeights)

#What is the largest height, .max()?

GKHeights.max()
Out[7]:
193
In [8]:
#What location is the max, .argmax()?

GKHeights.argmax()
Out[8]:
3
In [9]:
#Can I use this method to locate the player's name?
#Instead of a number in the square brackets, I can just put this method

GKNames[GKHeights.argmax()]
Out[9]:
'Honeyman'

With only four players this is a bit long-winded, but I’m sure that you can see the benefit if we have a whole academy of players and we need to find our tallest player from 100s. Swap the max to min to find the smallest value in an array.

Summary

You are likely to use NumPy with all sorts of packages as you develop your Python skills. Having a healthy appreciation of how it works, especially with arrays, will save you lots of headaches down the line.

In this page, we saw how we can create them from scratch, or convert them from lists. We created flat, 1-d arrays and 2-d grids. We then applied methods to find highest datapoints and even used these to navigate our grid. Great work! Take a look at our extension on NumPy arrays here to learn more.

Posted by FCPythonADMIN in Data Analysis, 0 comments