Filter Python database to find better, cheaper players using FIFA’s ratings

In this tutorial, we’re going to play the role of a Chairman at a team that wants to reduce the wage bill, but bring in players perceived to be at the same level. Our team has a limited scouting network, so we’re going to make use of a database that comes with player rankings, wages and positions – taken from the FIFA database.

Of course, there is lot more that we would need to take into account when scouting a player. This is half-tongue in cheek tutorial, half-still better than lots of multi-million deals 👀

The aim of the tutorial is to learn about filtering information in larger datasets. We’ll filter it based on a text value that we will provide manually, and also on whether or not a value is greater or smaller than another one – wage bill, for example.

The data for this project can be found on Kaggle here. Let’s get our modules imported and save our team some money!

In [1]:
#Import modules
import numpy as np
import pandas as pd 

We’ll open the dataset with pandas. We have it within a folder, so we navigate to the folder and name the csv. We’ll name this file ‘data’.

Calling .head() will allow us to check out the structure. The dataset is pretty rich and we can get loads of biographical data, in addition to ratings. Loads more that you can do with this beyond this demo!

In [2]:
#load data
data = pd.read_csv("fifa-20-complete-player-dataset/players_20.csv")
data.head()
Out[2]:
sofifa_id player_url short_name long_name age dob height_cm weight_kg nationality club lwb ldm cdm rdm rwb lb lcb cb rcb rb
0 158023 https://sofifa.com/player/158023/lionel-messi/… L. Messi Lionel Andrés Messi Cuccittini 32 1987-06-24 170 72 Argentina FC Barcelona 68+2 66+2 66+2 66+2 68+2 63+2 52+2 52+2 52+2 63+2
1 20801 https://sofifa.com/player/20801/c-ronaldo-dos-… Cristiano Ronaldo Cristiano Ronaldo dos Santos Aveiro 34 1985-02-05 187 83 Portugal Juventus 65+3 61+3 61+3 61+3 65+3 61+3 53+3 53+3 53+3 61+3
2 190871 https://sofifa.com/player/190871/neymar-da-sil… Neymar Jr Neymar da Silva Santos Junior 27 1992-02-05 175 68 Brazil Paris Saint-Germain 66+3 61+3 61+3 61+3 66+3 61+3 46+3 46+3 46+3 61+3
3 200389 https://sofifa.com/player/200389/jan-oblak/20/… J. Oblak Jan Oblak 26 1993-01-07 188 87 Slovenia Atlético Madrid NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 183277 https://sofifa.com/player/183277/eden-hazard/2… E. Hazard Eden Hazard 28 1991-01-07 175 74 Belgium Real Madrid 66+3 63+3 63+3 63+3 66+3 61+3 49+3 49+3 49+3 61+3

5 rows × 104 columns

The dataset has over 18,000 players, so we are going to select a team to save some money for. Let’s put ourselves in the shoes of a business-minded Chairman at Manchester United.

First up, let’s take stock of our own team. Here, we are going to filter based on a specific piece of text that we know – the club name.

We’ll create a function that filters the big database for a match with a team name. There are lots of columns to the main dataset, so we’re only going to extract the name, wage, value, positions, ranking and age of our players.

In [3]:
#Define function called club, that looks for a team name and return the selected columns
def club(teamName):
    
    return data[data['club'] == teamName][['short_name','wage_eur','value_eur','player_positions','overall','age']]


#Use the club function to find the team, and sort the squad by wage bill
club('Manchester United').sort_values("wage_eur", ascending = False)
Out[3]:
short_name wage_eur value_eur player_positions overall age
24 P. Pogba 250000 72500000 CM, CDM 88 26
14 De Gea 205000 56000000 GK 89 28
301 A. Sánchez 150000 21000000 LW, LM 82 30
163 A. Martial 140000 34500000 LW, ST, LM 83 23
295 Juan Mata 135000 19500000 CAM, RM 82 31
319 J. Lingard 135000 24000000 CAM, RW 82 26
171 M. Rashford 125000 35500000 ST, LM 83 21
280 H. Maguire 125000 23000000 CB 82 26
534 C. Smalling 115000 12500000 CB 80 29
396 N. Matić 115000 14500000 CDM, CM 81 30
644 Fred 110000 15000000 CM 79 26
462 E. Bailly 105000 16500000 CB 80 25
654 P. Jones 100000 12500000 CB 79 27
347 L. Shaw 97000 20000000 LB 81 23
335 V. Lindelöf 97000 22000000 CB 81 24
957 M. Rojo 95000 8500000 CB 78 29
508 S. Romero 85000 9000000 GK 80 32
1140 A. Young 85000 4000000 RB, LB 77 33
1044 Andreas Pereira 81000 11500000 CM, CAM, RM 77 23
1962 M. Darmian 79000 5500000 LB, RB 75 29
1028 S. McTominay 73000 11500000 CDM, CM 77 22
571 A. Wan-Bissaka 71000 17500000 RB 79 21
1631 Diogo Dalot 54000 10000000 RB, LB, RM 75 20
3299 D. James 46000 5500000 LM, LW, RM 72 21
2661 A. Tuanzebe 46000 6500000 CB, RB 73 21
2702 T. Fosu-Mensah 46000 5500000 RB, CB 73 21
3689 L. Grant 30000 525000 GK 72 36
8010 C. Borthwick-Jackson 23000 925000 LB 67 22
8839 D. Mitchell 19000 1000000 LB 66 22
9901 T. Chong 14000 1200000 RW, LW, RM 65 19
6622 A. Gomes 12000 1800000 CAM, CM, LW 68 18
7690 M. Greenwood 6000 1500000 CF, RM, ST 67 17
12167 J. Garner 5000 725000 CDM, CM, CB 63 18

Now we have all of our players, we need to find a way to take information from a player we may want to replace and use it to search the initial database.

Let’s start with an easy example. We’re going the Extract De Gea’s information from the database, assign his wage and rating to variables and finally recommend a player who is as good as him, but has a cheaper wage.

First step, get the DDG information and variables sorted:

In [4]:
#Extract DDG's information, just like we did with the team name before
DDG = data[data['short_name'] == 'De Gea'][['short_name','wage_eur','value_eur','player_positions','overall','age']]

#Assign DDG's wage, position, rating and age to variables
DDGWage = DDG['wage_eur'].item()
DDGPos = DDG['player_positions'].item()
DDGRating = DDG['overall'].item()
DDGAge = DDG['age'].item()

Secondly, we are going to filter the initial database by these values. To do this, we will create a longlist of players that match DDG’s position. From there, we need to slim the list down to meet other criteria. So we will create lists of players that do not meet requirements, then delete them from the longlist.

In [5]:
#Create a list of goalkeepers, matching DDG's position
longlist = data[data['player_positions'] == DDGPos][['short_name','wage_eur','value_eur','player_positions','overall','age']]

#Create a list of players that have a lower overall rank than DDG
removals = longlist[longlist['overall'] <= DDGRating].index
#Drop these players
longlist.drop(removals , inplace=True)

#Repeat above, but for players with a larger wage
removals = longlist[longlist['wage_eur'] > DDGWage].index
longlist.drop(removals , inplace=True)

#Repeat above, but for older players
removals = longlist[longlist['age'] >= DDGAge].index
longlist.drop(removals , inplace=True)

#Show me our potential replacements, sorted by lowest wages
longlist.sort_values("wage_eur")
Out[5]:
short_name wage_eur value_eur player_positions overall age
3 J. Oblak 125000 77500000 GK 91 26

And we have one player that meets the criteria – Jan Oblak!

He ticks all of the boxes we laid out, cheaper, better and younger. In fact, his weekly wage is €80,000 cheaper! 🤑🤑

Chairmen of the world, happy to audit your team for a bargain fee of 1% of potential savings! Alternatively, if you would like to do it yourselves, it would save some time to put the above into a function. Let’s give that a go.

We’ll do exactly what we did before, but we’re going to add a second argument to the function that allows us to look at players rated lower than our own. We’ll call this skillReduction and use it to change our filter criteria for the rating piece. It is set to 0 by default, in this case it will work just as it did above.

In [6]:
def cheapReplacement(player, skillReduction = 0):
    
    #Get the replacee with the name provided in the argument
    replacee = data[data['short_name'] == player][['short_name','wage_eur','value_eur','player_positions','overall','age']]

    #Assign the relevant details of this player to variables
    replaceePos = replacee['player_positions'].item()
    replaceeWage = replacee['wage_eur'].item()
    replaceeAge = replacee['age'].item()
    replaceeOverall = replacee['overall'].item() - skillReduction
    
    #Create the longlist of players that share the position
    longlist = data[data['player_positions'] == replaceePos][['short_name','wage_eur','value_eur','player_positions','overall','age']]
    
    
    #Create list of players that do not meet the rating criteria and drop them from the longlist
    removals = longlist[longlist['overall'] <= replaceeOverall].index
    longlist.drop(removals , inplace=True)
    
    #Repeat for players with higher wages
    removals = longlist[longlist['wage_eur'] > replaceeWage].index
    longlist.drop(removals , inplace=True)

    #Repeat for older players
    removals = longlist[longlist['age'] >= replaceeAge].index
    longlist.drop(removals , inplace=True)
    
    #Display the players that meet the requirements
    return longlist.sort_values("wage_eur")

And let’s give it a go for Fred

In [7]:
cheapReplacement('Fred')
Out[7]:
short_name wage_eur value_eur player_positions overall age
330 H. Aouar 59000 26000000 CM 81 21
259 K. Demirbay 75000 27500000 CM 82 25
245 Rúben Neves 89000 29000000 CM 82 22

Cha-ching, 46k-16k a week saved. Give me the job.

“FC Python, give us a cheaper centre back!”

In [8]:
cheapReplacement('V. Lindelöf')
Out[8]:
short_name wage_eur value_eur player_positions overall age
165 J. Tah 67000 31000000 CB 83 23
246 P. Kimpembe 75000 26500000 CB 82 23
73 M. de Ligt 76000 50000000 CB 85 19
170 M. Akanji 77000 31000000 CB 83 23
177 D. Sánchez 93000 30500000 CB 83 23

And another 30k-4k. Crack open Youtube and get your eyes around ‘TAH – WELCOME TO UNITED – GOALS & MONSTER TACKLES 19/20’.

Last one, we’re going to check out Pogba’s possible replacements. And we’re going to assume that FIFA overrates him by 8 points, so we’ll accept players up to that level.

In [9]:
cheapReplacement('P. Pogba', 8)
Out[9]:
short_name wage_eur value_eur player_positions overall age
331 T. Ndombele 87000 26000000 CM, CDM 81 22
243 N. Keïta 95000 29000000 CM, CDM 82 24
166 C. Tolisso 110000 34000000 CM, CDM 83 24
338 E. Can 110000 23000000 CM, CDM 81 25
173 A. Rabiot 120000 33000000 CM, CDM 83 24
75 F. de Jong 195000 52000000 CM, CDM 85 22

Squad management/filtering dataframes 101 done, get your job applications out.

Developing this further, we could of course take more factors into account than the overall rating. Look at potential ability, key ability ratings, etc. for a better fit for the player.

You may also want to run the function that we made over a whole team to see potential savings. Then do it against loads of teams and compare squad strengths and potential savings. Would be interesting to see which teams are over and under paying for their squad!