At some point in your dealings with data, you will have to deal with missing values.

Depending on the situation, you might need to fill these gaps, or delete rows/columns entirely. Luckily Pandas makes this easy for us.

Let’s get our modules and dataset prepared, before we look to delete series or fill gaps.

In [1]:

import numpy as np
import pandas as pd

#DataFrame is the contract details for our transfer targets, where known.
#'np.nan' is a numpy value that shows that there is not a number.
# In this case, it demonstrates missing data.

df = pd.DataFrame({'Wage':[150000,123000,np.nan],
'GoalBonus':[4000,np.nan,np.nan],
'ImageRights':[50000,70000,100000]}, 
index=['Konda','Makho','Grey'],
columns=['Wage','GoalBonus','ImageRights'])

df

Out[1]:

	Wage	GoalBonus	ImageRights
Konda	150000.0	4000.0	50000
Makho	123000.0	NaN	70000
Grey	NaN	NaN	100000

So we know everything about Konda, 2/3 values for Makho and just Grey’s image rights. Let’s look at how we can tidy this up.

Removing rows & columns with missing data

If you decide to bin the players with missing data, it is simple with the ‘.dropna()’ method:

In [2]:

df.dropna()

Out[2]:

	Wage	GoalBonus	ImageRights
Konda	150000.0	4000.0	50000

So we are only left with the player with all values. What if you want to do the same for columns?

In [3]:

df.dropna(axis=1)

Out[3]:

	ImageRights
Konda	50000
Makho	70000
Grey	100000

And now we have only the column with complete data – ImageRights. You’ll see that we pass the argument ‘axis=1’ to do look at columns, not rows.

‘.dropna()’ can also take the argument ‘thresh’ to change the amount of missing values you’re happy to deal with. Makho has only 1 missing value, whereas Grey has 2. Below, we’ll allow Makho into our dataset, but continue to exclude Grey:

In [4]:

df.dropna(thresh=2)

Out[4]:

	Wage	GoalBonus	ImageRights
Konda	150000.0	4000.0	50000
Makho	123000.0	NaN	70000

Fill data

Sometimes, deleting rows and columns is a bit drastic. You may instead want to simply fill in the gaps instead. Rather than ‘.dropna()’, we can instead ‘.fillna()’, passing the desired value as the argument.

In [5]:

df.fillna(value=0)

Out[5]:

	Wage	GoalBonus	ImageRights
Konda	150000.0	4000.0	50000
Makho	123000.0	0.0	70000
Grey	0.0	0.0	100000

That’s a lot cleaner if we want to perform other opearations, great work!

You might want to be a bit smarter than filling with 0s. As an example, you might want to take a column and use the average to fill the rest of the gaps:

In [6]:

df['Wage'].fillna(value=df['Wage'].mean())

Out[6]:

Konda 150000.0
Makho 123000.0
Grey 136500.0
Name: Wage, dtype: float64

Summary

Dealing with datasets can often be unwieldy and messy. Quite commonly, this is caused by missing values. Pandas helps us greatly with ‘.fillna()’ and ‘.dropna()’ and their ability to remove or change missing values with ease.

If you are having issues with mismatching or datasets split across different files, you might want to check out the joining & merging data article.