Dealing with Missing Data

At some point in your dealings with data, you will have to deal with missing values.

Depending on the situation, you might need to fill these gaps, or delete rows/columns entirely. Luckily Pandas makes this easy for us.

Let’s get our modules and dataset prepared, before we look to delete series or fill gaps.

In [1]:
import numpy as np
import pandas as pd

#DataFrame is the contract details for our transfer targets, where known.
#'np.nan' is a numpy value that shows that there is not a number.
# In this case, it demonstrates missing data.

df = pd.DataFrame({'Wage':[150000,123000,np.nan],
'GoalBonus':[4000,np.nan,np.nan],
'ImageRights':[50000,70000,100000]}, 
index=['Konda','Makho','Grey'],
columns=['Wage','GoalBonus','ImageRights'])

df
Out[1]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 NaN 70000
Grey NaN NaN 100000

So we know everything about Konda, 2/3 values for Makho and just Grey’s image rights. Let’s look at how we can tidy this up.

Removing rows & columns with missing data

If you decide to bin the players with missing data, it is simple with the ‘.dropna()’ method:

In [2]:
df.dropna()
Out[2]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000

So we are only left with the player with all values. What if you want to do the same for columns?

In [3]:
df.dropna(axis=1)
Out[3]:
ImageRights
Konda 50000
Makho 70000
Grey 100000

And now we have only the column with complete data – ImageRights. You’ll see that we pass the argument ‘axis=1’ to do look at columns, not rows.

‘.dropna()’ can also take the argument ‘thresh’ to change the amount of missing values you’re happy to deal with. Makho has only 1 missing value, whereas Grey has 2. Below, we’ll allow Makho into our dataset, but continue to exclude Grey:

In [4]:
df.dropna(thresh=2)
Out[4]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 NaN 70000

Fill data

Sometimes, deleting rows and columns is a bit drastic. You may instead want to simply fill in the gaps instead. Rather than ‘.dropna()’, we can instead ‘.fillna()’, passing the desired value as the argument.

In [5]:
df.fillna(value=0)
Out[5]:
Wage GoalBonus ImageRights
Konda 150000.0 4000.0 50000
Makho 123000.0 0.0 70000
Grey 0.0 0.0 100000

That’s a lot cleaner if we want to perform other opearations, great work!

You might want to be a bit smarter than filling with 0s. As an example, you might want to take a column and use the average to fill the rest of the gaps:

In [6]:
df['Wage'].fillna(value=df['Wage'].mean())
Out[6]:
Konda 150000.0
Makho 123000.0
Grey 136500.0
Name: Wage, dtype: float64

Summary

Dealing with datasets can often be unwieldy and messy. Quite commonly, this is caused by missing values. Pandas helps us greatly with ‘.fillna()’ and ‘.dropna()’ and their ability to remove or change missing values with ease.

If you are having issues with mismatching or datasets split across different files, you might want to check out the joining & merging data article.

Leave a Reply