At some point in your dealings with data, you will have to deal with missing values.
Depending on the situation, you might need to fill these gaps, or delete rows/columns entirely. Luckily Pandas makes this easy for us.
Let’s get our modules and dataset prepared before we look at deleting rows and columns or filling gaps.
import numpy as np
import pandas as pd
# DataFrame holding the contract details for our transfer targets, where known.
# np.nan is NumPy's "not a number" value.
# Here it marks missing data.
df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'],
                  columns=['Wage', 'GoalBonus', 'ImageRights'])
df
So we know everything about Konda, 2/3 values for Makho and just Grey’s image rights. Let’s look at how we can tidy this up.
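On a larger dataset you cannot count the gaps by eye. As a minimal sketch, assuming the same DataFrame as above, '.isna()' combined with '.sum()' gives the number of missing values per player and per column. The variable names here are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'])

# isna() marks each missing cell as True; summing across the columns
# (axis=1) counts the gaps for each player.
missing_per_player = df.isna().sum(axis=1)
print(missing_per_player)

# Summing down the rows (the default) counts the gaps in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```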
Removing rows & columns with missing data
If you decide to bin the players with missing data, it is simple with the ‘.dropna()’ method:
df.dropna()
So we are only left with the player with all values. What if you want to do the same for columns?
df.dropna(axis=1)
And now we have only the column with complete data: ImageRights. You’ll see that we pass the argument ‘axis=1’ to look at columns, not rows.
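To make the two calls concrete, here is a small sketch, assuming the same DataFrame as above, that checks what each one keeps. It also shows that ‘.dropna()’ returns a new DataFrame and leaves the original untouched, so you need to assign the result if you want to keep it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'])

# Default (axis=0): drop every row that contains a missing value.
rows_kept = df.dropna()
print(list(rows_kept.index))    # ['Konda']

# axis=1: drop every column that contains a missing value.
cols_kept = df.dropna(axis=1)
print(list(cols_kept.columns))  # ['ImageRights']

# The original DataFrame is unchanged by either call.
print(df.shape)                 # (3, 3)
```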
‘.dropna()’ can also take the argument ‘thresh’, which sets the minimum number of non-missing values a row needs in order to be kept. Makho has 2 known values, whereas Grey has only 1. Below, we’ll allow Makho into our dataset, but continue to exclude Grey:
df.dropna(thresh=2)
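Because ‘thresh’ counts the values that are present rather than the ones that are missing, it can help to try a few settings side by side. This is a minimal sketch, assuming the same DataFrame as above. The dictionary name is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'])

# thresh=n keeps only the rows with at least n non-missing values.
kept_by_thresh = {}
for n in (1, 2, 3):
    kept_by_thresh[n] = list(df.dropna(thresh=n).index)
    print(n, kept_by_thresh[n])

# 1 ['Konda', 'Makho', 'Grey']
# 2 ['Konda', 'Makho']
# 3 ['Konda']
```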
Fill data
Sometimes, deleting rows and columns is a bit drastic. You may want to simply fill in the gaps instead. Rather than ‘.dropna()’, we can use ‘.fillna()’, passing the desired value as the argument.
df.fillna(value=0)
That’s a lot cleaner if we want to perform other operations, great work!
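One thing to watch: like ‘.dropna()’, ‘.fillna()’ returns a new DataFrame rather than changing the original. The sketch below, assuming the same DataFrame as above, assigns the result to a new variable (the name ‘filled’ is only for illustration) and confirms that the original still has its gaps.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'])

# Keep the filled version by assigning it to a name.
filled = df.fillna(value=0)

# Total number of missing cells before and after filling.
print(int(df.isna().sum().sum()))      # 3, the original is unchanged
print(int(filled.isna().sum().sum()))  # 0, every gap is now a 0
```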
You might want to be a bit smarter than filling with 0s. As an example, you might want to take a column and use the average to fill the rest of the gaps:
df['Wage'].fillna(value=df['Wage'].mean())
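The line above only returns the filled column. To keep it, assign it back to the DataFrame. As a sketch under the same assumptions as before, the example below does that for ‘Wage’ and then, as one possible extension, fills every column with its own average by passing ‘df.mean()’ to ‘.fillna()’. The name ‘filled_all’ is only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Wage': [150000, 123000, np.nan],
                   'GoalBonus': [4000, np.nan, np.nan],
                   'ImageRights': [50000, 70000, 100000]},
                  index=['Konda', 'Makho', 'Grey'])

# mean() skips missing values, so this is the average of the two known wages.
df['Wage'] = df['Wage'].fillna(value=df['Wage'].mean())
print(df.loc['Grey', 'Wage'])  # 136500.0

# df.mean() gives one average per column; fillna matches them up by column name.
filled_all = df.fillna(value=df.mean())
print(filled_all)
```

Bear in mind that a column average built from very few values, such as the single known goal bonus here, may not be a sensible stand-in for the missing ones.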
Summary
Dealing with datasets can often be unwieldy and messy. Quite commonly, this is caused by missing values. Pandas helps us greatly with ‘.fillna()’ and ‘.dropna()’ and their ability to remove or change missing values with ease.
If you are having issues with mismatched datasets or data split across different files, you might want to check out the joining & merging data article.