DataFrames with Pandas

DataFrames power data analysis in Python. They let us work with grids of data much as we would in a conventional spreadsheet, with labelled rows and columns, built-in functions, filtering and many more tools for getting insight out of our data.

This page introduces how to create DataFrames, along with a few functions and tools for making selections within them. Let’s import our packages to get started (I’m sure you’ve installed them by now!).

In [1]:
import pandas as pd
import numpy as np

Our table below has a scout report on 4 different players’ shooting, passing and defending skills.

In [2]:
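PlayerList = ["Pagbo", "Grazemen", "Cantay", "Ravane"]
SkillList = ["Shooting", "Passing", "Defending"]

# For this example, a random number generator stands in for our scout
# (I wouldn't recommend this for an actual team).
# randint(1, 10, (4, 3)) draws whole numbers from 1 to 9 (the upper bound
# is excluded) into a grid of 4 rows and 3 columns.
ScoresArray = np.random.randint(1, 10, (4, 3))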

df = pd.DataFrame(data=ScoresArray, index=PlayerList, columns=SkillList)
df
Out[2]:
          Shooting  Passing  Defending
Pagbo            6        7          3
Grazemen         7        3          3
Cantay           2        9          2
Ravane           2        8          5

In this example, pd.DataFrame takes three arguments to build a fully labelled DataFrame. data holds the values that make up the body. index runs down the y axis and names each row. Finally, columns runs along the x axis and names each column.
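To make the three arguments concrete, here is a minimal sketch that swaps the random scores for the fixed values shown in the output above, so the result is the same on every run. The names scores, players, skills and report are just illustrative choices.

```python
import numpy as np
import pandas as pd

# Fixed scores (copied from the output above) instead of random ones
scores = np.array([[6, 7, 3],
                   [7, 3, 3],
                   [2, 9, 2],
                   [2, 8, 5]])
players = ["Pagbo", "Grazemen", "Cantay", "Ravane"]
skills = ["Shooting", "Passing", "Defending"]

report = pd.DataFrame(data=scores, index=players, columns=skills)

print(report.shape)          # (number of rows, number of columns)
print(list(report.index))    # the row labels we passed as index
print(list(report.columns))  # the column labels we passed as columns
```

The body needs one row of data per index label and one column per column label, so a 4x3 array pairs with 4 player names and 3 skill names.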

There are other ways to create DataFrames, but this will serve us perfectly for now.
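As one example of those other ways, a DataFrame can also be built from a dictionary in which each key becomes a column name. This is only a sketch using the same fixed scores as the output above; the variable name report is an assumption for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
report = pd.DataFrame(
    {
        "Shooting": [6, 7, 2, 2],
        "Passing": [7, 3, 9, 8],
        "Defending": [3, 3, 2, 5],
    },
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

print(report)
```

This form can be easier to read when the data arrives column by column rather than as a ready-made grid.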

You’ve come a long way from lists and individual values!

Selecting and indexing

You can probably guess how we select individual columns, rows and groups of rows:

In [3]:
#Square brackets for columns
#If the result looks familiar, that's because DataFrame columns are Series!
df['Shooting']
Out[3]:
Pagbo       6
Grazemen    7
Cantay      2
Ravane      2
Name: Shooting, dtype: int32

In [4]:
#For rows, we use .loc if we use a name
#Turns out that DataFrame rows are also Series!
df.loc['Pagbo']
Out[4]:
Shooting     6
Passing      7
Defending    3
Name: Pagbo, dtype: int32

In [5]:
#Or, if we use an index number, .iloc
#The slice 1:3 returns positions 1 and 2; the end position is excluded
df.iloc[1:3]
Out[5]:
          Shooting  Passing  Defending
Grazemen         7        3          3
Cantay           2        9          2
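The same two tools also reach individual values and smaller blocks of the table. The sketch below assumes the fixed scores from the first output and an illustrative variable called report; both .loc and .iloc accept a row selector followed by a column selector.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 7, 2, 2], "Passing": [7, 3, 9, 8], "Defending": [3, 3, 2, 5]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# One value by label: row label first, then column label
cantay_passing = report.loc["Cantay", "Passing"]

# The same cell by position: row 2, column 1 (counting from 0)
same_value = report.iloc[2, 1]

# Several rows and columns at once: pass lists of labels
subset = report.loc[["Pagbo", "Ravane"], ["Shooting", "Defending"]]

print(cantay_passing, same_value)
print(subset)
```

One difference worth remembering: a position slice such as .iloc[1:3] leaves out the end position, whereas a label slice such as .loc['Grazemen':'Cantay'] includes both ends.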

Creating and removing columns/rows

DataFrames make it really easy for us to be flexible with our datasets. Let’s ask our scout for their thoughts on more players and skills.

In [6]:
#Scout, what about their communication?

df['Communication'] = np.random.randint(1,10,4)
df
Out[6]:
          Shooting  Passing  Defending  Communication
Pagbo            6        7          3              5
Grazemen         7        3          3              2
Cantay           2        9          2              3
Ravane           2        8          5              5

To add a new column, we refer to it with square brackets, give it a new name, then fill it with a list, array or Series that has one value per row. Remember, our scout uses random numbers as scouting scores.
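Here is a small sketch of three ways to fill a new column, using fixed numbers so the result is repeatable. The Leadership and Attack columns are made up purely for illustration and are not part of the scout report.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 7, 2, 2], "Passing": [7, 3, 9, 8]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# A plain list is assigned by position, so it needs one value per row
report["Communication"] = [5, 2, 3, 5]

# A Series is matched on its index labels, not on position
leadership = pd.Series({"Ravane": 8, "Pagbo": 4, "Cantay": 6, "Grazemen": 1})
report["Leadership"] = leadership

# A column can also be calculated from existing columns
report["Attack"] = report["Shooting"] + report["Passing"]

print(report)
```

Because the Series is aligned on player names, Pagbo receives 4 even though that value is listed second.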

Our new manager doesn’t care about defending, so they want these scores removed from the report. The ‘.drop’ method makes this easy:

In [7]:
#axis=1 refers to columns
#drop returns a new DataFrame, so we assign the result back to df
df = df.drop('Defending', axis=1)
df
Out[7]:
          Shooting  Passing  Communication
Pagbo            6        7              5
Grazemen         7        3              2
Cantay           2        9              3
Ravane           2        8              5
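It is worth seeing why the result is assigned back to df. In this sketch (fixed scores, illustrative names report and trimmed), .drop hands back a new DataFrame and leaves the original alone. The columns= keyword is an alternative spelling of axis=1 that some people find easier to read.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 7, 2, 2], "Passing": [7, 3, 9, 8], "Defending": [3, 3, 2, 5]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# columns="Defending" does the same job as ("Defending", axis=1)
trimmed = report.drop(columns="Defending")

print(list(report.columns))   # the original still has Defending
print(list(trimmed.columns))  # the returned copy does not
```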

Great job adding and removing columns!

Turns out, though, that our scout didn’t do their homework on the players’ transfer fees. Grazemen is far too expensive and we need to swap him for another player: Gomez.

For rows, we use ‘.drop’ with the axis argument set to 0:

In [8]:
df = df.drop('Grazemen',axis=0)
df
Out[8]:
        Shooting  Passing  Communication
Pagbo          6        7              5
Cantay         2        9              3
Ravane         2        8              5

To add a new row, we refer to it with ‘.loc’ (just like we did to refer to an existing one earlier). We then assign this new row a list or Series of values, one per column. Once again, we just use random numbers to fill the set here.

In [9]:
df.loc['Gomez'] = np.random.randint(1,10,3)
df
Out[9]:
        Shooting  Passing  Communication
Pagbo          6        7              5
Cantay         2        9              3
Ravane         2        8              5
Gomez          6        2              3
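The swap can be replayed with fixed numbers, as in the sketch below. It also shows one possible way to add several rows at once with pd.concat. The labels Player A and Player B and their scores are invented for illustration only.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 7, 2, 2], "Passing": [7, 3, 9, 8], "Communication": [5, 2, 3, 5]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# index="Grazemen" is the keyword spelling of ("Grazemen", axis=0)
report = report.drop(index="Grazemen")

# One new row: the list needs one value per column, in column order
report.loc["Gomez"] = [6, 2, 3]

# Several new rows: build a small DataFrame and join it on the end
newcomers = pd.DataFrame(
    {"Shooting": [4, 8], "Passing": [5, 6], "Communication": [7, 1]},
    index=["Player A", "Player B"],
)
report = pd.concat([report, newcomers])

print(report)
```

Adding rows one at a time with .loc is fine for a handful of players; for larger batches, collecting them first and concatenating once tends to be tidier.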

Conditional selection

With Series, we used a true or false condition to select the data that we wanted to see. We use the exact same logic here. Let’s see where players’ attributes are above 5:

In [10]:
df>5
Out[10]:
        Shooting  Passing  Communication
Pagbo       True     True          False
Cantay     False     True          False
Ravane     False     True          False
Gomez       True    False          False

Cool, we can see which players meet our criteria. You’ll notice that this returns a DataFrame of true or false values. Just like with Series, we can use these booleans to return a DataFrame according to our criteria.
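As a quick sketch of what that looks like for the whole table, the example below rebuilds the scores from the output above under the illustrative name report. Passing the true or false DataFrame back in square brackets keeps the matching values and blanks out the rest as NaN.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 2, 2, 6], "Passing": [7, 9, 8, 2], "Communication": [5, 3, 5, 3]},
    index=["Pagbo", "Cantay", "Ravane", "Gomez"],
)

mask = report > 5

# Values that fail the test become NaN; the shape stays the same
high_scores = report[mask]

# True counts as 1, so summing across each row counts a player's high scores
strengths_per_player = mask.sum(axis=1)

print(high_scores)
print(strengths_per_player)
```

Since no rows are removed here, this form is more useful for highlighting or counting than for filtering players out.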

Let’s apply this to a column (which we already know is just a series):

In [11]:
df['Shooting']>5
Out[11]:
Pagbo      True
Cantay    False
Ravane    False
Gomez      True
Name: Shooting, dtype: bool

As expected, we have a Series of boolean values. If we pass this Series to the DataFrame inside square brackets, we get back only the rows where the value is True.

Therefore, if the coach is asking for players with great shooting skills, we can easily filter our DataFrame.

In [12]:
df[df['Shooting']>5]
Out[12]:
       Shooting  Passing  Communication
Pagbo         6        7              5
Gomez         6        2              3
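If the coach has more than one requirement, conditions can be combined. The sketch below uses the same scores as the output above; the threshold of 5 for communication is an assumption for illustration. Note that pandas uses & for "and", | for "or" and ~ for "not", rather than the Python keywords.

```python
import pandas as pd

report = pd.DataFrame(
    {"Shooting": [6, 2, 2, 6], "Passing": [7, 9, 8, 2], "Communication": [5, 3, 5, 3]},
    index=["Pagbo", "Cantay", "Ravane", "Gomez"],
)

good_shooter = report["Shooting"] > 5
good_communicator = report["Communication"] >= 5

# Both conditions must hold
both = report[good_shooter & good_communicator]

# At least one condition must hold
either = report[good_shooter | good_communicator]

# Everyone who is NOT a good shooter
weaker_shooters = report[~good_shooter]

print(both)
print(either)
print(weaker_shooters)
```

When writing the conditions inline instead of naming them first, wrap each one in parentheses, for example report[(report["Shooting"] > 5) & (report["Communication"] >= 5)].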

Summary

Great job getting to understand DataFrames – the tool that will underpin our analysis moving forward.

You have created them, added and removed rows and columns, then filtered them according to your criteria.

There is lots more to learn about DataFrames, so read more on the topic and study how other articles put them to use.