DataFrames, from the pandas library, power data analysis in Python: they let us work with grids of data much as we would in a conventional spreadsheet. They give us labelled rows and columns, built-in functions, filtering and many more tools for getting insight from our data with little effort.
This page introduces creating DataFrames, along with a few functions and tools for making selections within them. Let’s import our packages to get started (I’m sure you’ve installed them by now!).
import pandas as pd
import numpy as np
Our table below has a scout report on 4 different players’ shooting, passing and defending skills.
PlayerList = ["Pagbo","Grazemen","Cantay","Ravane"]
SkillList=["Shooting","Passing","Defending"]
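#For this example, we have a random number generator for our scout
#I wouldn't recommend this for an actual team
#randint excludes the upper bound, so scores run from 1 to 9
#(4,3) asks for 4 rows (players) and 3 columns (skills)
ScoresArray = np.random.randint(1,10,(4,3))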
df = pd.DataFrame(data=ScoresArray, index=PlayerList, columns=SkillList)
df
In this example, pd.DataFrame takes three arguments to build a fully labelled DataFrame: data is the values that make up the body, index holds the name of each row (running down the side) and columns holds the name of each column (running along the top).
There are other ways to create DataFrames, but this will serve us perfectly for now.
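As a minimal sketch of one of those other ways, we can pass a dictionary whose keys become the column names. The scores below are made-up fixed values rather than the scout's random ones, and the name df_from_dict is just for illustration.

```python
import pandas as pd

# Hypothetical fixed scores (not from the scout's random generator)
scores = {
    "Shooting": [7, 8, 3, 2],
    "Passing": [8, 6, 7, 5],
    "Defending": [5, 3, 9, 9],
}
players = ["Pagbo", "Grazemen", "Cantay", "Ravane"]

# Each dictionary key becomes a column; index labels the rows
df_from_dict = pd.DataFrame(scores, index=players)

print(df_from_dict)
print(df_from_dict.shape)  # (number of rows, number of columns)
```

Using fixed values like this is handy when you want the same table every time you run the code.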
You’ve come a long way from lists and individual values!
Selecting and indexing
You can probably guess how we select individual values and groups:
#Square brackets for columns
#If the result looks familiar, that's because DataFrame columns are series!
df['Shooting']
#For rows, we use .loc if we use a name
#Turns out that DataFrame rows are also series!
df.loc['Pagbo']
#Or if we use an index number, .iloc
#The slice 1:3 returns the rows at positions 1 and 2 (the end is excluded)
df.iloc[1:3]
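These selections can also be combined. Here is a small sketch, assuming a DataFrame with made-up fixed scores (df_demo is a hypothetical name), that picks several columns at once and a single value by row and column label.

```python
import pandas as pd

# Hypothetical fixed scores so the results are repeatable
df_demo = pd.DataFrame(
    [[7, 8, 5], [8, 6, 3], [3, 7, 9], [2, 5, 9]],
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
    columns=["Shooting", "Passing", "Defending"],
)

# A list of column names returns a smaller DataFrame
two_skills = df_demo[["Shooting", "Passing"]]

# .loc takes a row label and a column label to pick a single value
pagbo_passing = df_demo.loc["Pagbo", "Passing"]

# .iloc slices by position; the end position is excluded
middle_rows = df_demo.iloc[1:3]

print(two_skills)
print(pagbo_passing)
print(middle_rows)
```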
Creating and removing columns/rows
DataFrames make it really easy for us to be flexible with our datasets. Let’s ask our scout for their thoughts on more players and skills.
#Scout, what about their communication?
df['Communication'] = np.random.randint(1,10,4)
df
To add a new column, we refer to it with square brackets and a new name, then fill it with one value per row (here a NumPy array, though a list or series works too). Remember, our scout uses random numbers as scouting scores.
Our new manager doesn’t care about defending – they want these scores removed from the report. The ‘.drop’ method makes this easy:
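A new column does not have to come from random numbers. As a rough sketch with made-up scores, it can be filled from a plain list or calculated from columns we already have. The 'Attacking' column here is a hypothetical example, not part of the scout report.

```python
import pandas as pd

# Hypothetical fixed scores
df_demo = pd.DataFrame(
    {"Shooting": [7, 8, 3, 2], "Passing": [8, 6, 7, 5]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# New column from a plain list: one value per row, in row order
df_demo["Communication"] = [6, 4, 9, 7]

# New column calculated from existing columns, row by row
df_demo["Attacking"] = df_demo["Shooting"] + df_demo["Passing"]

print(df_demo)
```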
#axis=1 refers to columns
df = df.drop('Defending',axis=1)
df
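One thing worth noticing is why we write df = df.drop(...). As this small sketch with made-up scores suggests, '.drop' hands back a new DataFrame and leaves the original alone, so the change only sticks if we save the result. The columns= keyword shown here is an alternative to axis=1.

```python
import pandas as pd

# Hypothetical fixed scores
df_demo = pd.DataFrame(
    [[7, 8, 5], [8, 6, 3]],
    index=["Pagbo", "Grazemen"],
    columns=["Shooting", "Passing", "Defending"],
)

# .drop returns a new DataFrame without the column
without_defending = df_demo.drop("Defending", axis=1)

# columns= does the same job without needing to remember the axis number
same_result = df_demo.drop(columns="Defending")

print(without_defending)
print(df_demo.columns)  # the original still has 'Defending'
```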
Great job adding and removing columns!
Turns out, though, that our scout didn’t do their homework on the players’ transfer fees. Grazemen is far too expensive and we need to swap him for another player – Mogez.
For rows, we use ‘.drop’ with the axis argument set to 0:
df = df.drop('Grazemen',axis=0)
df
And to add a new one, we refer to our new row with ‘.loc’ (just like we did to refer to an existing one earlier). We then give this new row a list or series of values, one for each column. Once again, we just use random numbers to fill the row here.
df.loc['Mogez'] = np.random.randint(1,10,3)
df
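Here is the same player swap as a short sketch with made-up fixed scores, so you can check the result by eye. The values for the new row are purely illustrative.

```python
import pandas as pd

# Hypothetical fixed scores
df_demo = pd.DataFrame(
    {"Shooting": [7, 8, 3], "Passing": [8, 6, 7], "Communication": [6, 4, 9]},
    index=["Pagbo", "Grazemen", "Cantay"],
)

# Remove the row labelled 'Grazemen'
df_demo = df_demo.drop("Grazemen", axis=0)

# Add a new row: one value per column, in column order
df_demo.loc["Mogez"] = [5, 7, 8]

print(df_demo)
```

The new row is added at the bottom of the DataFrame.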
Conditional selection
When we worked with series, we used a true or false condition to select the data that we wanted to see. We use the exact same logic here. Let’s see where players’ attributes are above 5:
df>5
Cool, we can see which players meet our criteria. You’ll notice that this returns a DataFrame of true or false values. Just like with series, we can use these booleans to return a DataFrame according to our criteria.
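To see what happens when we use the whole boolean DataFrame as a filter, here is a small sketch with made-up scores. Values that fail the test are kept in place but shown as NaN (missing).

```python
import pandas as pd

# Hypothetical fixed scores
df_demo = pd.DataFrame(
    {"Shooting": [7, 3], "Passing": [8, 5]},
    index=["Pagbo", "Cantay"],
)

# A DataFrame of True/False values, the same shape as df_demo
mask = df_demo > 5

# Values where the mask is False are replaced with NaN
masked = df_demo[mask]

print(mask)
print(masked)
```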
Let’s apply this to a column (which we already know is just a series):
df['Shooting']>5
As expected, we have a series of boolean values. If we pass this boolean series to the DataFrame in square brackets, we get back only the rows where the condition is true.
Therefore, if the coach is asking for players with great shooting skills, we can easily filter our DataFrame.
df[df['Shooting']>5]
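If the coach wants more than one thing at once, conditions can be combined. In this sketch with made-up scores, & means "and" and | means "or", and each condition needs its own set of brackets.

```python
import pandas as pd

# Hypothetical fixed scores
df_demo = pd.DataFrame(
    {"Shooting": [7, 8, 3, 2], "Passing": [8, 4, 7, 5]},
    index=["Pagbo", "Grazemen", "Cantay", "Ravane"],
)

# Players who score above 5 for shooting AND passing
good_at_both = df_demo[(df_demo["Shooting"] > 5) & (df_demo["Passing"] > 5)]

# Players who score above 5 for shooting OR passing
good_at_either = df_demo[(df_demo["Shooting"] > 5) | (df_demo["Passing"] > 5)]

print(good_at_both)
print(good_at_either)
```

Python's own and/or keywords will raise an error here, which is why pandas uses the & and | symbols instead.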
Summary
Great job getting to understand DataFrames – the tool that will underpin our analysis moving forward.
You have created them, added new rows and columns, then filtered them according to your criteria.
There is lots more to learn about DataFrames, so read around the topic and pick up what you can from other articles that use them.