At some point, you will come across some really useful data in a format other than a CSV. One format might be an SQLite database, containing one or more related data tables. This 5-minute Extra Time article will run through extracting data from SQLite tables into a dataframe for further analysis. Firstly, we will connect to the database, then look at queries to pull the information out. Finally, we will look at joining information from two tables into one dataframe.
This article will look at connecting to the European Football Database on Kaggle, taking inspiration from many of the kernels on there.
Let’s import the sqlite3 and pandas modules and get started.
import sqlite3
import pandas as pd
Connecting to the database
Firstly, we need to establish a connection to the SQLite file. We do this with the ‘connect()’ function from the sqlite3 module, passing it the location of the file on your machine.
Our database variable holds the location of the file, and the conn variable will hold the connection to the database. Using variables like this makes our code easier to refer to later in the article and easier to change in the future.
database = "soccer.sqlite"
conn = sqlite3.connect(database)
Well done, you’ve connected to an SQLite database! We are now set to call for data from it and store it in a dataframe.
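Before pulling any data, you might want to check what the file actually contains. Every SQLite database keeps a list of its tables in a built-in ‘sqlite_master’ table, so a quick sketch like the one below will list them – the table names you get back will depend on the file you have downloaded.
#List the tables held in the database
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)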
Saving data from sqlite3 to a dataframe
This time, we will use pandas to read from our connection with the ‘read_sql’ function – passing it some SQL code that tells it what data to pull, along with our connection variable. The end result will be a dataframe containing the data that the SQL code calls for.
SQL is a very accessible language that you should definitely spend some time getting a basic handle on. You can find a glossary of the language here.
For readability, let’s assign our SQL code to a variable called query. We will also use a very simple command to call for all of the data from a table called ‘Player’ in the database:
#The * refers to all available data
query = "SELECT * FROM Player"
players = pd.read_sql(query, conn)
players.head()
There you have it, a dataframe ready for your analysis from an SQLite database – great work!
As a next step, let’s use another example – this time developing our query a bit. We now, for whatever reason, want to find only players taller than Aaron Cresswell (170.18 cm). Use the ‘WHERE’ keyword to add a clause:
query2 = "SELECT * FROM Player WHERE height > 170.18"
players2 = pd.read_sql(query2, conn)
players2.head()
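If you find yourself running the same query with different values, you can let pandas and sqlite3 slot them in for you rather than building the string by hand. Below is just a sketch of the same height filter, using the ‘?’ placeholder that sqlite3 understands and the ‘params’ argument of ‘read_sql’ – the query_params and players_tall names are simply illustrative.
#Same filter as above, with the height value passed in separately
query_params = "SELECT * FROM Player WHERE height > ?"
players_tall = pd.read_sql(query_params, conn, params=(170.18,))
players_tall.head()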
Joining multiple tables in sqlite3 to one dataframe
Ramping up the complexity, we can also make calls to different tables in the database and join them together into one dataframe.
In short, our logic will be to match the player_api_id seen above with the same ID in a different table. Where these IDs match, we then join the columns from each table together into one dataframe.
Therefore, our query will select from one table – which we alias as ‘a’. We will then join this to a second selection (aliased as ‘b’) with an inner join (check out other join types here), and state which columns we are joining on.
Take your time to see how we express this in the query:
query3 = """SELECT * FROM Player_Attributes a
INNER JOIN (SELECT player_name, birthday, player_api_id AS p_id, height, weight FROM Player)
b ON a.player_api_id = b.p_id;"""
players3 = pd.read_sql(query3, conn)
players3.head()
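For comparison, you could also do this join on the pandas side: pull each table into its own dataframe and merge them on the shared ID column. A rough equivalent of the query above, reusing the players dataframe from earlier, might look like this:
#Pull the attributes table, then merge with the player details on the shared ID
attributes = pd.read_sql("SELECT * FROM Player_Attributes", conn)
players_joined = attributes.merge(
    players[["player_api_id", "player_name", "birthday", "height", "weight"]],
    on="player_api_id", how="inner")
players_joined.head()
Once you’re finished pulling data, it is also good practice to tidy up by closing the connection with conn.close().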
Summary
As with lots of our other tutorial topics, Python takes a complex task and makes it achievable with brief and accessible language. We have looked at how we initially connect to an SQLite database, how we define SQL queries to pull data from it and even how we can join data from multiple tables into one dataframe.
For next steps, look at how you can dive into the data with our analysis crash course, or learn how to pull Kaggle datasets like the one used above through the Kaggle API.