Connecting to SQLite Databases and Creating Dataframes in Python

At some point, you will come across some really useful data in a format other than a csv. One format might be an SQLite database, containing one or more related data tables. This 5-minute Extra Time article will run through extracting data from SQLite tables into a dataframe for further analysis. Firstly, we will connect to the databse, then look at queries to pull the information out. Finally, we will look at joining information from two tables into one dataframe.

This article will look at connecting to the European Football Database on Kaggle, taking inspiration from many of the kernels on there.

Let’s import the sqlite3 and pandas modules and get started.

In [1]:
import sqlite3
import pandas as pd

Connecting to the database

Firstly, we need to establish a connection to the sqlite file. We do this with the ‘connect()’ function from the sqlite3 module, passing it the location of the file on your machine.

Our database variable holds the location of the file, and the conn variable will be assigned to holding the connection to the database. Using variables like this will make our code easier to refer to later in the article and easier to change in the future.

In [2]:
database = "soccer.sqlite"
conn = sqlite3.connect(database)

Well done, you’ve connected to an sqlite database! We are now set to call for data from it and store it into a dataframe.

Saving data from sqlite3 to a dataframe

This time, we will use pandas to read from our connection with the ‘read_sql’ function – passing the connected database variable and some SQL code that tells us what data to pull. The end result will be a dataframe containing the data that the SQL code calls for.

SQL is a very accessible language that you should definitely spend some time getting a basic handle on. You can find a glossary of the language here.

For readability, let’s assign our SQL code to a variable called query. We will also use a very simple command to call for all of the data from a table called ‘Player’ in the database:

In [3]:
#The * refers to all available data
query = "SELECT * FROM Player"

players = pd.read_sql(query, conn)

players.head()
Out[3]:
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 2 155782 Aaron Cresswell 189615 1989-12-15 00:00:00 170.18 146
2 3 162549 Aaron Doran 186170 1991-05-13 00:00:00 170.18 163
3 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
4 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154

There you have it, a dataframe ready for your analysis from an SQLite database – great work!

As a next step, let’s use another example – this time developing our query a bit. We now, for whatever reason, want to only find players taller than Aaron Cresswell (170.18 cm). Use the ‘WHERE’ keyword to add a clause:

In [4]:
query2 = "SELECT * FROM Player WHERE height > 170.18"

players2 = pd.read_sql(query2, conn)

players2.head()
Out[4]:
id player_api_id player_name player_fifa_api_id birthday height weight
0 1 505942 Aaron Appindangoye 218353 1992-02-29 00:00:00 182.88 187
1 4 30572 Aaron Galindo 140161 1982-05-08 00:00:00 182.88 198
2 5 23780 Aaron Hughes 17725 1979-11-08 00:00:00 182.88 154
3 6 27316 Aaron Hunt 158138 1986-09-04 00:00:00 182.88 161
4 7 564793 Aaron Kuhl 221280 1996-01-30 00:00:00 172.72 146

Joining multiple tables in sqlite3 to one dataframe

Ramping up the complexity, we can also make calls to different tables in the dataframe and join them together into one dataframe.

In short, our logic will be to match the player_api_id seen above, with the same number in a different table. Where these IDs match, we then join the columns from each table together into one dataframe.

Therefore, our query will selectfrom one table – which we note as ‘a’. We will then join this with an inner join (check out other different join types here), and state which columns we are joining on.

Take your time to see how we express this in the query:

In [6]:
query3 = """SELECT * FROM Player_Attributes a
 INNER JOIN (SELECT player_name, birthday, player_api_id AS p_id, height, weight FROM Player)
 b ON a.player_api_id = b.p_id;"""

players3 = pd.read_sql(query3, conn)

players3.head()
Out[6]:
id player_fifa_api_id player_api_id date overall_rating potential preferred_foot attacking_work_rate defensive_work_rate crossing gk_diving gk_handling gk_kicking gk_positioning gk_reflexes player_name birthday p_id height weight
0 1 218353 505942 2016-02-18 00:00:00 67.0 71.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
1 2 218353 505942 2015-11-19 00:00:00 67.0 71.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
2 3 218353 505942 2015-09-21 00:00:00 62.0 66.0 right medium medium 49.0 6.0 11.0 10.0 8.0 8.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
3 4 218353 505942 2015-03-20 00:00:00 61.0 65.0 right medium medium 48.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187
4 5 218353 505942 2007-02-22 00:00:00 61.0 65.0 right medium medium 48.0 5.0 10.0 9.0 7.0 7.0 Aaron Appindangoye 1992-02-29 00:00:00 505942 182.88 187

5 rows × 47 columns

Summary

As with lots of our other tutorial topics, Python takes a complex task and makes it achievable with both brief and accessible language. We have looked at how we initially connect to a sqlite database, how we define SQL queries to pull data from it and even how we can join data from multiple tables into one dataframe.

For next steps, look at how you can dive into the data with our analysis crash course, or learn how to pull Kaggle datasets like the one used above through the Kaggle API.