At some point, you will come across some really useful data in a format other than a CSV. One format might be an SQLite database, containing one or more related data tables. This 5-minute Extra Time article will run through extracting data from SQLite tables into a dataframe for further analysis. Firstly, we will connect to the database, then look at queries to pull the information out. Finally, we will look at joining information from two tables into one dataframe.
This article will look at connecting to the European Football Database on Kaggle, taking inspiration from many of the kernels on there.
Let’s import the sqlite3 and pandas modules and get started.
import sqlite3
import pandas as pd
Connecting to the database
Firstly, we need to establish a connection to the SQLite file. We do this with the ‘connect()’ function from the sqlite3 module, passing it the location of the file on your machine.
Our database variable holds the location of the file, and the conn variable will hold the connection to the database. Using variables like this makes our code easier to refer to later in the article and easier to change in the future.
database = "soccer.sqlite"
conn = sqlite3.connect(database)
Well done, you’ve connected to an SQLite database! We are now set to call for data from it and store it in a dataframe.
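Before pulling any data, you might want to check what the file actually contains. Every SQLite database keeps a list of its tables in a built-in ‘sqlite_master’ table, so a quick sketch like the one below will list them – the table names you get back will depend on the file you have downloaded.
#List the tables held in the database
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)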
Saving data from sqlite3 to a dataframe
This time, we will use pandas to read from our connection with the ‘read_sql’ function – passing it some SQL code that tells it what data to pull, along with our connection variable. The end result will be a dataframe containing the data that the SQL code calls for.
SQL is a very accessible language that you should definitely spend some time getting a basic handle on. You can find a glossary of the language here.
For readability, let’s assign our SQL code to a variable called query. We will also use a very simple command to call for all of the data from a table called ‘Player’ in the database:
#The * refers to all available data
query = "SELECT * FROM Player"
players = pd.read_sql(query, conn)
players.head()
There you have it, a dataframe ready for your analysis from an SQLite database – great work!
As a next step, let’s use another example – this time developing our query a bit. We now, for whatever reason, want to find only players taller than Aaron Cresswell (170.18 cm). Use the ‘WHERE’ keyword to add a clause:
query2 = "SELECT * FROM Player WHERE height > 170.18"
players2 = pd.read_sql(query2, conn)
players2.head()
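If you find yourself running the same query with different values, you can let pandas and sqlite3 slot them in for you rather than building the string by hand. Below is just a sketch of the same height filter, using the ‘?’ placeholder that sqlite3 understands and the ‘params’ argument of ‘read_sql’ – the query_params and players_tall names are simply illustrative.
#Same filter as above, with the height value passed in separately
query_params = "SELECT * FROM Player WHERE height > ?"
players_tall = pd.read_sql(query_params, conn, params=(170.18,))
players_tall.head()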
Joining multiple tables in sqlite3 to one dataframe
Ramping up the complexity, we can also make calls to different tables in the database and join them together into one dataframe.
In short, our logic will be to match the player_api_id seen above with the same ID in a different table. Where these IDs match, we then join the columns from each table together into one dataframe.
Therefore, our query will select from one table – which we alias as ‘a’. We will then join this to a second selection (aliased as ‘b’) with an inner join (check out other join types here), and state which columns we are joining on.
Take your time to see how we express this in the query:
query3 = """SELECT * FROM Player_Attributes a
INNER JOIN (SELECT player_name, birthday, player_api_id AS p_id, height, weight FROM Player)
b ON a.player_api_id = b.p_id;"""
players3 = pd.read_sql(query3, conn)
players3.head()
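For comparison, you could also do this join on the pandas side: pull each table into its own dataframe and merge them on the shared ID column. A rough equivalent of the query above, reusing the players dataframe from earlier, might look like this:
#Pull the attributes table, then merge with the player details on the shared ID
attributes = pd.read_sql("SELECT * FROM Player_Attributes", conn)
players_joined = attributes.merge(
    players[["player_api_id", "player_name", "birthday", "height", "weight"]],
    on="player_api_id", how="inner")
players_joined.head()
Once you’re finished pulling data, it is also good practice to tidy up by closing the connection with conn.close().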
Summary
As with lots of our other tutorial topics, Python takes a complex task and makes it achievable with brief and accessible language. We have looked at how we initially connect to an SQLite database, how we define SQL queries to pull data from it and even how we can join data from multiple tables into one dataframe.
For next steps, look at how you can dive into the data with our analysis crash course, or learn how to pull Kaggle datasets like the one used above through the Kaggle API.