SQLite is a self-contained, serverless SQL database. Dr. Richard Hipp, the creator of SQLite, first released the software on the 17th of August, 2000. Since then it has gone on to be the second most deployed piece of software in the world. It's used in systems as important as the Airbus A350 so it comes as no surprise the tests for SQLite 3 are aviation-grade. The software itself is very small, the amd64 Debian client and library package is 765 KB when compressed for distribution and 2.3 MB when fully installed. The software is licensed under a very promiscuous license: Public Domain.
Installing the SQLite 3 Client
Though there is no server software needed for SQLite 3, there is a client that allows you to have command line access to the underlying database library. The following installs the entirety of the SQLite 3 offering. I ran the following on a fresh install of Ubuntu 16.04.2 LTS.
$ sudo apt install sqlite3
Customising the Client
You can change the default behaviour of the SQLite 3 CLI by editing the ~/.sqliterc files in your home directory. This is handy for saving settings you find yourself using most often. Here's an example where I turn on headers, set the display mode to column, shorten the prompt, attach a timer to every command and print a special character for NULL values.
$ vi ~/.sqliterc
.headers on .mode column .nullvalue ¤ .prompt "> " .timer on
Importing CSV Data
You can import CSV data into SQLite 3 via two commands. The first changes the client into CSV mode and the second will imports data from a CSV file. The default separator expected is a pipe so if you're using another character you can change that setting via the .separator command.
If the destination table doesn't already exist the first row of the CSV files will be used for naming each of the columns. If the table does exist then all of the rows of data will be appended onto the existing table.
As an example I've put together a CSV file with both ASCII and Japanese characters listing the Airports in Wales in the UK.
$ vi airports.csv
都市,IATA,ICAO,空港 Aberporth,,EGFA,Aberporth 空港 Anglesey,,EGOQ,RAF Mona Anglesey,,EGOV,RAF Valley カナーボン,,EGCK,カナーボン空港 カーディフ,CWL,EGFF,カーディフ国際空港 カーディフ,,EGFC,Tremorfa ヘリポート チェスター,CEG,EGNR,Hawarden 空港 Haverfordwest,HAW,EGFE,Haverfordwest 小型飛行場 Llanbedr,,EGOD,Llanbedr 空港 Pembrey,,EGFP,Pembrey 空港 St Athan,DGX,EGDX,RAF Saint Athan スウォンジ,SWS,EGFH,スウォンジ空港 ウェルシュプール,,EGCW,ウェルシュプール空港
I'll launch the SQLite 3 client with a name of a new database called airports.db. This database file doesn't yet exist so SQLite 3 will automatically create it for me.
$ sqlite3 airports.db
I'll switch the client to CSV mode, state that a comma is the separator and then import the airports.csv file.
.mode csv .separator "," .import airports.csv airports
I can then run the schema command on the new airports table and see two of the column names are in Japanese and two are written using ASCII characters.
CREATE TABLE airports( "都市" TEXT, "IATA" TEXT, "ICAO" TEXT, "空港" TEXT );
I can execute queries mixing the Japanese and ASCII characters together without issue.
$ echo "select ICAO, 空港 from airports;" \ | sqlite3 airports.db
EGFA|Aberporth 空港 EGOQ|RAF Mona EGOV|RAF Valley EGCK|カナーボン空港 EGFF|カーディフ国際空港 EGFC|Tremorfa ヘリポート EGNR|Hawarden 空港 EGFE|Haverfordwest 小型飛行場 EGOD|Llanbedr 空港 EGFP|Pembrey 空港 EGDX|RAF Saint Athan EGFH|スウォンジ空港 EGCW|ウェルシュプール空港
I can also dump the database to SQL using a single command.
$ echo ".dump airports" \ | sqlite3 airports.db
PRAGMA foreign_keys=OFF; BEGIN TRANSACTION; CREATE TABLE airports( "都市" TEXT, "IATA" TEXT, "ICAO" TEXT, "空港" TEXT ); INSERT INTO "airports" VALUES('Aberporth','','EGFA','Aberporth 空港'); INSERT INTO "airports" VALUES('Anglesey','','EGOQ','RAF Mona'); INSERT INTO "airports" VALUES('Anglesey','','EGOV','RAF Valley'); INSERT INTO "airports" VALUES('カナーボン','','EGCK','カナーボン空港'); INSERT INTO "airports" VALUES('カーディフ','CWL','EGFF','カーディフ国際空港'); INSERT INTO "airports" VALUES('カーディフ','','EGFC','Tremorfa ヘリポート'); INSERT INTO "airports" VALUES('チェスター','CEG','EGNR','Hawarden 空港'); INSERT INTO "airports" VALUES('Haverfordwest','HAW','EGFE','Haverfordwest 小型飛行場'); INSERT INTO "airports" VALUES('Llanbedr','','EGOD','Llanbedr 空港'); INSERT INTO "airports" VALUES('Pembrey','','EGFP','Pembrey 空港'); INSERT INTO "airports" VALUES('St Athan','DGX','EGDX','RAF Saint Athan'); INSERT INTO "airports" VALUES('スウォンジ','SWS','EGFH','スウォンジ空港'); INSERT INTO "airports" VALUES('ウェルシュプール','','EGCW','ウェルシュプール空港'); COMMIT;
Keep in mind the .db files created could be larger than the data in any of its original forms. While writing this blog I produced a million-row, 12-column CSV comprised mostly of integers, floats and one text field. The GZIP-compressed CSV file was 41 MB, the decompressed CSV was 142 MB and when imported into SQLite 3 the .db file was 165 MB. I was able to GZIP-compress the .db file down to 48 MB but the SQLite 3 client cannot open databases that are GZIP-compressed.
Creating a Database in Memory
Data locality can be greatly improved by storing a SQLite 3 database in memory instead of on disk; you have the speed benefits of an in-memory database with the ease of communicating with it using SQL. Below is an example where I calculate 10 Fibonacci values and store them in a SQLite 3 database residing in memory using Python 3.
$ sudo apt install python3 $ python3
import sqlite3 def fib(n): a, b = 0, 1 for _ in range(n): yield a a, b = b, a + b connection = sqlite3.connect(':memory:') cursor = connection.cursor() with connection: cursor.execute('''CREATE TABLE IF NOT EXISTS fib ( calculated_value INTEGER)''') cursor.executemany('INSERT INTO fib VALUES (?)', [(str(x),) for x in fib(10)]) cursor.execute('SELECT * FROM fib') print(cursor.fetchall()) connection.close()
The above table creation and insert commands are executed within the context of the connection which has the benefit of wrapping the commands in a power-safe, atomic database transaction.
User Defined Functions
You can create user-defined functions in Python that can be executed on data sitting inside a SQLite 3 database. Below I'll create a small SQLite 3 database:
$ sqlite3 urls.db
CREATE TABLE urls (url STRING); INSERT INTO urls VALUES ('https://packages.debian.org/stretch/sqlite3'), ('https://docs.python.org/3/library/sqlite3.html'), ('https://sqlite.org/about.html');
Then I'll create a function in Python that extracts the hostname from a URL and execute it on all the records in the urls table in the database.
import sqlite3 from urllib.parse import urlsplit def hostname(url): return urlsplit(url).netloc connection = sqlite3.connect('urls.db') connection.create_function('hostname', 1, hostname) cursor = connection.cursor() cursor.execute('SELECT hostname(url) FROM urls') print(cursor.fetchall())
The following is the output of the fetchall call.
[(u'packages.debian.org',), (u'docs.python.org',), (u'sqlite.org',)]
Working with multiple databases
The SQLite 3 client is capable of working with multiple databases in a single session. Below I'll launch the client and then attach two databases to the session.
ATTACH 'airports.db' AS airport; ATTACH 'urls.db' AS urls;
Then I'll run the .databases command to output the names and locations of the databases attached to the session.
seq name file --- --------------- ----------------------- 0 main 2 airport /home/mark/airports.db 3 urls /home/mark/urls.db
I can prefix the table names in my queries with the name I've assigned the database.
SELECT COUNT(*) FROM urls.urls;
Visualising Data with Jupyter Notebooks
Jupyter Notebooks are a popular way to view data visualisations. Below I'll walk through the setup and a few minimalist examples of various visualisations run on an example dataset.
First, I'll install a number of system dependencies.
$ sudo apt update $ sudo apt install \ libgeos-dev \ python3-dev \ python3-pip \ python3-tk \ python3-venv
Then I'll create a Python virtual environment so I can isolate my Python dependencies from other projects on my machine. This environment will be called .taxis.
$ pyvenv .taxis $ source .taxis/bin/activate
I'll upgrade the Python package manager "pip" from version 8.1.1, which comes from the installation command above, to version 9.0.1 in this virtual environment.
$ pip install --upgrade pip
I'll then install a number of popular data-related libraries written for Python.
$ pip install \ https://github.com/matplotlib/basemap/archive/v1.1.0.tar.gz \ 'bokeh<0.12.4' \ gmaps \ 'holoviews[extras]' \ jupyter \ pandas \ Pillow
Jupyter Notebooks will expose the working folder on my Linux machine via HTTP so I'll create an isolated working folder away from my other files.
$ mkdir -p ~/jupyter-working $ cd ~/jupyter-working
I'll then enable the gmaps extention and allow for Jupyter to use widgets.
$ jupyter nbextension enable --py --sys-prefix gmaps $ jupyter nbextension enable --py widgetsnbextension
Then I'll launch the Notebook server. Once launched, a URL will be given that contains a token parameter. Paste that entire URL into your web browser in order to bring up the Notebooks interface.
$ jupyter notebook \ --ip=0.0.0.0 \ --NotebookApp.iopub_data_rate_limit=100000000
... Copy/paste this URL into your browser when you connect for the first time, to login with a token: http://0.0.0.0:8888/?token=123...
Before opening the URL I'll create a SQLite 3 database from a CSV file. This dataset is made up of one million randomly-picked records from the 1.1 billion taxi ride dataset I produced in my Billion Taxi Rides in Redshift blog post. I used the following to export the one million record sampling from Hive.
$ hive -e 'SET hive.cli.print.header=true; SELECT trip_id, cab_type, passenger_count, trip_distance, fare_amount, tip_amount, pickup_datetime, dropoff_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude FROM trips WHERE RAND() <= 0.001 DISTRIBUTE BY RAND() SORT BY RAND() LIMIT 1000000' \ | sed 's/[\t]/,/g' \ | gzip \ > trips.csv.gz
The Spark 2.2 & 3 Raspberry Pi 3 Model Bs blog post has the most concise instructions for importing the dataset into Hive. If you're using those instructions on anything other than Raspbian please note the packages names, such as the one for JDK, will probably differ. Note that exporting from CSV-stored data rather than ORC appears to complete 2-3x faster.
Here are the first three lines of that CSV file. Note that the first line contains the column names.
$ gunzip -c trips.csv.gz | head -n3
trip_id,cab_type,passenger_count,trip_distance,fare_amount,tip_amount,pickup_datetime,dropoff_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude 745713518,yellow,1,5.600,20.50,1.00,2013-04-30 13:43:58,2013-04-30 14:04:49,-73.94273100000000,40.79017800000000,-74.00244499999999,40.76083900000000 788379509,yellow,1,1.200,6.00,0.00,2013-07-07 12:24:33,2013-07-07 12:28:52,-73.95807200000000,40.76124600000000,-73.94632400000000,40.77708900000000
I'll decompress the GZIP'ed file and launch SQLite 3. I've added trips.db as a parameter so it'll be created by SQLite 3.
$ gunzip trips.csv.gz $ sqlite3 trips.db
I'll then switch into CSV mode, make sure the separator is a comma and import the CSV file into the trips table.
.mode csv .separator "," .import trips.csv trips
With the data imported I'll open the Notebook URL and create Python 3 Notebook in Jupyter's Web UI. I'll then paste the following requirements into the first cell and then hit shift and enter at the same time to "execute" the cell.
import sqlite3 import pandas as pd import holoviews as hv hv.extension('bokeh') connection = sqlite3.connect('trips.db')
The above will import the Python library for SQLite 3, Pandas, a data manipulation library, Holoviews, a visualisation library and initialise the Bokeh extention for Holoviews. Finally, a connection to the SQLite 3 database with the taxi trips dataset will be established.
In the following example I'll paste in a code sniplet that will produce a heat map showing a breakdown by weekday and hour of the number of taxi journeys in my taxi rides dataset.
%%opts Points [tools=['hover']] (size=5) HeatMap [tools=['hover']] Histogram [tools=['hover']] Layout [shared_axes=False] sql = """SELECT strftime('%w', pickup_datetime) as weekday, strftime('%H', pickup_datetime) as hour, COUNT(*) as cnt FROM trips GROUP BY 1, 2;""" df = pd.read_sql_query(sql, connection) hv.HeatMap(df)
The following will produce a line chart showing the number of taxi trips by date in the dataset.
%matplotlib inline sql = """SELECT date(pickup_datetime) as date, COUNT(*) as cnt FROM trips GROUP BY 1 ORDER BY 1;""" df = pd.read_sql_query(sql, connection) df['date'] = df.date.astype('datetime64[ns]') df.plot(x='date', y='cnt')
For a stacked bar chart comparing yellow and green cab trip counts against one another by month of the year enter the following into a new cell.
%%opts Bars [stack_index=1 xrotation=90 legend_cols=7 show_legend=False show_frame=False tools=['hover']] hv.extension('bokeh', 'matplotlib') sql = """SELECT strftime('%m', pickup_datetime) as month, cab_type, COUNT(*) as cnt FROM trips GROUP BY 1, 2;""" df = pd.read_sql_query(sql, connection) table = hv.Table(df, kdims=[('month', 'month'), ('cab_type', 'cab_type')], vdims=[('cnt', 'cnt')]) table.to.bars(['month', 'cab_type'], 'cnt', )
The following will produce a pie chart showing the proportion of trips based on the hour of the day.
%matplotlib inline sql = """SELECT round(strftime('%H', pickup_datetime)) as hour, count(*) as cnt FROM trips group by 1;""" df = pd.read_sql_query(sql, connection) df.plot(kind='pie', y='cnt', legend=False)
To produce a scatter plot matrix using four numeric fields paste in the following. Note, this could take a few minutes to render. You should see an array output early on followed by the plot much later on.
%matplotlib inline sql = """SELECT round(strftime('%H', pickup_datetime)) as hour, round(trip_distance), round(fare_amount), round(tip_amount) FROM trips;""" df = pd.read_sql_query(sql, connection) pd.plotting.scatter_matrix(df, figsize=(15, 15))
I came across two ways of rendering geographic points on maps. The first is with Matplotlib and Basemap which will work offline and without any need for API keys. The following will plot the pick up points for the taxi journeys in the dataset.
%matplotlib inline import matplotlib.pyplot as plt from mpl_toolkits.basemap import Basemap sql = """SELECT ROUND(pickup_longitude, 3) as long, ROUND(pickup_latitude, 3) as lat, COUNT(*) as cnt FROM trips GROUP BY long, lat""" df = pd.read_sql_query(sql, connection) df = df[pd.to_numeric(df['long'], errors='coerce').notnull()] df = df[pd.to_numeric(df['lat'], errors='coerce').notnull()] df = df.dropna(thresh=1) df.long = df.long.astype(float, errors='ignore').fillna(0.0) df.lat = df.lat.astype(float, errors='ignore').fillna(0.0) plt.figure(figsize=(20, 20)) map = Basemap(projection='merc', llcrnrlat=40, urcrnrlat=42, llcrnrlon=-75, urcrnrlon=-72, resolution='i', area_thresh=50, lat_0=40.78, lon_0=-73.96) map.drawcountries() map.drawcoastlines(linewidth=0.5) map.drawstates() map.bluemarble() lons = df['long'].values lats = df['lat'].values x, y = map(lons, lats) map.plot(x, y, 'ro', markersize=4) plt.show()
The above will look somewhat primitive and won't be interactive.
The following will plot a heatmap on top of a Google Maps Widget. The upsides include the boilerplate code is significantly less than the above, the map itself looks great and comes with the various navigational controls. The downside is that you'll need to create an API key with Google in order to use this module and you'll need to be connected to the internet when you use it.
The other issue is that if there are any latitude/longitude pairs that aren't valid you'll get an error message back rather than the erroneous pairs being ignored. It's rare that a dataset is in perfect condition so you may need to spend some time filtering out bad values.
Note that in this simplistic example I haven't weighed the points based on the aggregated counts for each point. You will see hotspots over Manhattan and NYC's airports but this won't be a completely accurate picture of where all the hotspots are.
import gmaps gmaps.configure(api_key="...") locations = [(float(row['lat']), float(row['long'])) for index, row in df.iterrows() if -80 < float(row['long']) < -70 and 35 < float(row['lat']) < 45] fig = gmaps.Map() fig.add_layer(gmaps.heatmap_layer(locations)) fig
The Python Data Visualisation landscape is very broad, diverse and fragmented at the moment. It's a area of Python I expect to consolidate and mature a lot in the next few years. Jake VanderPlas gave an excellent talk on the subject at PyCon 2017 in Portland, its well worth 30 minutes of your time.
Dumping Pandas DataFrames to SQLite
Pandas DataFrames are great for creating derivative datasets with a minimal amount of code. To top that off, dumping Pandas DataFrames back to SQLite 3 is very straight forward. In this example I'll populate a DataFrame with some CSV data, create a new SQLite 3 database and dump the DataFrame out to that file.
import sqlite3 import pandas connection = sqlite3.connect('trips.db') df = pandas.read_csv('trips.csv', sep=',') df.to_sql('trips', connection, if_exists='append', index=False)
SQLite 3 isn't a toy. It's a powerful SQL-language interface enabler. As storage speeds and single core performance in CPUs increase the amount of data that can be sensibly handled within a reasonable time by SQLite 3 continues to grow.
SQLite 3 is also a very good educational tool. If someone is coming from a closed-source data practitioner background and they're trying to get their head around the open source data science world on Linux (or Windows or Mac OS for that matter) then SQLite 3 provides very little friction to that learning experience.
There isn't one database that solves every problem but there are databases that are very good at solving many problems. I certainly consider SQLite 3 to be one of those databases and I see it as a valuable tool in my toolbox.