
Posted on Sun 11 July 2021 under Data Science

Data Fluent for PostgreSQL

For a dataset to be useful, it needs to be understood. From a data engineering perspective, having the number of rows, columns and disk requirements broken down by table lets you estimate hardware requirements for any given workload, and often reveals which tables are dimension tables and which are fact tables.

Having a row count broken down by month across all date columns will give you an idea of the growth and amount of change within a given dataset.

The above is a part of any discovery piece I do for my clients. In the past, I'd generate an Excel sheet using iPython and a few packages. But last week, I released Data Fluent for PostgreSQL, a Python package that takes that workflow and makes it open source under an MIT license and freely available via PyPI and GitHub. This software helps you build a better understanding of your data in PostgreSQL.

Below are two screenshots of the Excel-based reports it can generate.

Data Fluent's Metrics Worksheet

The above shows seven tables within a PostgreSQL database. Each table's column, row and byte count is given. There is an additional disk space requirements column where the byte count is converted into a human-readable form.
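Data Fluent's own queries aren't reproduced in this post, but a minimal sketch of the kind of PostgreSQL catalog queries behind such a worksheet might look like the following. The table name passed in below is illustrative.

```python
# A sketch of catalog queries that yield per-table column, row and byte
# counts; Data Fluent's actual implementation may differ.
TABLES_SQL = """
    SELECT table_name
    FROM   information_schema.tables
    WHERE  table_schema = 'public'
      AND  table_type   = 'BASE TABLE'
"""


def metrics_queries(table_name):
    """Return the SQL needed for one row of the Metrics worksheet."""
    return {
        'columns': "SELECT count(*) FROM information_schema.columns "
                   "WHERE table_name = '%s'" % table_name,
        'rows':    'SELECT count(*) FROM "%s"' % table_name,
        'bytes':   "SELECT pg_total_relation_size('%s')" % table_name,
    }
```

Note that pg_total_relation_size() counts the table's heap, indexes and TOAST data together, which is what you want when sizing hardware.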

Since the above screenshot was taken, the human-readable size has been converted from base 10 to base 2. This means 3,956,736 bytes will read as 3.77 MiB instead of 3.96 MB.
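The conversion itself is simple. Below is a sketch of a formatter, not humanfriendly's actual code, that renders the same byte count both ways.

```python
def humanize(num_bytes, base=2):
    """Render a byte count in human-readable form.

    base=2 gives binary units (KiB, MiB, ...), base=10 decimal (KB, MB, ...).
    """
    step = 1024.0 if base == 2 else 1000.0
    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB'] if base == 2 else \
            ['B', 'KB', 'MB', 'GB', 'TB']

    for unit in units:
        if num_bytes < step or unit == units[-1]:
            return '%.2f %s' % (num_bytes, unit)
        num_bytes /= step
```

humanize(3956736) returns '3.77 MiB' while humanize(3956736, base=10) returns '3.96 MB', matching the figures above.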

Data Fluent's Time Distribution Worksheet

The above shows the row count broken down by year and month for every date and timestamp field within the seven tables.
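The aggregation itself is done with SQL and Pandas under the hood, but a standard-library sketch of the same (year, month) bucketing, over a few made-up rows, shows the shape of the calculation.

```python
from collections import Counter
from datetime import date

# Hypothetical values from one date column, standing in for a real table.
rows = [date(2020, 11, 3), date(2020, 11, 17), date(2020, 12, 1),
        date(2021, 1, 9)]

# Row count per (year, month), as in the Time Distributions worksheet.
distribution = Counter((d.year, d.month) for d in rows)
```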

In this post, I'll walk through an example workload and discuss the 3rd-party libraries that this package glues together.

Installing Data Fluent

If you are running Ubuntu 20.04, the following will install PostgreSQL 13, the latest stable version at the time of writing, along with Python 3.8.5 and Git.

$ wget -qO- \
    https://www.postgresql.org/media/keys/ACCC4CF8.asc \
        | sudo apt-key add -
$ echo "deb http://apt.postgresql.org/pub/repos/apt/ focal-pgdg main" \
    | sudo tee /etc/apt/sources.list.d/pgdg.list

$ sudo apt update
$ sudo apt install \
    git \
    python3-pip \
    python3-virtualenv \
    postgresql-13 \
    postgresql-client-13

If you want to run this software on macOS instead, after installing brew, a popular command-line package manager for macOS, run the following.

$ brew install \
    git \
    postgresql \
    virtualenv

Note, virtualenv has a Python 3 dependency so it'll install Python as well if it isn't already present.

Then, regardless of platform, set up a virtual environment for Data Fluent. The virtual environment will ensure its Python dependencies won't interfere with other Python-based projects on your system.

$ virtualenv ~/.fluency
$ source ~/.fluency/bin/activate

With the virtual environment set up and activated, use Python's built-in package manager pip to install Data Fluent.

$ python3 -m pip install datafluent

Analysing Data with Data Fluent

FiveThirtyEight is a journalism website that focuses on opinion poll analysis, politics, economics and sports blogging. They often share their data on GitHub, and their datasets can make for good examples when trying out new tooling.

The following will clone their data repository.

$ git clone https://github.com/fivethirtyeight/data.git ~/538data

Make sure you can access a PostgreSQL database on your machine. Below I'll grant access to my account on my Ubuntu 20 machine. Please adjust the username and password for your system.

$ sudo -u postgres \
    bash -c "psql -c \"CREATE USER mark
                       WITH PASSWORD 'test'
                       SUPERUSER;\""

With access set up, I created a PostgreSQL database called intel.

$ createdb intel

I'll install csvkit, which includes csvsql, a tool that can read CSV files, infer the most granular data type of each column, and populate the data into a table.

To install csvkit's dependencies on Ubuntu 20.04, run:

$ sudo apt install \
    libicu-dev \
    pkg-config

To install csvkit's dependencies on macOS, run:

$ brew install icu4c

Then, regardless of platform, run:

$ python3 -m pip install csvkit

I'll import one of the datasets within FiveThirtyEight's repository. Note, because the dates within this dataset are not in YYYY-MM-DD format, I needed to override the date parsing so that csvsql would infer the MM/DD/YYYY format properly.

$ csvsql --db postgresql:///intel \
         --insert ~/538data/congress-generic-ballot/generic_topline_historical.csv \
         --datetime-format="%m/%d/%Y"
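The %m/%d/%Y pattern uses the same directive syntax as Python's datetime.strptime. A quick stdlib check with a made-up date shows what that pattern accepts.

```python
from datetime import datetime

# '11/03/2020' is a fabricated example in the dataset's MM/DD/YYYY style.
parsed = datetime.strptime('11/03/2020', '%m/%d/%Y')
```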

The database URL will be parsed by SQLAlchemy. For an explanation of the syntax, please see their Database URLs documentation.
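As a rough guide to the anatomy, here is the standard library's generic URL parser pulling apart a fully-specified URL; the credentials below are hypothetical and SQLAlchemy does its own parsing internally. The postgresql:///intel form used above simply omits the host portion, falling back to the local socket and current user.

```python
from urllib.parse import urlsplit

# postgresql://user:password@host:port/database
url = urlsplit('postgresql://mark:test@localhost:5432/intel')
```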

I'll then run Data Fluent which will generate a report in Excel format.

$ datafluent --url postgresql:///intel

The above will produce a fluency.xlsx file with two worksheets: Metrics and Time Distributions. This report will resemble the screenshots at the top of this post.

Data Fluent's Underlying Packages

As of this writing, Data Fluent is a little over 100 lines of Python code. This is down to heavy 3rd-party library usage.

SQLAlchemy, a SQL toolkit for Python, is used for all communications with PostgreSQL. It's one of the older projects used, having been started by Mike Bayer, who now works for Red Hat, back in 2005.

There are six PostgreSQL drivers supported by SQLAlchemy; I've chosen psycopg2-binary as the default dependency in Data Fluent. This package is a libpq wrapper written in ~13K lines of C primarily by Daniele Varrazzo. It has excellent support for Python 3, Unicode, concurrency, asynchronous communication and handling a large number of cursors.

Christopher Groskopf began the csvkit project back in 2011. He also took on a role at FiveThirtyEight in 2019. The csvkit project is full of very useful CLI-based tools for dealing with CSV files and has been stable from its earliest releases. Data Fluent doesn't name csvkit as a direct dependency but it is instrumental in importing CSV files into PostgreSQL with minimal fuss. Amazingly, csvkit is made up of less than 4K lines of Python.

Much of the data wrangling happens using Pandas. Wes McKinney started the project back in 2009 and it is largely responsible for the popularity of Python as a language for data science. In JetBrains' 2020 Python Survey, data analysis was the top reason respondents gave for using Python, and Pandas was the second most used data science framework. Pandas' 125K lines of Python and C code, produced by almost 2,400 developers, have created a moat for Python in the data world.

Pandas uses John McNamara's xlsxwriter package which generates Excel files like the one shown at the top of this post. Modern Excel files are made up of XML files wrapped in a PKZip container and compressed using DEFLATE. This makes them easier to generate than the older Excel format Microsoft used prior to 2007. Excluding tests and examples, xlsxwriter is made up of ~12,500 lines of Python.
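You can see that container format with nothing but the standard library. The toy archive below holds a single XML part and is nowhere near a valid workbook, but it shows the PKZip-plus-DEFLATE layering.

```python
import io
import zipfile

buf = io.BytesIO()

with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as archive:
    # Real .xlsx files contain many parts: [Content_Types].xml,
    # xl/workbook.xml, one xl/worksheets/sheetN.xml per worksheet, etc.
    archive.writestr('xl/worksheets/sheet1.xml', '<worksheet/>')

names = zipfile.ZipFile(buf).namelist()
```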

Data Fluent's CLI is produced using typer, a wrapper that sits above Click, a CLI toolkit for Python. Click has been around for some time and supports not just Python 3 but Python 2.7 as well. Recent releases of Python 3 include features that allow typer to offer a much more terse API. The typer project was only started in December of 2019 and matured in less than a year, but the resulting ~1,800 lines of Python have made a world of difference to the terseness of every CLI project I've kicked off in the last year.

Sebastián Ramírez, the primary developer of typer, is also the creator of FastAPI. FastAPI was listed as the third most popular Python web framework in JetBrains' 2020 survey, despite the project only kicking off in December of 2018. It's great to see newcomers pushing the quality of Python packages forward so quickly.

Finally, humanfriendly, a Python package developed by Peter Odding, is used for producing human-readable byte sizes. I've used this package since the early 2010s when I was doing a lot of Django projects. The project itself is less than 3K lines of code but is a massive time saver.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2021 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.