For a dataset to be useful, it needs to be understood. From a data engineering perspective, having the number of rows, columns and disk requirements broken down by table allows you to understand potential hardware requirements for any given workload and often understand which tables are dimensional tables and which are fact tables.
Having a row count broken down by month across all date columns will give you an idea of the growth and amount of change within a given dataset.
The above is a part of any discovery piece I do for my clients. In the past, I'd generate an Excel sheet using iPython and a few packages. But last week, I released Data Fluent for PostgreSQL, a Python package that takes that workflow and makes it open source under an MIT license and freely available via PyPI and GitHub. This software helps you build a better understanding of your data in PostgreSQL.
Below are two screenshots of the Excel-based reports it can generate.
The above shows seven tables within a PostgreSQL database. Each table's column, row and byte count is given. There is an additional disk space requirements column where the byte count is converted into a human-readable form.
Since the above screenshot was taken, the human-readable size has been converted from base 10 to base 2. This means 3,956,736 bytes will read as 3.77 MiB instead of 3.96 MB.
The above shows the row count broken down by year and month for every date and timestamp field within the seven tables.
In this post, I'll walk through an example workload and discuss the 3rd-party libraries that this package glues together.
Installing Data Fluent
If you are running Ubuntu 20, the following will install the latest stable version of PostgreSQL, version 13, Python 3.8.5 and Git.
$ wget -qO- \ https://www.postgresql.org/media/keys/ACCC4CF8.asc \ | sudo apt-key add - $ echo "deb http://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main" \ | sudo tee /etc/apt/sources.list.d/pgdg.list $ sudo apt update $ sudo apt install \ git \ python3-pip \ python3-virtualenv \ postgresql-13 \ postgresql-client-13
If you want to run this software on macOS instead, after installing brew, a popular command-line package manager for macOS, run the following.
$ brew install \ git \ postgresql \ virtualenv
Note, virtualenv has a Python 3 dependency so it'll install Python as well if it isn't already present.
Then, regardless of platform, set up a virtual environment for Data Fluent. The virtual environment will ensure its Python dependencies won't interfere with other Python-based projects on your system.
$ virtualenv ~/.fluency $ source ~/.fluency/bin/activate
With the virtual environment set up and activated, use Python's built-in package manager pip to install Data Fluent.
$ python3 -m pip install datafluent
Analysing Data with Data Fluent
FiveThirtyEight is a journalism website that focused on opinion poll analysis, politics, economics and sports blogging. They often share their data on GitHub and can make for good example datasets for trying out new tooling.
The following will clone their data repository.
$ git clone https://github.com/fivethirtyeight/data.git ~/538data
Make sure you can access a PostgreSQL database on your machine. Below I'll grant access to my account on my Ubuntu 20 machine. Please adjust the username and password for your system.
$ sudo -u postgres \ bash -c "psql -c \"CREATE USER mark WITH PASSWORD 'test' SUPERUSER;\""
With access setup, I've created a PostgreSQL database called intel.
$ createdb intel
To install Ubuntu 20's dependencies for csvkit run:
$ sudo apt install \ libicu-dev \ pkg-config
To install macOS's dependencies for csvkit run:
$ brew install icu4c
Then, regardless of platform, run:
$ python3 -m pip install csvkit
I'll import one of the datasets within FiveThirtyEight's repository. Note, because the dates within this dataset are not formatted in YYYY-MM-DD format, I needed to override the date parsing so that the MM/DD/YYYY format would be inferred by csvsql properly.
$ csvsql --db postgresql:///intel \ --insert ~/538data/congress-generic-ballot/generic_topline_historical.csv \ --datetime-format="%m/%d/%Y"
I'll then run Data Fluent which will generate a report in Excel format.
$ datafluent --url postgresql:///intel
The above will produce a fluency.xlsx file with two worksheets: Metrics and Time Distributions. This report will resemble the screenshots at the top of this post.
Data Fluent's Underlying Packages
As of this writing, Data Fluent is a little over 100 lines of Python code. This is down to heavy 3rd-party library usage.
There are six PostgreSQL drivers that are supported by SQLAlchemy, I've chosen psycopg2-binary for the default dependency in Data Fluent. This package is a libpq wrapper written in ~13K lines of C primarily by Daniele Varrazzo. It has excellent support for Python 3, Unicode, concurrency, asynchronous communication and handling a large number of cursors.
Christopher Groskopf began the csvkit project back in 2011. He also took on a role at FiveThirtyEight in 2019. The csvkit project is full of very useful CLI-based tools for dealing with CSV files and has been stable from its earliest releases. Data Fluent doesn't name csvkit as a direct dependency but it is instrumental in importing CSV files into PostgreSQL with minimal fuss. Amazingly, csvkit is made up of less than 4K lines of Python.
Much of the data wrangling happens using Pandas. Wes McKinney started the project back in 2009 and it is largely responsible for the popularity of Python as a language for data science. In JetBrain's 2020 Python Survey, data analysis was the top reason why their respondents use Python and Pandas was the second most used data science framework. Pandas' 125K of Python and C code produced by almost 2,400 developers has created a moat for Python in the Data World.
Pandas uses John McNamara's xlsxwriter package which generates Excel files like the one shown at the top of this post. Modern Excel files are made up of XML files wrapped in a PKZip container and compressed using DEFLATE. This makes them easier to generate than the older Excel format Microsoft used prior to 2007. Excluding tests and examples, xlsxwriter is made up of ~12,500 lines of Python.
Data Fluent's CLI is produced using typer, a wrapper that sits above Click, a CLI toolkit for Python. Click has been around for some time and supports not just Python 3 but Python 2.7 as well. There are new features in recent releases of Python 3 which allow typer to create a much more terse API. The typer project was only started in December of 2019 and development finished in less than a year but the resulting ~1,800 lines of Python have made a world of difference to the terseness of any CLI project I've kicked off in the last year.
Sebastián Ramírez, the primary developer of typer, is also the creator of FastAPI. FastAPI was listed as the third most popular Python Web Framework in JetBrain's 2020 survey. This despite the project only kicking off in December of 2018. It's great to see newcomers pushing the quality of Python packages forward so quickly.
Finally, humanfriendly, a Python package developed by Peter Odding-Thomson, is used for producing human-readable byte sizes. I've used this package since the early 2010s when I was doing a lot of Django projects. The project itself is less than 3K lines of code but is a massive time saver.