Over the past six weeks I've been looking at importing the metadata of 1.1 billion taxi journeys made in New York City between 2009 and 2015 into various data stores. Each of these data stores has been aimed more at analytical than transactional workloads.
Though PostgreSQL does support analytical operations, there isn't any automatic aggregation of field values out of the box: data cubes need to be built and maintained manually, and indices are used to track down and/or exclude blocks of data rather than provide compact summaries that could answer aggregate queries on their own. At the end of the day, aggregate query performance is proportional to the amount of data being looked at.
To answer an aggregate query, PostgreSQL needs to scan either the entire table or the entirety of an index that covers every row in the table.
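To make that concrete, "building a data cube manually" usually means maintaining summary tables yourself. Below is a minimal sketch using a materialised view over the trips table defined later in this post; it answers one aggregate cheaply but has to be refreshed by hand whenever new data arrives.
CREATE MATERIALIZED VIEW trips_by_cab_type AS
    SELECT cab_type,
           count(*) AS trips
    FROM trips
    GROUP BY cab_type;

-- Not maintained automatically; re-run after each import.
REFRESH MATERIALIZED VIEW trips_by_cab_type;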
PostgreSQL's default storage layout keeps data in an uncompressed form, so reading 100 MB/s of data off a disk during a sequential scan won't yield the same number of rows per second as it would from a gzip-compressed CSV file. The 104 GB of denormalised, compressed CSV data I've been using throughout this series of blog posts turns into 281 GB when imported into PostgreSQL.
$ sudo su - postgres -c \
"du -hs ~/9.5/main/* | grep [0-9]G"
281G /var/lib/postgresql/9.5/main/base
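The same kind of figure can be fetched per relation from within PostgreSQL once the trips table described below has been populated; pg_total_relation_size includes indices and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('trips'));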
There are two tools I'll look at to address these issues. The first is the new parallel aggregate support that should appear in PostgreSQL 9.6 and the second is the columnar store extension cstore_fdw from Citus Data.
Please note that, at the time of writing, AWS RDS only supports PostgreSQL 9.4, so parallel aggregate support is out of the question there. The cstore_fdw extension isn't supported on RDS either, so if you're planning on using AWS you'd need to run your own PostgreSQL instances on EC2 to use either of these features.
Install PostgreSQL from Source
Parallel aggregation support was only added to the master branch of PostgreSQL six days ago, so I'm going to compile the master branch in order to try out this feature.
The following was run on a fresh install of Ubuntu 14.04.3 LTS on a machine with the following key components:
- An Intel Core i5-4670K Haswell Quad-Core 3.4 GHz Processor
- 16 GB of RAM
- An 850 GB SSD drive that can perform random I/O up to 98K IOPS
First I'll install the dependencies needed to compile PostgreSQL.
$ sudo apt install \
bison \
flex \
git \
libreadline-dev \
libz-dev \
make
Then I'll check out the codebase at a specific commit ID so that these steps are easier to reproduce.
$ git clone git://git.postgresql.org/git/postgresql.git
$ cd postgresql
$ git checkout 676265
The following will compile and install PostgreSQL and its contributed modules:
$ ./configure && \
make -j$(nproc) && \
sudo make install
$ cd contrib
$ make -j$(nproc) && \
sudo make install
I'll then add a postgres user to the system, create a directory for it to store its data and initialise that data directory.
$ sudo adduser postgres
$ sudo mkdir /usr/local/pgsql/data
$ sudo chown postgres -R \
/usr/local/pgsql/data
$ sudo su - postgres -c \
"/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data"
I'll then install the control script, make sure PostgreSQL launches when the system does and start the PostgreSQL server.
$ sudo cp ~/postgresql/contrib/start-scripts/linux \
/etc/init.d/postgresql
$ sudo chmod +x /etc/init.d/postgresql
$ sudo update-rc.d postgresql defaults
$ sudo /etc/init.d/postgresql start
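As an optional sanity check that the freshly-compiled server is the one answering, I can ask it for its version string as the postgres user (the exact string will depend on the commit that was compiled).
$ sudo su - postgres -c \
    "/usr/local/pgsql/bin/psql postgres -c 'SELECT version();'"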
I'll set the PATH environment variable for my account so that PostgreSQL's binaries directory is checked before any others when commands are called without explicit paths.
$ PATH="/usr/local/pgsql/bin/:$PATH"
$ export PATH
I'll then create a PostgreSQL account for my Unix user and grant it superuser permissions:
$ sudo su - postgres -c \
'echo "CREATE USER mark;
ALTER USER mark WITH SUPERUSER;" | /usr/local/pgsql/bin/psql'
Importing 1.1 Billion Records into PostgreSQL
The following will create a trips database that will contain the taxi trip metadata.
$ createdb trips
I'll then create a table that will store all the taxi trip metadata in a denormalised form.
$ psql trips
CREATE TABLE trips (
trip_id SERIAL,
vendor_id VARCHAR(3),
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL,
store_and_fwd_flag VARCHAR(1),
rate_code_id SMALLINT NOT NULL,
pickup_longitude DECIMAL(18,14),
pickup_latitude DECIMAL(18,14),
dropoff_longitude DECIMAL(18,14),
dropoff_latitude DECIMAL(18,14),
passenger_count SMALLINT NOT NULL DEFAULT '0',
trip_distance DECIMAL(6,3) DEFAULT '0.0',
fare_amount DECIMAL(6,2) DEFAULT '0.0',
extra DECIMAL(6,2) DEFAULT '0.0',
mta_tax DECIMAL(6,2) DEFAULT '0.0',
tip_amount DECIMAL(6,2) DEFAULT '0.0',
tolls_amount DECIMAL(6,2) DEFAULT '0.0',
ehail_fee DECIMAL(6,2) DEFAULT '0.0',
improvement_surcharge DECIMAL(6,2) DEFAULT '0.0',
total_amount DECIMAL(6,2) DEFAULT '0.0',
payment_type VARCHAR(6),
trip_type SMALLINT,
pickup VARCHAR(50),
dropoff VARCHAR(50),
cab_type VARCHAR(6) NOT NULL,
precipitation SMALLINT DEFAULT '0',
snow_depth SMALLINT DEFAULT '0',
snowfall SMALLINT DEFAULT '0',
max_temperature SMALLINT DEFAULT '0',
min_temperature SMALLINT DEFAULT '0',
average_wind_speed SMALLINT DEFAULT '0',
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10),
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13),
pickup_ct2010 VARCHAR(6),
pickup_boroct2010 VARCHAR(7),
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR(4),
pickup_ntaname VARCHAR(56),
pickup_puma VARCHAR(4),
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10),
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13),
dropoff_ct2010 VARCHAR(6),
dropoff_boroct2010 VARCHAR(7),
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR(4),
dropoff_ntaname VARCHAR(56),
dropoff_puma VARCHAR(4),
CONSTRAINT pk_trips PRIMARY KEY (trip_id)
);
I've got 56 gzip files of around 2 GB each containing the metadata of 1.1 billion taxi journeys in a folder called ~/taxi-trips. To generate this data please see the steps in my Billion Taxi Rides in Redshift blog post.
I'll make the postgres user the owner of these files as PostgreSQL will access them using its own account.
$ cd ~/taxi-trips
$ sudo chown postgres trips_x*.csv.gz
The following will import all the data from each of the 56 gzip files into the trips table.
$ screen
$ for filename in *.csv.gz
do
echo $filename
echo "COPY trips
FROM PROGRAM
'gunzip -c `pwd`/$filename'
DELIMITER ',' CSV;" | psql trips
done
The import took a little over an hour on my machine.
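As a quick sanity check that everything arrived, without paying for a full count, the planner's row estimate can be consulted. It's an approximation maintained by VACUUM and ANALYZE, so expect something close to, rather than exactly, 1.1 billion.
SELECT relname,
       reltuples::BIGINT AS estimated_rows
FROM pg_class
WHERE relname = 'trips';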
Benchmarking Parallel Aggregate Support
I've set PostgreSQL to use 4 parallel workers to perform the query benchmarked below.
SET max_parallel_degree = 4;
\timing on
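The SET above only lasts for the current session. If I wanted every new session to pick up the same value I could persist it instead; this is just a sketch and assumes the setting keeps its current name on the master branch.
ALTER SYSTEM SET max_parallel_degree = 4;
SELECT pg_reload_conf();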
The following shows that PostgreSQL will indeed use 4 workers when the query is executed.
EXPLAIN (COSTS OFF)
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
QUERY PLAN
----------------------------------------------------
 Finalize GroupAggregate
   Group Key: cab_type
   ->  Sort
         Sort Key: cab_type
         ->  Gather
               Number of Workers: 4
               ->  Partial HashAggregate
                     Group Key: cab_type
                     ->  Parallel Seq Scan on trips
The following completed in 1 hour 1 minute and 7 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
I've seen the above query take 3.5 hours in the past, so this is an improvement, but the performance still won't hold a candle to a proper OLAP storage engine. Thankfully we can add an OLAP-oriented extension to PostgreSQL.
Citus Data's cstore_fdw Up and Running
The cstore_fdw extension from Citus Data lets PostgreSQL store data in the Optimized Row Columnar (ORC) format. This means the data is laid out on disk as columns instead of rows, and those columns can be compressed.
I've found in the past that data stored as columns tends to end up around half the size it would be as gzip-compressed CSV. That organisation, along with the lower amount of I/O needed to read the data, should give aggregate query performance a huge boost.
The cstore_fdw extension currently only supports PostgreSQL 9.3, 9.4 and 9.5, so for this exercise I'll install PostgreSQL 9.5 on a fresh installation of Ubuntu 14.04.3 LTS.
I'll install PostgreSQL from the official Debian package, but I'll also add the tools needed to compile the cstore_fdw extension from source.
$ echo "deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main 9.5" | \
sudo tee /etc/apt/sources.list.d/postgresql.list
$ gpg --keyserver pgp.mit.edu --recv-keys 7FCC7D46ACCC4CF8
$ gpg --armor --export 7FCC7D46ACCC4CF8 | sudo apt-key add -
$ sudo apt update
$ sudo apt install \
git \
libpq-dev \
libprotobuf-c0-dev \
make \
postgresql-9.5 \
postgresql-server-dev-9.5 \
protobuf-c-compiler
The following will check out the latest version of the cstore_fdw extension, compile it and install it.
$ git clone https://github.com/citusdata/cstore_fdw.git
$ cd cstore_fdw
$ make -j$(nproc) && \
sudo make install
I'll then change PostgreSQL's configuration so that it knows the extension is available to use.
$ sudo su - -c \
"sed \"s/#shared_preload_libraries = ''/shared_preload_libraries = 'cstore_fdw'/g\" /etc/postgresql/9.5/main/postgresql.conf -i"
$ sudo /etc/init.d/postgresql restart
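Once the server is back up, it's worth confirming the library was actually preloaded; this can be checked from any session, for example as the postgres user.
$ sudo su - postgres -c \
    "psql -c 'SHOW shared_preload_libraries;'"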
Since this is a fresh install of Ubuntu, I'll again have to create a superuser account for my Unix user in PostgreSQL before creating the trips database.
$ sudo su - postgres -c \
"psql -c 'CREATE USER mark;
ALTER USER mark WITH SUPERUSER;'"
$ createdb trips
I'll then create the cstore_fdw extension and a foreign server for it in the trips database.
$ sudo su - postgres -c \
"echo 'CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server
FOREIGN DATA WRAPPER cstore_fdw;' | psql trips"
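A quick way to confirm the foreign server registered is to look at the system catalogues from a psql session connected to the trips database.
SELECT srvname,
       fdwname
FROM pg_foreign_server AS srv
JOIN pg_foreign_data_wrapper AS fdw
    ON srv.srvfdw = fdw.oid;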
I need to create a table for the trips data but it will have a few key differences from the versions of this table I've created in the past. The first is that primary key constraints aren't supported on foreign tables, so the trip_id field's data type will be BIGINT instead of SERIAL. The second is that the table is declared as a FOREIGN TABLE and takes options at the end of the declaration telling the cstore_fdw extension how to compress the data and how many rows to pack into each block and stripe.
$ psql trips
CREATE FOREIGN TABLE trips
(
trip_id BIGINT,
vendor_id VARCHAR(3),
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL,
store_and_fwd_flag VARCHAR(1),
rate_code_id SMALLINT NOT NULL,
pickup_longitude DECIMAL(18,14),
pickup_latitude DECIMAL(18,14),
dropoff_longitude DECIMAL(18,14),
dropoff_latitude DECIMAL(18,14),
passenger_count SMALLINT NOT NULL DEFAULT '0',
trip_distance DECIMAL(6,3) DEFAULT '0.0',
fare_amount DECIMAL(6,2) DEFAULT '0.0',
extra DECIMAL(6,2) DEFAULT '0.0',
mta_tax DECIMAL(6,2) DEFAULT '0.0',
tip_amount DECIMAL(6,2) DEFAULT '0.0',
tolls_amount DECIMAL(6,2) DEFAULT '0.0',
ehail_fee DECIMAL(6,2) DEFAULT '0.0',
improvement_surcharge DECIMAL(6,2) DEFAULT '0.0',
total_amount DECIMAL(6,2) DEFAULT '0.0',
payment_type VARCHAR(6),
trip_type SMALLINT,
pickup VARCHAR(50),
dropoff VARCHAR(50),
cab_type VARCHAR(6) NOT NULL,
precipitation SMALLINT DEFAULT '0',
snow_depth SMALLINT DEFAULT '0',
snowfall SMALLINT DEFAULT '0',
max_temperature SMALLINT DEFAULT '0',
min_temperature SMALLINT DEFAULT '0',
average_wind_speed SMALLINT DEFAULT '0',
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10),
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13),
pickup_ct2010 VARCHAR(6),
pickup_boroct2010 VARCHAR(7),
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR(4),
pickup_ntaname VARCHAR(56),
pickup_puma VARCHAR(4),
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10),
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13),
dropoff_ct2010 VARCHAR(6),
dropoff_boroct2010 VARCHAR(7),
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR(4),
dropoff_ntaname VARCHAR(56),
dropoff_puma VARCHAR(4)
)
SERVER cstore_server
OPTIONS(compression 'pglz',
block_row_count '40000',
stripe_row_count '600000');
Importing 1.1 Billion Records into ORC Format
I've created a ~/taxi-trips folder with the 56 gzip files of taxi trip data and set their ownership to the postgres user.
$ cd ~/taxi-trips
$ sudo chown postgres trips_x*.csv.gz
I then imported the data as usual into PostgreSQL.
$ screen
$ for filename in *.csv.gz
do
echo $filename
echo "COPY trips
FROM PROGRAM
'gunzip -c `pwd`/$filename'
DELIMITER ',' CSV;" | psql trips
done
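After the load it should also help to collect statistics on the foreign table so the planner has row counts to work with. As far as I'm aware cstore_fdw supports ANALYZE on its tables; if the version in use doesn't, PostgreSQL will simply skip the table.
$ echo "ANALYZE trips;" | psql trips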
When the import is finished I can see the data is around 3.5x smaller than it was when stored using PostgreSQL's standard storage format.
$ sudo su - postgres -c "du -hs ~/9.5/main/* | grep [0-9]G"
81G /var/lib/postgresql/9.5/main/cstore_fdw
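If I remember the extension's helpers correctly, cstore_fdw also ships a cstore_table_size function that reports the same figure from within PostgreSQL; if it isn't available in a given build, the du above is the fallback.
SELECT pg_size_pretty(cstore_table_size('trips'));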
Benchmarking cstore_fdw
The following shows that PostgreSQL will run a foreign scan over the cstore_fdw data when we execute a query on the trips table.
EXPLAIN (COSTS OFF)
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate
   Group Key: cab_type
   ->  Foreign Scan on trips
         CStore File: /var/lib/postgresql/9.5/main/cstore_fdw/16385/16394
The following completed in 2 minutes and 32 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 2 minutes and 55 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 3 minutes and 55 seconds.
SELECT passenger_count,
EXTRACT(year from pickup_datetime) as year,
count(*)
FROM trips
GROUP BY passenger_count,
year;
The following completed in 6 minutes and 8 seconds.
SELECT passenger_count,
EXTRACT(year from pickup_datetime) as year,
round(trip_distance) distance,
count(*) trips
FROM trips
GROUP BY passenger_count,
year,
distance
ORDER BY year,
trips desc;
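cstore_fdw also keeps min/max statistics for each block of rows, so selective WHERE clauses can skip whole blocks rather than decompress them. I haven't timed the query below, and the benefit depends on how well the data happens to be ordered on the filtered column, but it's the kind of query that metadata is designed to help with.
SELECT count(*)
FROM trips
WHERE pickup_datetime >= '2015-01-01'
  AND pickup_datetime <  '2015-02-01';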