
Posted on Mon 11 December 2017

1.1 Billion Taxi Rides with BrytlytDB 2.1 & a 5-node IBM Minsky Cluster

For more than a year I've been trying to gain access to what I would consider a supercomputing cluster on which to run a GPU-based database benchmark. There have been several challenges along the way: some software providers aren't certain their code will scale up to the hardware on offer, and the owners of such hardware are often busy running workloads that could very well solve some of the world's biggest challenges.

BrytlytDB recently added support for IBM's POWER8 CPU architecture and I've been kindly granted access to a five-node IBM Power System S822LC for HPC cluster to run my 1.1 Billion Taxi Rides benchmark on. The systems are more commonly known by their code-name "Minsky" and, in aggregate, the cluster has 20 Nvidia P100 GPUs with a combined total of 71,680 CUDA cores and 320 GB of HBM2 memory that the CUDA cores can communicate with at 14,640 GB/s.

For those unfamiliar with BrytlytDB, I'd describe it as PostgreSQL for GPUs. The regular features of PostgreSQL, such as joins, stored procedures and ODBC support (important for Tableau users), work as expected, but once PostgreSQL puts together an execution plan for a query, BrytlytDB takes over and executes it on one or more of Nvidia's CUDA-powered GPUs. The software has the added benefit of being able to run single queries across multiple GPUs on multiple machines.

In this benchmark I'll see how well the latest version 2.1 of BrytlytDB runs on five IBM Minsky Servers when querying 1.1 billion taxi trips. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. In CSV format the data is about 500 GB in size. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

The Hardware

In this benchmark I'll be using five IBM Power System S822LC for HPC servers. Each 2U server comes with two 8-core, POWER8-based CPUs clocked at up to 4 GHz with eight threads per core.

$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    8
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Model:                 1.0 (pvr 004c 0100)
Model name:            POWER8NVL (raw), altivec supported
CPU max MHz:           4023.0000
CPU min MHz:           2061.0000
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127

The CPUs were set to performance mode for this benchmark.

$ sudo cpupower frequency-set -g performance

There is half a terabyte of DDR4 RAM on each machine.

$ cat /proc/meminfo | head -n1
MemTotal:       535690368 kB

Each server has four Nvidia Tesla P100-SXM2-16GB cards. These cards have 3,584 CUDA cores, 16 GB of HBM2 memory that the CUDA cores can communicate with at 732 GB/s and a total power draw of 300 Watts.

$ nvidia-smi
| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-SXM2...  On   | 0002:01:00.0     Off |                    0 |
| N/A   33C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
|   1  Tesla P100-SXM2...  On   | 0003:01:00.0     Off |                    0 |
| N/A   34C    P0    29W / 300W |      0MiB / 16280MiB |      0%      Default |
|   2  Tesla P100-SXM2...  On   | 000A:01:00.0     Off |                    0 |
| N/A   33C    P0    28W / 300W |      0MiB / 16280MiB |      0%      Default |
|   3  Tesla P100-SXM2...  On   | 000B:01:00.0     Off |                    0 |
| N/A   29C    P0    28W / 300W |      0MiB / 16280MiB |      0%      Default |

The compute cores on the GPUs are clocked at 1.48 GHz, the memory clocks are set to 715 MHz and performance mode is switched on.

$ sudo nvidia-smi -ac 715,1480
$ sudo nvidia-smi -pm ENABLED

The Minsky boxes are equipped with NVLink which provides much faster interconnectivity between the CPUs and GPUs than you'd find in a PCIe-based system. The CPUs can speak to the system's RAM at 115 GB/s, the CPUs and GPUs can all communicate between one another at 80 GB/s thanks in part to NVLink and the GPUs can speak to their memory at 732 GB/s. The ratio of two P100 cards for each CPU seems to create a good balance for BrytlytDB.
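The per-card specs above multiply out to the cluster-wide figures quoted at the top of the post; a quick sketch of the arithmetic:

```python
# Aggregate the per-card Tesla P100-SXM2-16GB specs across the cluster.
cards = 5 * 4                          # five servers, four cards each

cuda_cores = cards * 3_584             # CUDA cores per card
hbm2_gb = cards * 16                   # GB of HBM2 per card
hbm2_gbps = cards * 732                # GB/s of memory bandwidth per card

print(cuda_cores, hbm2_gb, hbm2_gbps)  # 71680 320 14640
```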

Each server has a Seagate 1 TB 2.5" 7200 RPM hard disk drive and a 1.6 TB NVMe flash drive. Below is how the partitions are set up.

$ lsblk
sda           8:0    1 931.5G  0 disk
|-sda1        8:1    1     7M  0 part
|-sda2        8:2    1 929.6G  0 part /
`-sda3        8:3    1   1.9G  0 part [SWAP]
sdb           8:16   1 931.5G  0 disk
sr0          11:0    1  1024M  0 rom
sr1          11:1    1  1024M  0 rom
sr2          11:2    1  1024M  0 rom
sr3          11:3    1  1024M  0 rom
nvme0n1     259:0    0   1.5T  0 disk
`-nvme0n1p1 259:1    0   1.5T  0 part /nvme3T

I ran dd against the NVMe drive on one machine to get an idea of the single-process, single-thread speeds it could deliver. Writes maxed out at 3.8 GB/s and reads at 1 GB/s, though bear in mind the write test went through the page cache (dd wasn't run with oflag=direct), so the write figure is optimistic.

$ sudo dd if=/dev/zero of=/nvme3T/test bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.137118 s, 3.8 GB/s
$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ dd if=/nvme3T/test of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.520857 s, 1.0 GB/s
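The GB/s figures in dd's output are simply bytes copied divided by elapsed seconds (dd uses decimal gigabytes); reproducing them from the numbers above:

```python
# dd reports throughput as bytes copied / elapsed wall-clock time.
bytes_copied = 524_288_000             # 500 MiB, from dd's output

write_gbps = bytes_copied / 0.137118 / 1e9
read_gbps = bytes_copied / 0.520857 / 1e9

print(f"write: {write_gbps:.1f} GB/s, read: {read_gbps:.1f} GB/s")
# → write: 3.8 GB/s, read: 1.0 GB/s
```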

Though the Minsky machines do come with InfiniBand network ports, BrytlytDB keeps network chatter to a minimum, so for this benchmark a 1 Gbps network was used for communication between the five machines.

BrytlytDB Up and Running

Each of the servers is running Ubuntu 16.04.1 LTS with a 4.4.0-57-generic Kernel compiled for ppc64le. Docker version 17.06.0-ce along with Docker Engine Utility for NVIDIA GPUs were already installed on the machines when I gained access to them.

For this benchmark I'll be running the new version 2.1 of BrytlytDB, which supports POWER8. To minimise the installation steps I'll be using a Docker image with the software already set up.

$ sudo nvidia-docker login
$ sudo nvidia-docker pull brytlytdb/ppc64-4n

I then launched the containers on each of the five servers. PostgreSQL's CLI client port 5432 and BrytlytDB's data node communication ports will be exposed from the Docker container.

$ sudo nvidia-docker run \
    -dti \
    --name=cluster \
    -p 5432:5432 \
    -p 30000-30003:30000-30003 \
    brytlytdb/ppc64-4n

$ sudo nvidia-docker exec \
    cluster /root/brytlyt start

I do have to highlight an oversight: these containers were launched on the mechanical disks rather than the NVMe drives on each of the machines. This will have an impact on the performance of any I/O operation BrytlytDB performs. I'm disappointed this happened and would have rerun the benchmark, but I only noticed the mistake after I'd surrendered the servers back to their owner.

On the coordinator I'll register the other data nodes. The network is set up such that the other machines are addressed via publicly-routable IPv4 addresses, so for the sake of the owner's privacy I'll be obscuring the first three octets of each address.

$ sudo docker exec -ti cluster \
      /usr/local/brytlyt/bin/psql \
      brytlyt brytlyt
CREATE NODE d04 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15432);
CREATE NODE d05 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15433);
CREATE NODE d06 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15434);
CREATE NODE d07 WITH (HOST='x.x.x.213', TYPE='datanode', PORT=15435);

CREATE NODE d08 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15432);
CREATE NODE d09 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15433);
CREATE NODE d10 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15434);
CREATE NODE d11 WITH (HOST='x.x.x.103', TYPE='datanode', PORT=15435);

CREATE NODE d12 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15432);
CREATE NODE d13 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15433);
CREATE NODE d14 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15434);
CREATE NODE d15 WITH (HOST='x.x.x.104', TYPE='datanode', PORT=15435);

CREATE NODE d16 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15432);
CREATE NODE d17 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15433);
CREATE NODE d18 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15434);
CREATE NODE d19 WITH (HOST='x.x.x.105', TYPE='datanode', PORT=15435);

SELECT pgxc_pool_reload();

With those in place there is now a data node for each of the 20 GPUs in the cluster and they've been federated on the coordinator.
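The CREATE NODE and pgxc_pool_reload() syntax above comes from Postgres-XL, which BrytlytDB builds on, so (assuming the Postgres-XL catalog carries over) the registered data nodes can be listed from the coordinator to confirm the federation worked:

```sql
SELECT node_name, node_host, node_port, node_type
FROM pgxc_node
ORDER BY node_name;
```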

Loading 1.1 Billion Trips into BrytlytDB

I'll first define the table schema for the 1.1 billion taxi trip records.

$ sudo docker exec -ti cluster \
      /usr/local/brytlyt/bin/psql \
      brytlyt brytlyt
CREATE FOREIGN TABLE trips (
    trip_id                 INTEGER,
    vendor_id               VARCHAR(3),

    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,

    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE PRECISION,
    pickup_latitude         DOUBLE PRECISION,
    dropoff_longitude       DOUBLE PRECISION,
    dropoff_latitude        DOUBLE PRECISION,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE PRECISION,
    fare_amount             DOUBLE PRECISION,
    extra                   DOUBLE PRECISION,
    mta_tax                 DOUBLE PRECISION,
    tip_amount              DOUBLE PRECISION,
    tolls_amount            DOUBLE PRECISION,
    ehail_fee               DOUBLE PRECISION,
    improvement_surcharge   DOUBLE PRECISION,
    total_amount            DOUBLE PRECISION,
    payment_type            VARCHAR(10),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),

    cab_type                VARCHAR(10),

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) SERVER gm_fdw_server OPTIONS (
    max_size '55682651',
    index '10:24');

In the above table declaration the index '10:24' option creates indices on the passenger_count and cab_type columns (columns 10 and 24, counting from zero).

The gm_fdw_server identifier tells PostgreSQL to connect to BrytlytDB's GPU foreign data wrapper.

The 55,682,651 max_size value is the maximum number of rows I'm intending to import onto each data node. There are 20 data nodes so this should cover the 1.1 billion total records size of the dataset.
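A quick arithmetic check that this setting covers the dataset (the 55,682,651 figure is the max_size value from the OPTIONS clause above):

```python
# 20 data nodes, each capped at max_size rows, versus ~1.1 billion records.
max_size = 55_682_651
data_nodes = 20

capacity = max_size * data_nodes
print(capacity)                    # 1113653020
print(capacity >= 1_100_000_000)   # True
```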

I'm using the Round Robin data distribution policy here. This table won't be joined with any other data; if it were, HASH and/or REPLICATE would be better options.

Note that since my last BrytlytDB benchmark version 2.1 of the software has been released and there is now support for both TIMESTAMP and SMALLINT fields.

For this benchmark I've downloaded and decompressed the 500 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post. I'll dedicate 1/5th of the dataset for each of the five servers. On each server its dataset is broken up again into four parts so that each CSV file targets one data node. There is a 1-to-1 ratio of data nodes to Tesla P100 cards. The following was all run on the coordinator.

$ vi urls.txt
$ cat urls.txt | \
    xargs -n 1 -P 6 \
        wget
$ find trips_x*.csv.gz | \
    xargs -n 1 -P 16 \
        gunzip
$ cat trips_x*.csv \
    | split --lines=55682651 \
            --additional-suffix=.csv

$ count=1  # node numbering and load_data's file argument are assumed below

$ for filename in xa*.csv; do
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON ( data_datanode$count )
                'SELECT load_data(''trips'',
                                  ''$filename'',
                                  ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON ( data_datanode$(($count+4)) )
                'SELECT load_data(''trips'',
                                  ''$filename'',
                                  ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON ( data_datanode$(($count+8)) )
                'SELECT load_data(''trips'',
                                  ''$filename'',
                                  ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON ( data_datanode$(($count+12)) )
                'SELECT load_data(''trips'',
                                  ''$filename'',
                                  ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON ( data_datanode$(($count+16)) )
                'SELECT load_data(''trips'',
                                  ''$filename'',
                                  ''append'')'" 2>&1 &
    (( count++ ))
done
The above took around 2.5 hours to complete.

Benchmarking BrytlytDB

The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".

$ sudo docker exec \
    -ti cluster /usr/local/brytlyt/bin/psql \
        brytlyt brytlyt
\timing on

The following completed in 0.0048 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

The following completed in 0.011 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

The following completed in 0.103 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       count(*)
FROM trips
GROUP BY passenger_count,
         pickup_year;

The following completed in 0.188 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       cast(trip_distance as int) AS distance,
       count(*) AS the_count
FROM trips
GROUP BY passenger_count,
         pickup_year,
         distance
ORDER BY pickup_year,
         the_count desc;

The above times represent a brand-new fastest benchmark: Brytlyt has beaten its own previously held record by a factor of two. When I started doing these benchmarks, getting results in minutes felt like an accomplishment. Then I began using more OLAP-oriented systems, query times came down to seconds and again I was celebrating the hardware and software setups I'd discovered. I'm now looking at a query 1 time of 4.8 milliseconds, meaning it could run 208 times in a single second. That's twenty-five times faster than the roughly 120 milliseconds it takes a human to perceive a visual stimulus, and that is simply amazing.

Firms and institutions purchase clusters like this for a variety of applications and they expect to be able to run multiple workloads on their investment. After laying out such a large amount of capital it would leave a bad taste in my mouth to find my applications were bottlenecked and couldn't reasonably scale up to the hardware capability on hand. I'm extremely impressed this is not the case with BrytlytDB which has managed to make effective use of such a powerful cluster.

It's also good to see BrytlytDB's Docker image help keep the installation steps to a minimum on bare metal. Using Docker to tuck away setup complexities is a real timesaver and it's another feature I admire about the software.

Practically every component in these Minsky boxes was among the most powerful I've ever worked with. Having more than a terabyte of NVMe storage, InfiniBand ports at the ready, 128 logical CPU cores and, to top it all off, the Nvidia P100 cards was just incredible. Though I couldn't max out every single aspect of these machines, I'm very grateful to have been given the opportunity to work with such a powerful cluster.

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in North America & Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2017 Mark Litwintschik. This site's template is based on a template by Giulio Fidente.