Update: MapD rebranded as OmniSci in 2018.
On April 26th, 2017 MapD 3.0 was released. One of the big announcements with this release is that support for clustering across multiple machines is now native to the software.
Most motherboards support more than one graphics card but at some point you'll run into a PCIe bus bottleneck. To add to that, Nvidia's drivers limit the number of GPUs visible on any one machine to 64. Being able to use multiple machines helps get around both of these limitations.
When scaling out MapD in the past I've used GPU wrapping software to cluster several graphics cards on separate machines together into one "virtual" graphics card. This in itself is still useful for other GPU-accelerated applications but with MapD 3.0's clustering support there is now the added advantage of being able to use not only all of the GPUs but all of the CPUs, memory and disk drives across the cluster as well. This allows for a more horizontally scalable deployment.
The three major cloud providers (AWS, Google Cloud and Azure) all now include Nvidia Tesla K80-powered instances in their offerings. The Nvidia K80 graphics card is based on the Kepler micro-architecture rather than Maxwell or Pascal but nonetheless it sports two GK210 GPUs with a combined 4,992 CUDA cores, a 2 x 384-bit-wide memory bus and 24 GB of GDDR5 RAM. The K80 can access its memory very quickly, with an aggregated throughput of 480 GB/s, whereas a conventional CPU might only be able to access DDR4 RAM at 90 GB/s. The K80 might not be Nvidia's latest Tesla-series offering but it's still a beast.
In this benchmark I'll see how well 8 Tesla K80 cards spread across two EC2 instances perform when querying 1.1 billion taxi trips. I'll be using the same dataset I've used to benchmark Amazon Athena, BigQuery, ClickHouse, Elasticsearch, EMR, kdb+/q, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks.
A GPU-Powered AWS EC2 Cluster
I'll be using two p2.8xlarge EC2 instances in Amazon's Oregon region (us-west-2) for this benchmark. Each of these machines comes with 4 Nvidia K80s, 32 virtual CPUs reported as Intel Xeon E5-2686 v4 CPUs clocked at 2.30GHz, 488 GB of RAM and a 10 Gbps network connection.
When launched on-demand each instance costs $7.20 / hour. I did check the spot price history while putting this blog post together and for the last week the most common price was $1.80 / hour in the us-west-2b availability zone.
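For reference, the spot price history itself can be pulled with the AWS CLI. This is roughly the command I'd use to check it; the filters shown are just the ones relevant to this setup.

$ aws ec2 describe-spot-price-history \
    --region us-west-2 \
    --availability-zone us-west-2b \
    --instance-types p2.8xlarge \
    --product-descriptions "Linux/UNIX"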
Below is the output from Nvidia's System Management Interface on a p2.8xlarge instance. As you can see it lists the 8 GPUs across the 4 cards individually.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   62C    P0    70W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   54C    P0    82W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   62C    P0    69W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   51C    P0    76W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   63C    P0    65W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   54C    P0    83W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   63C    P0    71W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   55C    P0    88W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
The first instance launched in this EC2 cluster will be both a MapD leaf and aggregator node. The second instance will be both a MapD leaf and string dictionary node. Both machines belong to a security group that allows them to communicate with one another on TCP port 19091 for the aggregator server, TCP port 9091 for leaf communication and TCP port 10301 for the string dictionary server.
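For anyone reproducing this, those ingress rules can be created with the AWS CLI along these lines. The security group ID below is a placeholder; pointing --source-group back at the same group keeps the ports closed to everything outside the cluster.

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9091  --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 10301 --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 19091 --source-group sg-0123456789abcdef0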
I'll be using the AMI image ami-4836a428 / amzn-ami-hvm-2017.03.0.20170417-x86_64-gp2 for both machines.
Each instance has its own 1.1 TB EBS volume.
$ df -H
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        258G   62k  258G   1% /dev
tmpfs           258G     0  258G   0% /dev/shm
/dev/xvda1      1.1T  705G  352G  67% /
Downloading 1.1 Billion Taxi Journeys
On each EC2 instance I'll set the AWS CLI tool to use 100 concurrent requests so I can better saturate the network connection when downloading the taxi trips dataset from S3.
$ aws configure set \
default.s3.max_concurrent_requests \
100
I'll then download the 104 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post onto each instance. This data sits across 56 GZIP files and decompresses into around 500 GB of raw CSV data.
$ mkdir ~/csvData
$ cd ~/csvData/
$ aws s3 sync s3://<s3_bucket>/csv/ ./
$ gunzip trips_x*.csv.gz
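As a quick sanity check that all 56 files made it down and decompressed, the files can be counted. This is just a shell one-liner rather than part of the original workflow.

$ ls trips_x*.csv | wc -l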
MapD 3.0 Up & Running
Everything below, unless otherwise noted, was run on both EC2 instances.
I'm going to install Nvidia's 375.51 driver along with two of its requirements: GCC and the kernel development package, which provides the headers needed to compile kernel modules.
$ sudo yum install gcc kernel-devel-`uname -r`
$ curl -O http://us.download.nvidia.com/XFree86/Linux-x86_64/375.51/NVIDIA-Linux-x86_64-375.51.run
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-375.51.run
As per Amazon's recommendations, I'm going to switch the Nvidia driver into persistent mode, turn off auto-boost and set the GPUs' memory clock speed to 2,505 MHz and their compute clock speed to 875 MHz.
$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi --auto-boost-default=0
$ sudo nvidia-smi -ac 2505,875
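The applied clock settings can be double-checked with nvidia-smi's query mode.

$ nvidia-smi -q -d CLOCK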
MapD is commercial software so I cannot disclose the full URL I've downloaded the self-extracting archive from. Nonetheless, these are the steps needed to install MapD. If you're interested in your own copy of MapD, info@mapd.com should be able to help.
$ mkdir -p ~/prod/installs
$ cd ~/prod/installs
$ curl -O <server>/mapd2-master-Linux-x86_64.sh
$ chmod +x mapd2-master-Linux-x86_64.sh
$ ./mapd2-master-Linux-x86_64.sh
$ ln -s ~/prod/installs/mapd-3.0.0dev-20170419-b8623c9-Linux-x86_64/ \
~/prod/mapd
I'll then add MapD to the PATH environment variable and the Java installation to the library path.
$ export PATH=/home/ec2-user/prod/mapd/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/lib/jvm/jre/lib/amd64/server:$LD_LIBRARY_PATH
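Those exports only last for the current shell session. If you want them to survive new SSH logins, appending them to ~/.bashrc is one option; as far as I know the MapD installer doesn't do this for you.

$ echo 'export PATH=/home/ec2-user/prod/mapd/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/lib/jvm/jre/lib/amd64/server:$LD_LIBRARY_PATH' >> ~/.bashrc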
On the first EC2 instance I'll create some folders and set up the system's cluster configuration file with the following contents.
$ mkdir -p ~/prod/mapd-storage/{0,1}/data
$ vi ~/prod/mapd-storage/cluster.conf
[
{
"host": "node1",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 10301,
"role": "string"
}
]
I'll set the leaf node configuration to the following.
$ vi ~/prod/mapd-storage/0/mapd.conf
port = 9091
http-port = 9090
data = "/home/ec2-user/prod/mapd-storage/0/data"
read-only = false
quiet = false
string-servers = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 9092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll set the aggregator node configuration to the following.
$ vi ~/prod/mapd-storage/1/mapd.conf
port = 19091
http-port = 19090
data = "/home/ec2-user/prod/mapd-storage/1/data"
read-only = false
quiet = false
cluster = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 19092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll then initialise the leaf and aggregator databases and launch their respective services.
$ cd ~/prod/mapd-storage/0
$ initdb data
$ mapd_server --config mapd.conf
$ cd ~/prod/mapd-storage/1
$ initdb data
$ mapd_server --config mapd.conf
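Note that mapd_server runs in the foreground, so launching both servers from one SSH session needs either separate terminals or backgrounding. Something along these lines would also work; the log file names are just my own choice rather than anything MapD prescribes.

$ cd ~/prod/mapd-storage/0
$ nohup mapd_server --config mapd.conf > leaf.log 2>&1 &
$ cd ~/prod/mapd-storage/1
$ nohup mapd_server --config mapd.conf > aggregator.log 2>&1 &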
On the second EC2 instance I'll set up the string dictionary server configuration.
$ mkdir -p ~/prod/mapd-storage/{strings,data}
$ vi ~/prod/mapd-storage/mapd-sds.conf
port = 10301
path = "/home/ec2-user/prod/mapd-storage/strings"
I'll set the leaf node configuration to the following.
$ vi ~/prod/mapd-storage/mapd.conf
port = 9091
http-port = 9090
data = "/home/ec2-user/prod/mapd-storage/data"
read-only = false
quiet = false
string-servers = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 9092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll set the second node's cluster.conf file to the following.
$ vi ~/prod/mapd-storage/cluster.conf
[
{
"host": "node1",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 10301,
"role": "string"
}
]
I'll then initialise the leaf node database and launch the two services.
$ cd ~/prod/mapd-storage
$ initdb data
$ mapd_server --config mapd.conf
$ StringDictionaryServer --config mapd-sds.conf
Importing 1.1 Billion Trips Into MapD
The following is the schema for my taxi trips table. The fragment size, which controls how many records go into each fragment (the unit of data MapD distributes across the GPUs), will be set to 85 million for this exercise.
$ vi create_trips_table.sql
CREATE TABLE trips (
trip_id INTEGER,
vendor_id VARCHAR(3) ENCODING DICT,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1) ENCODING DICT,
rate_code_id SMALLINT,
pickup_longitude DECIMAL(14,2),
pickup_latitude DECIMAL(14,2),
dropoff_longitude DECIMAL(14,2),
dropoff_latitude DECIMAL(14,2),
passenger_count SMALLINT,
trip_distance DECIMAL(14,2),
fare_amount DECIMAL(14,2),
extra DECIMAL(14,2),
mta_tax DECIMAL(14,2),
tip_amount DECIMAL(14,2),
tolls_amount DECIMAL(14,2),
ehail_fee DECIMAL(14,2),
improvement_surcharge DECIMAL(14,2),
total_amount DECIMAL(14,2),
payment_type VARCHAR(3) ENCODING DICT,
trip_type SMALLINT,
pickup VARCHAR(50) ENCODING DICT,
dropoff VARCHAR(50) ENCODING DICT,
cab_type VARCHAR(6) ENCODING DICT,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10) ENCODING DICT,
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13) ENCODING DICT,
pickup_ct2010 VARCHAR(6) ENCODING DICT,
pickup_boroct2010 VARCHAR(7) ENCODING DICT,
pickup_cdeligibil VARCHAR(1) ENCODING DICT,
pickup_ntacode VARCHAR(4) ENCODING DICT,
pickup_ntaname VARCHAR(56) ENCODING DICT,
pickup_puma VARCHAR(4) ENCODING DICT,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10) ENCODING DICT,
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13) ENCODING DICT,
dropoff_ct2010 VARCHAR(6) ENCODING DICT,
dropoff_boroct2010 VARCHAR(7) ENCODING DICT,
dropoff_cdeligibil VARCHAR(1) ENCODING DICT,
dropoff_ntacode VARCHAR(4) ENCODING DICT,
dropoff_ntaname VARCHAR(56) ENCODING DICT,
dropoff_puma VARCHAR(4) ENCODING DICT
) WITH (FRAGMENT_SIZE=85000000);
I'll create an environment variable with my credentials for MapD.
$ read MAPD_PASSWORD
$ export MAPD_PASSWORD
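If you'd rather the password wasn't echoed to the terminal while you type it, bash's read command has a silent flag that does the same job.

$ read -s MAPD_PASSWORD
$ export MAPD_PASSWORD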
I'll then execute the above SQL script through the aggregator so that the table is created on both leaf nodes.
$ mapdql -p $MAPD_PASSWORD \
--port 19091 \
< create_trips_table.sql
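To confirm the table exists before loading anything, mapdql's meta-commands can be used; if memory serves, \t lists the tables and \d describes one.

$ mapdql -p $MAPD_PASSWORD --port 19091
mapdql> \t
mapdql> \d trips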
I'll be cutting the dataset roughly in half, with each node holding around 50% of the records. Each node has its own version of a load.sql script that loads a different portion of the 56-file dataset.
This is the script for the first node.
$ cat load1.sql
COPY trips FROM '/home/ec2-user/csvData/trips_xa*.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xca.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcb.csv' WITH (header='false');
This is the script for the second node.
$ cat load2.sql
COPY trips FROM '/home/ec2-user/csvData/trips_xb*.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcc.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcd.csv' WITH (header='false');
To avoid going through the aggregator and having it become a bottleneck, the following commands were issued from the first server but each load runs independently against an individual leaf node.
$ time mapdql -p $MAPD_PASSWORD --server 127.0.0.1 < load1.sql &
$ time mapdql -p $MAPD_PASSWORD --server 172.31.19.152 < load2.sql &
The first node loaded its dataset in 21 minutes and 58 seconds and the second node loaded its dataset in 26 minutes and 24 seconds.
The above load times are where the clustered approach really begins to shine. Last June I did a benchmark with 8 Nvidia Tesla K80s on a single machine and it took 51 minutes and 47 seconds to import this dataset. It's great to see these two machines are able to break apart this job and complete it in roughly half the time.
With the data loaded, I can connect via the aggregator and see that all 1.1 billion records are visible.
$ mapdql -p $MAPD_PASSWORD \
--port 19091
mapdql> select count(*) from trips;
EXPR$0
1113653018
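Out of curiosity, each leaf can also be queried directly on its default port of 9091, bypassing the aggregator, to see how the records were split; each should report roughly half of the total. The addresses below are the same ones used during the load: 127.0.0.1 for the local leaf and 172.31.19.152 for the second node.

$ echo "SELECT COUNT(*) FROM trips;" | mapdql -p $MAPD_PASSWORD --server 127.0.0.1
$ echo "SELECT COUNT(*) FROM trips;" | mapdql -p $MAPD_PASSWORD --server 172.31.19.152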
Benchmarking MapD 3.0
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
$ mapd/bin/mapdql \
mapd \
-u $MAPD_USERNAME \
-p $MAPD_PASSWORD
\timing on
The following completed in 0.034 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 0.061 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 0.178 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
count(*)
FROM trips
GROUP BY passenger_count,
pickup_year;
The following completed in 0.498 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
cast(trip_distance as int) AS distance,
count(*) AS the_count
FROM trips
GROUP BY passenger_count,
pickup_year,
distance
ORDER BY pickup_year,
the_count desc;
Although the above results place this benchmark in 4th place on my benchmarks recap board, it's worth mentioning that the recap board is sorted by the fastest query 1 time. Query 2's time outperformed all but one other benchmark and query 4's time is the fastest ever. Query 4 seems to scale linearly with the number of nodes in the cluster: I've run this query in the past on a single machine with the same number of K80s and execution took almost twice as long.
When I first heard MapD would natively support clustering I wondered what impact network overhead would have on their implementation. It's great to see that MapD still optimises data locality extremely well and that network chatter doesn't seem to have a noticeable impact on query times.
If I had launched the cluster using spot instances I would have spent $3.60 / hour. I'm pretty sure this is the best price for performance I've seen in any of my benchmarks.
This benchmark ran on what is now a two-generations-old micro-architecture from Nvidia. I can't imagine what kind of numbers would be seen on their Pascal-powered P100 GPUs, nor what the times would look like if I'd used more than two EC2 instances. The performance results are very impressive to say the least but what's more impressive is that I know there is a lot more performance yet to be unlocked.