1.1 Billion Taxi Rides with MapD & AWS EC2

Update: MapD rebranded as OmniSci in 2018.

In my last two blog posts I've looked at querying 1.1 billion taxi trips made in New York City over the course of six years using MapD on 8 Nvidia Telsa K80 GPU cards and 4 Nvidia Titan X cards respectively. In both of these scenarios I was working with bare metal. In this blog post I'll look at running the same benchmark on 4 g2.8xlarge EC2 instances on AWS.

To make all the GPUs on each of the 4 instances look like they're on the same machine I'll be using Bitfusion Boost wrapper around MapD.

An EC2 GPU Cluster Up & Running

Bitfusion has an AWS Marketplace product called Bitfusion Boost Ubuntu 14 Cuda 7.5 that will launch a cluster using CloudFormation and bootstrap each node in the cluster with their software and dependencies, so that in the end it all looks like a giant instance with 16 GPUs.

The g2.8xlarge instances each have 2 Nvidia GRID K520 cards and each card has 2 GPUs and 8 GB GDDR5 of memory (4 GB per card). The last price I saw for each of the cards on Amazon.com was $1,590 so $2.808 / hour for two of them seems like a steal to me.

On top of the 4 g2.8xlarge instances, a storage-optimised i2.4xlarge instance will be launched which will act as a head node of the cluster and store all the data when it's at rest.

Once CloudFormation has done its thing we can SSH into the head node.

$ ssh ubuntu@52.38.222.214

Once connect we're greeted with the following message:

########################################################################################################################
########################################################################################################################

  ____  _ _    __           _               _
 | __ )(_) |_ / _|_   _ ___(_) ___  _ __   (_) ___
 |  _ \| | __| |_| | | / __| |/ _ \| '_ \  | |/ _ \
 | |_) | | |_|  _| |_| \__ \ | (_) | | | |_| | (_) |
 |____/|_|\__|_|  \__,_|___/_|\___/|_| |_(_)_|\___/


Welcome to Bitfusion Boost Ubuntu 14 Cuda 7.5 - Ubuntu 14.04 LTS (GNU/Linux 3.13.0-88-generic x86_64)

This AMI is brought to you by Bitfusion.io
http://www.bitfusion.io

Please email all feedback and support requests to:
amisupport@bitfusion.io

We would love to hear from you! Contact us with any feedback or a feature request at the email above.

########################################################################################################################
########################################################################################################################


    Instance Type:  i2.4xlarge
    Number of GPUs: 16

Downloading 1.1 Billion Taxi Journeys

I'll download the 104 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post. This data sits in 56 GZIP files and decompresses into around 500 GB of raw CSV data.

But before I can download that data I'll create a RAID 0, striped array for the data to live on. The i2.4xlarge instance type is supposed to come with 4 drives but lsblk was only showing three being available when we launched this instance. I need about 500 GB of space for the uncompressed CSV files and 350 GB of space for MapD's internal representation of this data so the 1.5 TB of space I'll get from the two drives should be more than enough.

The first drive is already in use so I'll fdisk the two remaining drives, use mdadm to setup the RAID array and then I'll format the array.

$ sudo apt update
$ sudo apt install mdadm

$ sudo fdisk /dev/xvdb
$ sudo fdisk /dev/xvdc

$ sudo mdadm \
    -C /dev/md0 \
    -l raid0 \
    -n 2 \
    /dev/xvd[b-c]1

$ sudo mkfs.ext3 /dev/md0

Now that the drives are ready I'll mount them to the /data directory.

$ sudo mkdir /data
$ sudo mount /dev/md0 /data
$ sudo chown -R ubuntu:ubuntu /data

Once that's done we can see the RAID array is setup by running lsblk.

$ lsblk

NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
xvda    202:0    0   500G  0 disk
└─xvda1 202:1    0   500G  0 part  /
xvdb    202:16   0 745.2G  0 disk
└─xvdb1 202:17   0 745.2G  0 part
  └─md0   9:0    0   1.5T  0 raid0 /data
xvdc    202:32   0 745.2G  0 disk
└─xvdc1 202:33   0 745.2G  0 part
  └─md0   9:0    0   1.5T  0 raid0 /data

I'll now use the AWS CLI tool to download the 104 GB of compressed CSV data and store it on the RAID array.

$ mkdir /data/benchdata
$ cd /data/benchdata

$ aws configure set \
    default.s3.max_concurrent_requests \
    100

$ aws s3 sync s3://<s3_bucket>/csv/ ./

It took 11 minutes and 42 seconds to download the data from the Irish S3 bucket I keep the taxi data in to the cluster in Oregon at the rate of 1.185 Gbit/s.

MapD doesn't support loading CSV data from GZIP files at this time so I'll decompress the CSV files before loading them. The i2.4xlarge instance type has 16 vCPUs so I'll use xargs to run 16 simultaneous instances of gunzip.

$ find *.gz | xargs -n 1 -P 16 gunzip

Updating Bitfusion

Normally Bitfusion's AMI will have the latest version of Boost but I was told prior to running the benchmark there was a last-minute release with a number of improvements so to take advantage of them I updated each g2.8xlarge instance.

The following shows the private IP addresses of each node in the GPU cluster.

$ cat /etc/bitfusionio/adaptor.conf

10.0.0.207
10.0.0.208
10.0.0.209
10.0.0.210

I then SSH'ed into each instance and updated the software with the following steps:

$ ssh -i ~/.ssh/mykey.pem 10.0.0.207

$ sudo service bfboost-opencl stop
$ sudo service bfboost-cuda-server stop

$ sudo apt update
$ sudo apt install bfboost

$ sudo sed -i "s/respawn/respawn\nlimit nofile 1024000 1024000/" \
              /etc/init/bfboost-cuda-server.conf
$ sudo sed -i "s/respawn/respawn\nlimit nofile 1024000 1024000/" \
              /etc/init/bfboost-opencl.conf

$ sudo service bfboost-cuda-server start
$ sudo service bfboost-opencl start

I then checked to make sure I'm running the specific point release I was looking to upgrade to.

$ sudo dpkg --list | grep bfboost

0.1.0+1532

The following is the output from Nvidia's system management interface on a g2.8xlarge instance. Note this tool isn't running via the Bitfusion Boost wrapper and is only showing the hardware available to this one specific instance.

$ nvidia-smi

+------------------------------------------------------+
| NVIDIA-SMI 352.93     Driver Version: 352.93         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 0000:00:03.0     Off |                  N/A |
| N/A   31C    P8    17W / 125W |   2098MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           On   | 0000:00:04.0     Off |                  N/A |
| N/A   30C    P8    17W / 125W |   2098MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           On   | 0000:00:05.0     Off |                  N/A |
| N/A   32C    P8    18W / 125W |   2098MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           On   | 0000:00:06.0     Off |                  N/A |
| N/A   31C    P8    17W / 125W |   2098MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      6784    C   /usr/bin/cuda-server                          2085MiB |
|    1      6784    C   /usr/bin/cuda-server                          2085MiB |
|    2      6784    C   /usr/bin/cuda-server                          2085MiB |
|    3      6784    C   /usr/bin/cuda-server                          2085MiB |
+-----------------------------------------------------------------------------+

And now back on the head node I can now run Bitfusion Boost and see 16 GPUs available.

$ bfboost client \
    /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 16 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4096 MBytes (4294770688 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
...

MapD Up & Running

I will now install MapD. MapD is commercial software, but if you are interested in doing this yourself or in proof of concept with your own datasets, you can email them at info@mapd.com. For this benchmark I was kindly provided with the latest version of their software via a link.

$ mkdir -p /data/prod/
$ cd /data/prod/

$ wget http://.../mapd2-latest-Linux-x86_64.sh
$ chmod +x mapd2-latest-Linux-x86_64.sh
$ ./mapd2-latest-Linux-x86_64.sh

I'll then create a symlink so I have a simpler folder name to work with.

$ ln -s mapd2-1.2.0-20160711-90d6743-Linux-x86_64/ mapd

I'll then create a data directory for MapD to store its internal files in and initialise the database.

$ mkdir -p /data/prod/mapd-storage/data
$ mapd/bin/initdb --data /data/prod/mapd-storage/data

MapD has a Java dependency and the LD_LIBRARY_PATH environment variable that comes out of the box with the i2.4xlarge instance isn't pointing at the correct location so the following will fix that.

$ export LD_LIBRARY_PATH=/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/amd64/server/

With all those dependencies in place we can launch MapD's server within the Bitfusion Boost wrapper.

$ nohup bfboost client \
    "mapd/bin/mapd_server --data /data/prod/mapd-storage/data" &

Importing 1.1 Billion Trips Into MapD

I'll create a schema for the trips table and use a fragment size of 75 million.

$ vi create_trips_table.sql

CREATE TABLE trips (
    trip_id                 INTEGER,
    vendor_id               VARCHAR(3) ENCODING DICT,

    pickup_datetime         TIMESTAMP,

    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      VARCHAR(1) ENCODING DICT,
    rate_code_id            SMALLINT,
    pickup_longitude        DECIMAL(14,2),
    pickup_latitude         DECIMAL(14,2),
    dropoff_longitude       DECIMAL(14,2),
    dropoff_latitude        DECIMAL(14,2),
    passenger_count         SMALLINT,
    trip_distance           DECIMAL(14,2),
    fare_amount             DECIMAL(14,2),
    extra                   DECIMAL(14,2),
    mta_tax                 DECIMAL(14,2),
    tip_amount              DECIMAL(14,2),
    tolls_amount            DECIMAL(14,2),
    ehail_fee               DECIMAL(14,2),
    improvement_surcharge   DECIMAL(14,2),
    total_amount            DECIMAL(14,2),
    payment_type            VARCHAR(3) ENCODING DICT,
    trip_type               SMALLINT,
    pickup                  VARCHAR(50) ENCODING DICT,
    dropoff                 VARCHAR(50) ENCODING DICT,

    cab_type                VARCHAR(6) ENCODING DICT,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10) ENCODING DICT,
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13) ENCODING DICT,
    pickup_ct2010           VARCHAR(6) ENCODING DICT,
    pickup_boroct2010       VARCHAR(7) ENCODING DICT,
    pickup_cdeligibil       VARCHAR(1) ENCODING DICT,
    pickup_ntacode          VARCHAR(4) ENCODING DICT,
    pickup_ntaname          VARCHAR(56) ENCODING DICT,
    pickup_puma             VARCHAR(4) ENCODING DICT,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10) ENCODING DICT,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13) ENCODING DICT,
    dropoff_ct2010          VARCHAR(6)  ENCODING DICT,
    dropoff_boroct2010      VARCHAR(7)  ENCODING DICT,
    dropoff_cdeligibil      VARCHAR(1)  ENCODING DICT,
    dropoff_ntacode         VARCHAR(4)  ENCODING DICT,
    dropoff_ntaname         VARCHAR(56) ENCODING DICT,
    dropoff_puma            VARCHAR(4)  ENCODING DICT
) WITH (FRAGMENT_SIZE=75000000);

I'll create two environment variables with my credentials for MapD.

$ read MAPD_USERNAME
$ read MAPD_PASSWORD
$ export MAPD_USERNAME
$ export MAPD_PASSWORD

The following will create the table schema using the mapdql CLI tool.

$ mapd/bin/mapdql mapd \
    -u $MAPD_USERNAME \
    -p $MAPD_PASSWORD \
    < create_trips_table.sql

With the table and files in place I'll load the 500 GB of CSV data into MapD. The following completed in 60 minutes and 42 seconds.

$ for filename in /data/benchdata/*.csv; do
      echo "COPY trips
            FROM '$filename'
            WITH (header='false');" | \
          mapd/bin/mapdql \
              mapd \
              -u $MAPD_USERNAME \
              -p $MAPD_PASSWORD
  done

Here's a snapshot of top during the import.

top - 07:06:57 up  2:13,  2 users,  load average: 10.75, 10.51, 10.25
Tasks: 479 total,   1 running, 478 sleeping,   0 stopped,   0 zombie
%Cpu0  : 74.3 us,  5.3 sy,  0.0 ni, 20.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 75.1 us,  1.7 sy,  0.0 ni, 23.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 74.0 us,  2.3 sy,  0.0 ni, 23.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 72.2 us,  3.6 sy,  0.0 ni, 24.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 75.3 us,  2.0 sy,  0.0 ni, 22.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 74.8 us,  2.3 sy,  0.0 ni, 22.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 72.1 us,  3.3 sy,  0.0 ni, 24.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 73.1 us,  9.3 sy,  0.0 ni, 17.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 74.2 us,  2.0 sy,  0.0 ni, 23.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 73.7 us,  2.7 sy,  0.0 ni, 23.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 74.9 us,  2.3 sy,  0.0 ni, 22.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 74.1 us,  3.0 sy,  0.0 ni, 22.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 : 76.5 us,  1.3 sy,  0.0 ni, 22.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 75.2 us,  2.3 sy,  0.0 ni, 22.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 : 72.8 us,  3.3 sy,  0.0 ni, 23.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 74.1 us,  4.3 sy,  0.0 ni, 21.3 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  12590446+total, 12531040+used,   594064 free,   897984 buffers
KiB Swap:        0 total,        0 used,        0 free. 11976936+cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 19915 ubuntu    20   0 4611148 1.278g  20028 S  1231  1.1 538:04.36 mapd/bin/mapd_server --data /data/prod/mapd-storage/data
 19910 ubuntu    20   0   63008   4520   3340 S   0.0  0.0   0:00.00 bfboost client mapd/bin/mapd_server --data /data/prod/mapd-storage/+
...

Once the import is complete 894 GB of capacity is now being used on the RAID array.

$ df -H

Filesystem      Size  Used Avail Use% Mounted on
udev             65G   13k   65G   1% /dev
tmpfs            13G  844k   13G   1% /run
/dev/xvda1      529G  6.4G  501G   2% /
none            4.1k     0  4.1k   0% /sys/fs/cgroup
none            5.3M     0  5.3M   0% /run/lock
none             65G   13k   65G   1% /run/shm
none            105M     0  105M   0% /run/user
/dev/md0        1.6T  894G  602G  60% /data

Benchmarking MapD

The times quoted below are the lowest query times seen during a series of runs.

$ mapd/bin/mapdql \
    mapd \
    -u $MAPD_USERNAME \
    -p $MAPD_PASSWORD

\timing on

The following completed in 0.028 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

The following completed in 0.2 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

The following completed in 0.237 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       count(*)
FROM trips
GROUP BY passenger_count,
         pickup_year;

The following completed in 0.578 seconds.

SELECT passenger_count,
       extract(year from pickup_datetime) AS pickup_year,
       cast(trip_distance as int) AS distance,
       count(*) AS the_count
FROM trips
GROUP BY passenger_count,
         pickup_year,
         distance
ORDER BY pickup_year,
         the_count desc;

It's amazing to see that $14.98 / hour's worth of EC2 instances can query 1.1 Billion records 56x faster than any CPU-based solution I've run so far. Going forward, the ability of MapD and Bitfusion to combine GPUs to create clusters of on-demand supercomputers, will have a profound impact on the size of the potential problems that can now be tackled. The future of BI is going to be on GPUs.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.