Update: MapD rebranded as OmniSci in 2018.
On April 26th, 2017 MapD 3.0 was released. One of the big announcements with this release is that support for clustering across multiple machines is now native to the software.
Most motherboards support more than one graphics card but at some point you'll run into a PCIe bus bottleneck. To add to that, Nvidia's drivers limit the number of GPUs visible on any one machine to 64. Being able to use multiple machines helps get around both of these limitations.
When scaling out MapD in the past I've used GPU wrapping software to cluster several graphics cards on separate machines together into one "virtual" graphics card. This in itself is still useful for other GPU-accelerated applications but with MapD 3.0's clustering support there is now the added advantage of being able to use not only all of the GPUs but all of the CPUs, memory and disk drives across the cluster as well. This allows for a more horizontally scalable deployment.
The three major cloud providers (AWS, Google Cloud and Azure) all now include Nvidia Tesla K80-powered instances in their offerings. The Nvidia K80 graphics card is based on the Kepler micro-architecture rather than Maxwell or Pascal but nonetheless it sports two GK210 GPUs with a combined 4,992 CUDA cores, a 2 x 384-bit-wide memory bus and 24 GB of GDDR5 RAM. The K80 can access its memory very quickly, with an aggregated throughput of 480 GB/s, whereas a conventional CPU might only be able to access DDR4 RAM at 90 GB/s. The K80 might not be Nvidia's latest Tesla-series offering but it's still a beast.
In this benchmark I'll see how well 8 Tesla K80 cards spread across two EC2 instances perform when querying 1.1 billion taxi trips. I'll be using the same dataset I've used to benchmark Amazon Athena, BigQuery, ClickHouse, Elasticsearch, EMR, kdb+/q, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks.
A GPU-Powered AWS EC2 Cluster
I'll be using two p2.8xlarge EC2 instances in Amazon's Oregon region (us-west-2) for this benchmark. Each of these machines comes with 4 Nvidia K80s, 32 virtual CPUs reported as Intel Xeon E5-2686 v4 CPUs clocked at 2.30GHz, 488 GB of RAM and a 10 Gbps network connection.
When launched on-demand each instance costs $7.20 / hour. I did check the spot price history while putting this blog post together and for the last week the most common price was $1.80 / hour in the us-west-2b availability zone.
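For reference, the spot price history itself can be pulled with the AWS CLI. This is roughly the command I'd use to check it; the filters shown are just the ones relevant to this setup.

$ aws ec2 describe-spot-price-history \
    --region us-west-2 \
    --availability-zone us-west-2b \
    --instance-types p2.8xlarge \
    --product-descriptions "Linux/UNIX"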
Below is the output from Nvidia's System Management Interface on a p2.8xlarge instance. As you can see it lists the 8 GPUs across the 4 cards individually.
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   62C    P0    70W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   54C    P0    82W / 149W |   4163MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   62C    P0    69W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   51C    P0    76W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   63C    P0    65W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   54C    P0    83W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   63C    P0    71W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   55C    P0    88W / 149W |   2115MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
The first instance launched in this EC2 cluster will be both a MapD leaf and aggregator node. The second instance will be both a MapD leaf and string dictionary node. Both machines belong to a security group that allows them to communicate with one another on TCP port 19091 for the aggregator server, TCP port 9091 for leaf communication and TCP port 10301 for the string dictionary server.
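For anyone reproducing this, those ingress rules can be created with the AWS CLI along these lines. The security group ID below is a placeholder; pointing --source-group back at the same group keeps the ports closed to everything outside the cluster.

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9091  --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 10301 --source-group sg-0123456789abcdef0
$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 19091 --source-group sg-0123456789abcdef0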
I'll be using the AMI image ami-4836a428 / amzn-ami-hvm-2017.03.0.20170417-x86_64-gp2 for both machines.
Each instance has its own 1.1 TB EBS volume.
$ df -H
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        258G   62k  258G   1% /dev
tmpfs           258G     0  258G   0% /dev/shm
/dev/xvda1      1.1T  705G  352G  67% /
Downloading 1.1 Billion Taxi Journeys
On each EC2 instance I'll set the AWS CLI tool to use 100 concurrent requests so I can better saturate the network connection when downloading the taxi trips dataset from S3.
$ aws configure set \
default.s3.max_concurrent_requests \
100
I'll then download the 104 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post onto each instance. This data sits across 56 GZIP files and decompresses into around 500 GB of raw CSV data.
$ mkdir ~/csvData
$ cd ~/csvData/
$ aws s3 sync s3://<s3_bucket>/csv/ ./
$ gunzip trips_x*.csv.gz
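As a quick sanity check that all 56 files made it down and decompressed, the files can be counted. This is just a shell one-liner rather than part of the original workflow.

$ ls trips_x*.csv | wc -l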
MapD 3.0 Up & Running
Everything below, unless otherwise noted, was run on both EC2 instances.
I'm going to install Nvidia's 375.51 driver along with two of its requirements: GCC and the kernel development package, which provides the headers needed to compile kernel modules.
$ sudo yum install gcc kernel-devel-`uname -r`
$ curl -O http://us.download.nvidia.com/XFree86/Linux-x86_64/375.51/NVIDIA-Linux-x86_64-375.51.run
$ sudo /bin/bash ./NVIDIA-Linux-x86_64-375.51.run
As per Amazon's recommendations, I'm going to switch the Nvidia driver into persistent mode, turn off auto-boost and set the GPUs' memory clock speed to 2,505 MHz and their compute clock speed to 875 MHz.
$ sudo nvidia-smi -pm 1
$ sudo nvidia-smi --auto-boost-default=0
$ sudo nvidia-smi -ac 2505,875
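The applied clock settings can be double-checked with nvidia-smi's query mode.

$ nvidia-smi -q -d CLOCK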
MapD is commercial software so I cannot disclose the full URL I've downloaded the self-extracting archive from. Nonetheless, these are the steps needed to install MapD. If you're interested in your own copy of MapD, info@mapd.com should be able to help.
$ mkdir -p ~/prod/installs
$ cd ~/prod/installs
$ curl -O <server>/mapd2-master-Linux-x86_64.sh
$ chmod +x mapd2-master-Linux-x86_64.sh
$ ./mapd2-master-Linux-x86_64.sh
$ ln -s ~/prod/installs/mapd-3.0.0dev-20170419-b8623c9-Linux-x86_64/ \
~/prod/mapd
I'll then add MapD to the PATH environment variable and the Java installation to the library path.
$ export PATH=/home/ec2-user/prod/mapd/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/lib/jvm/jre/lib/amd64/server:$LD_LIBRARY_PATH
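Those exports only last for the current shell session. If you want them to survive new SSH logins, appending them to ~/.bashrc is one option; as far as I know the MapD installer doesn't do this for you.

$ echo 'export PATH=/home/ec2-user/prod/mapd/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/lib/jvm/jre/lib/amd64/server:$LD_LIBRARY_PATH' >> ~/.bashrc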
On the first EC2 instance I'll create some folders and set up the system's cluster configuration file with the following contents.
$ mkdir -p ~/prod/mapd-storage/{0,1}/data
$ vi ~/prod/mapd-storage/cluster.conf
[
{
"host": "node1",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 10301,
"role": "string"
}
]
I'll set the leaf node configuration to the following.
$ vi ~/prod/mapd-storage/0/mapd.conf
port = 9091
http-port = 9090
data = "/home/ec2-user/prod/mapd-storage/0/data"
read-only = false
quiet = false
string-servers = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 9092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll set the aggregator node configuration to the following.
$ vi ~/prod/mapd-storage/1/mapd.conf
port = 19091
http-port = 19090
data = "/home/ec2-user/prod/mapd-storage/1/data"
read-only = false
quiet = false
cluster = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 19092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll then initialise the leaf and aggregator databases and launch their respective services.
$ cd ~/prod/mapd-storage/0
$ initdb data
$ mapd_server --config mapd.conf
$ cd ~/prod/mapd-storage/1
$ initdb data
$ mapd_server --config mapd.conf
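Note that mapd_server runs in the foreground, so launching both servers from one SSH session needs either separate terminals or backgrounding. Something along these lines would also work; the log file names are just my own choice rather than anything MapD prescribes.

$ cd ~/prod/mapd-storage/0
$ nohup mapd_server --config mapd.conf > leaf.log 2>&1 &
$ cd ~/prod/mapd-storage/1
$ nohup mapd_server --config mapd.conf > aggregator.log 2>&1 &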
On the second EC2 instance I'll set up the string dictionary server configuration.
$ mkdir -p ~/prod/mapd-storage/{strings,data}
$ vi ~/prod/mapd-storage/mapd-sds.conf
port = 10301
path = "/home/ec2-user/prod/mapd-storage/strings"
I'll set the leaf node configuration to the following.
$ vi ~/prod/mapd-storage/mapd.conf
port = 9091
http-port = 9090
data = "/home/ec2-user/prod/mapd-storage/data"
read-only = false
quiet = false
string-servers = "/home/ec2-user/prod/mapd-storage/cluster.conf"
[web]
port = 9092
frontend = "/home/ec2-user/prod/mapd/frontend"
I'll set the second node's cluster.conf file to the following.
$ vi ~/prod/mapd-storage/cluster.conf
[
{
"host": "node1",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 9091,
"role": "dbleaf"
},
{
"host": "node2",
"port": 10301,
"role": "string"
}
]
I'll then initialise the leaf node database and launch the two services.
$ cd ~/prod/mapd-storage
$ initdb data
$ mapd_server --config mapd.conf
$ StringDictionaryServer --config mapd-sds.conf
Importing 1.1 Billion Trips Into MapD
The following is the schema for my taxi trips table. The fragment size, which controls how many records go into each fragment (the unit of data MapD distributes across the GPUs), will be set to 85 million for this exercise.
$ vi create_trips_table.sql
CREATE TABLE trips (
trip_id INTEGER,
vendor_id VARCHAR(3) ENCODING DICT,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1) ENCODING DICT,
rate_code_id SMALLINT,
pickup_longitude DECIMAL(14,2),
pickup_latitude DECIMAL(14,2),
dropoff_longitude DECIMAL(14,2),
dropoff_latitude DECIMAL(14,2),
passenger_count SMALLINT,
trip_distance DECIMAL(14,2),
fare_amount DECIMAL(14,2),
extra DECIMAL(14,2),
mta_tax DECIMAL(14,2),
tip_amount DECIMAL(14,2),
tolls_amount DECIMAL(14,2),
ehail_fee DECIMAL(14,2),
improvement_surcharge DECIMAL(14,2),
total_amount DECIMAL(14,2),
payment_type VARCHAR(3) ENCODING DICT,
trip_type SMALLINT,
pickup VARCHAR(50) ENCODING DICT,
dropoff VARCHAR(50) ENCODING DICT,
cab_type VARCHAR(6) ENCODING DICT,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10) ENCODING DICT,
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13) ENCODING DICT,
pickup_ct2010 VARCHAR(6) ENCODING DICT,
pickup_boroct2010 VARCHAR(7) ENCODING DICT,
pickup_cdeligibil VARCHAR(1) ENCODING DICT,
pickup_ntacode VARCHAR(4) ENCODING DICT,
pickup_ntaname VARCHAR(56) ENCODING DICT,
pickup_puma VARCHAR(4) ENCODING DICT,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10) ENCODING DICT,
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13) ENCODING DICT,
dropoff_ct2010 VARCHAR(6) ENCODING DICT,
dropoff_boroct2010 VARCHAR(7) ENCODING DICT,
dropoff_cdeligibil VARCHAR(1) ENCODING DICT,
dropoff_ntacode VARCHAR(4) ENCODING DICT,
dropoff_ntaname VARCHAR(56) ENCODING DICT,
dropoff_puma VARCHAR(4) ENCODING DICT
) WITH (FRAGMENT_SIZE=85000000);
I'll create an environment variable with my credentials for MapD.
$ read MAPD_PASSWORD
$ export MAPD_PASSWORD
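If you'd rather the password wasn't echoed to the terminal while you type it, bash's read command has a silent flag that does the same job.

$ read -s MAPD_PASSWORD
$ export MAPD_PASSWORD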
I'll then execute the above SQL script through the aggregator so that the table is created on both leaf nodes.
$ mapdql -p $MAPD_PASSWORD \
--port 19091 \
< create_trips_table.sql
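To confirm the table exists before loading anything, mapdql's meta-commands can be used; if memory serves, \t lists the tables and \d describes one.

$ mapdql -p $MAPD_PASSWORD --port 19091
mapdql> \t
mapdql> \d trips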
I'll be cutting the dataset roughly in half, with each node holding around 50% of the records. Each node has its own version of a load.sql script that loads a different portion of the 56-file dataset.
This is the script for the first node.
$ cat load1.sql
COPY trips FROM '/home/ec2-user/csvData/trips_xa*.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xca.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcb.csv' WITH (header='false');
This is the script for the second node.
$ cat load2.sql
COPY trips FROM '/home/ec2-user/csvData/trips_xb*.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcc.csv' WITH (header='false');
COPY trips FROM '/home/ec2-user/csvData/trips_xcd.csv' WITH (header='false');
To avoid going through the aggregator and having it become a bottleneck, the following commands were issued from the first server but each load runs independently against an individual leaf node.
$ time mapdql -p $MAPD_PASSWORD --server 127.0.0.1 < load1.sql &
$ time mapdql -p $MAPD_PASSWORD --server 172.31.19.152 < load2.sql &
The first node loaded its dataset in 21 minutes and 58 seconds and the second node loaded its dataset in 26 minutes and 24 seconds.
The above load times are where the clustered approach really begins to shine. Last June I did a benchmark with 8 Nvidia Tesla K80s on a single machine and it took 51 minutes and 47 seconds to import this dataset. It's great to see these two machines are able to break apart this job and complete it in roughly half the time.
With the data loaded, I can connect via the aggregator and see that all 1.1 billion records are visible.
$ mapdql -p $MAPD_PASSWORD \
--port 19091
mapdql> select count(*) from trips;
EXPR$0
1113653018
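Out of curiosity, each leaf can also be queried directly on its default port of 9091, bypassing the aggregator, to see how the records were split; each should report roughly half of the total. The addresses below are the same ones used during the load: 127.0.0.1 for the local leaf and 172.31.19.152 for the second node.

$ echo "SELECT COUNT(*) FROM trips;" | mapdql -p $MAPD_PASSWORD --server 127.0.0.1
$ echo "SELECT COUNT(*) FROM trips;" | mapdql -p $MAPD_PASSWORD --server 172.31.19.152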
Benchmarking MapD 3.0
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
$ mapd/bin/mapdql \
mapd \
-u $MAPD_USERNAME \
-p $MAPD_PASSWORD
\timing on
The following completed in 0.034 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 0.061 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 0.178 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
count(*)
FROM trips
GROUP BY passenger_count,
pickup_year;
The following completed in 0.498 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
cast(trip_distance as int) AS distance,
count(*) AS the_count
FROM trips
GROUP BY passenger_count,
pickup_year,
distance
ORDER BY pickup_year,
the_count desc;
Although the above results place this benchmark in 4th place on my benchmarks recap board, it's worth mentioning that the recap board is sorted by the fastest query 1 time. Query 2's time outperformed all but one other benchmark and query 4's time is the fastest ever. Query 4 seems to scale linearly with the number of nodes in the cluster: I've run this query in the past on a single machine with the same number of K80s and execution took almost twice as long.
When I first heard MapD would natively support clustering I wondered what impact network overhead would have on their implementation. It's great to see that MapD still optimises data locality extremely well and that network chatter doesn't seem to have a noticeable impact on query times.
If I had launched the cluster using spot instances I would have spent $3.60 / hour. I'm pretty sure this is the best price for performance I've seen in any of my benchmarks.
This benchmark ran on what is now a two-generations-old micro-architecture from Nvidia. I can't imagine what kind of numbers would be seen on their Pascal-powered P100 GPUs, nor what the times would look like if I'd used more than two EC2 instances. The performance results are very impressive to say the least but what's more impressive is that I know there is a lot more performance yet to be unlocked.