Recently I wrote a blog post on using Presto and Amazon EMR to query the metadata of 1.1 billion taxi trips made in New York City over the space of six years. In that post I used a 5-node cluster and got back some good performance numbers.
I think the best feature demonstrated was that you can separate your data from your compute resources using Amazon EMR and S3. Since that post I've been curious to see whether simply increasing the EC2 node count would have a significant impact on query times.
In this post I'll launch a 50-node Presto cluster on Amazon EMR and see how well it performs.
AWS CLI Up and Running
All the following commands were run on a fresh install of Ubuntu 14.04.3.
To start, I'll install the AWS CLI tool and a few dependencies it needs to run.
$ sudo apt update
$ sudo apt install \
python-pip \
python-virtualenv
$ virtualenv amazon
$ source amazon/bin/activate
$ pip install awscli
I'll then enter my AWS credentials.
$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY
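To make sure the credentials have been picked up I'll ask EMR for a list of existing clusters. An empty list is fine; an authentication error means something needs fixing. The region below is the one I'll be launching the cluster into, so adjust it if you're working elsewhere.
$ aws emr list-clusters --region eu-west-1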
104 GB of Data on S3
The data I'm working with was originally put together from its source datasets by scripts written by Todd W Schneider. In my Billion Taxi Rides in Redshift blog post I exported the data from PostgreSQL into denormalised CSV files and compressed them into 56 gzip files.
In my original Presto on EMR blog post I converted the CSV data into ORC format and stored them on S3. I'll be using those same ORC files on S3 in this blog post.
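If you want to double-check the ORC files are where you expect them to be, the AWS CLI can list them. The <s3_bucket> below is a placeholder for your own bucket name, the same one referenced in the CREATE TABLE statement further down.
$ aws s3 ls s3://<s3_bucket>/orc/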
50-node EMR Cluster Up and Running
If you haven't launched an EMR cluster before then I suggest doing so via the AWS Console. Make sure you turn on advanced options so you can pick spot instances instead of on-demand instances for your core and task nodes. When you use the EMR wizard it'll create the roles and security groups needed for your cluster nodes to communicate with one another.
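If you've been through the wizard before then the IAM roles should already exist. If not, and you'd rather stay on the command line, the CLI can create the default EMR roles; as far as I'm aware this creates the same EMR_DefaultRole and EMR_EC2_DefaultRole referenced in the command below.
$ aws emr create-default-roles --region eu-west-1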
Before you run the command below, generate a key pair called "emr". Download the emr.pem file and store it in the ~/.ssh/ folder on your system. This is the private key that lets you SSH into the Hadoop cluster nodes.
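The key pair can also be generated with the CLI rather than the EC2 console; something like the following should do it. The chmod keeps SSH from refusing to use a key with overly permissive permissions.
$ aws ec2 create-key-pair \
    --key-name emr \
    --query 'KeyMaterial' \
    --output text \
    --region eu-west-1 > ~/.ssh/emr.pem
$ chmod 600 ~/.ssh/emr.pem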
Below you will need to change a few items before executing the create-cluster command:
- If the key name you generated was called anything other than emr then change the KeyName attribute.
- The InstanceProfile is set to the name the wizard automatically generated. If you haven't used the wizard before then make sure this value matches your role.
- The --region parameter is set to eu-west-1 and the AvailabilityZone to eu-west-1a; change these if another region is more appropriate for you. The region should match the one where you've stored the denormalised data on S3.
- The log-uri parameter's bucket name needs to be changed to a bucket name of your own that hasn't already been taken.
- The BidPrice amounts are in US Dollars and were appropriate at the time I was bidding for the instances. Please check the current spot prices for m3.xlarge instances in your region and adjust these accordingly. There is a pricing history button on the EC2 Spot Requests page, and there's a CLI example after this list.
- The master node needs to stay up if the cluster is to survive, so it's an on-demand instance rather than a spot instance. It's also an m3.xlarge, which is overkill for the tasks it performs; consider using a less expensive and less powerful EC2 instance type.
- As of this writing, version 4.4.0 is the latest release of EMR. If a newer version has been released I suggest using it instead, as each release tends to bring a lot of improvements.
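Spot prices can also be checked from the command line. The following should return recent prices for m3.xlarge instances in the availability zone I'm launching into; adjust the zone and region to match your own.
$ aws ec2 describe-spot-price-history \
    --instance-types m3.xlarge \
    --product-descriptions "Linux/UNIX" \
    --availability-zone eu-west-1a \
    --max-results 5 \
    --region eu-west-1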
$ aws emr create-cluster \
--applications \
Name=Hadoop \
Name=Hive \
Name=Presto-Sandbox \
--ec2-attributes '{
"KeyName": "emr",
"InstanceProfile": "EMR_EC2_DefaultRole",
"AvailabilityZone": "eu-west-1a",
"EmrManagedSlaveSecurityGroup": "sg-89cd3eff",
"EmrManagedMasterSecurityGroup": "sg-d4cc3fa2"
}' \
--service-role EMR_DefaultRole \
--release-label emr-4.4.0 \
--log-uri 's3n://aws-logs-591231097547-eu-west-1/elasticmapreduce/' \
--name 'A Billion Taxi Trips' \
--instance-groups '[{
"InstanceCount": 2,
"BidPrice": "0.048",
"InstanceGroupType": "CORE",
"InstanceType": "m3.xlarge",
"Name": "Core instance group - 2"
}, {
"InstanceCount": 47,
"BidPrice": "0.048",
"InstanceGroupType": "TASK",
"InstanceType": "m3.xlarge",
"Name": "Task instance group - 3"
}, {
"InstanceCount": 1,
"InstanceGroupType": "MASTER",
"InstanceType": "m3.xlarge",
"Name": "Master instance group - 1"
}]' \
--region eu-west-1
The master and core nodes were provisioned and bootstrapped in a few minutes and within 25 minutes of issuing this command all the task nodes were up and running.
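If you want to follow the provisioning progress without the console, the cluster ID printed out by create-cluster can be polled for its state. The j-XXXXXXXXXXXXX below is a placeholder for whichever ID you get back.
$ aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' \
    --region eu-west-1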
Note that, depending on which tools in the Hadoop stack you use, the core and task nodes will play varying roles. With Presto, however, the master node acts as the coordinator and all other nodes act as workers.
AWS Costs
I'm using 49 spot instances that cost at most $0.048 / hour each and an on-demand EC2 instance for my master node which runs $0.293 / hour, bringing my total cost to at most $2.645 / hour.
This is 5.45x the cost of the 5-node cluster in my original Presto on EMR blog post.
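For what it's worth, the arithmetic behind those figures is simple enough to sanity check. The $0.485 / hour figure for the original cluster assumes 4 spot instances at the same bid plus the same on-demand master, which is what the 5.45x ratio implies.
$ python -c 'print 49 * 0.048 + 0.293'
2.645
$ python -c 'print round(2.645 / 0.485, 2)'
5.45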
Accessing the Master Node
Change the EC2 hostname below to the hostname of your master node.
$ ssh -i ~/.ssh/emr.pem \
hadoop@ec2-54-78-244-152.eu-west-1.compute.amazonaws.com
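The master node's public DNS name can also be pulled out of describe-cluster if you don't want to copy it from the console. Again, the cluster ID is a placeholder.
$ aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.MasterPublicDnsName' \
    --output text \
    --region eu-west-1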
Creating Tables in the Hive Metastore
Hive is used to store the metadata describing the data files on S3 and their schema. Presto relies on Hive's Metastore for this information.
I have the 1.1 billion records stored as ORC files on S3. If you are re-running these steps, change the LOCATION parameter at the end of the CREATE TABLE statement to point at your own bucket on S3.
$ hive
CREATE EXTERNAL TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc
LOCATION 's3://<s3_bucket>/orc/';
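Before dropping out of Hive it's worth a quick check that the table registered properly and the schema looks the way it should.
SHOW TABLES;
DESCRIBE trips_orc;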
Benchmarking Queries in Presto
The following queries ran on Presto 0.136, the version of Presto that EMR 4.4.0 ships with.
There are two key metrics I'm looking at in this benchmark: how much faster these queries run compared to the original cluster and how cost efficient they are. I'm hoping to see the queries run at least 5.45x faster so they're in line with the additional cost.
$ presto-cli \
--catalog hive \
--schema default
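Before running the benchmark queries I like to confirm that every worker has registered with the coordinator. Presto's system catalog exposes the node list; with this cluster I'd expect the coordinator and the 49 workers to show up.
SELECT count(*) FROM system.runtime.nodes;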
The following completed in 43 seconds (1.74x faster, 3x less cost efficient).
SELECT cab_type,
count(*)
FROM trips_orc
GROUP BY cab_type;
Query 20160408_102901_00003_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:43 [1.11B rows, 48.8GB] [26.1M rows/s, 1.14GB/s]
The following completed in 45 seconds (1.4x faster, 3.9x less cost efficient).
SELECT passenger_count,
avg(total_amount)
FROM trips_orc
GROUP BY passenger_count;
Query 20160408_103140_00005_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:45 [1.11B rows, 48.8GB] [24.6M rows/s, 1.08GB/s]
The following completed in 27 seconds (3x faster, 1.8x less cost efficient).
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime);
Query 20160408_103248_00006_hbwkj, RUNNING, 49 nodes
Splits: 753 total, 750 done (99.60%)
0:27 [1.11B rows, 48.8GB] [41.4M rows/s, 1.81GB/s]
The following completed in 44 seconds (2.4x faster, 2.2x less cost efficient).
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;
Query 20160408_103335_00007_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:44 [1.11B rows, 48.8GB] [25.1M rows/s, 1.1GB/s]
As you can see there is a performance increase, but none of the queries ran more than 3x faster despite all the extra nodes. I'm investigating what can be done to bring performance in line with the cost increase.
If anyone has any insights I'd be grateful if you could drop me a line on Twitter or via email (my email address is in the top left corner of my CV).