Recently I wrote a blog post on using Presto and Amazon EMR to query the metadata of 1.1 billion taxi trips made in New York City over the space of six years. In that post I used a 5-node cluster and got back some good performance numbers.
I think the best feature demonstrated was that you can separate your data from your compute resources using Amazon EMR and S3. Since that post I've been curious to see whether simply increasing the EC2 node count would have a significant impact on query times.
In this post I'll launch a 50-node Presto cluster on Amazon EMR and see how well it performs.
AWS CLI Up and Running
All the following commands were run on a fresh install of Ubuntu 14.04.3.
To start, I'll install the AWS CLI tool and a few dependencies it needs to run.
$ sudo apt update
$ sudo apt install \
python-pip \
python-virtualenv
$ virtualenv amazon
$ source amazon/bin/activate
$ pip install awscli
I'll then enter my AWS credentials.
$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY
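To make sure the credentials have been picked up I'll ask EMR for a list of existing clusters. An empty list is fine; an authentication error means something needs fixing. The region below is the one I'll be launching the cluster into, so adjust it if you're working elsewhere.
$ aws emr list-clusters --region eu-west-1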
104 GB of Data on S3
The data I'm working with was originally put together from its source datasets by scripts written by Todd W Schneider. In my Billion Taxi Rides in Redshift blog post I exported the data from PostgreSQL into denormalised CSV files and compressed them into 56 gzip files.
In my original Presto on EMR blog post I converted the CSV data into ORC format and stored them on S3. I'll be using those same ORC files on S3 in this blog post.
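If you want to double-check the ORC files are where you expect them to be, the AWS CLI can list them. The <s3_bucket> below is a placeholder for your own bucket name, the same one referenced in the CREATE TABLE statement further down.
$ aws s3 ls s3://<s3_bucket>/orc/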
50-node EMR Cluster Up and Running
If you haven't launched an EMR cluster before then I suggest doing so via the AWS Console. Make sure you turn on advanced options so you can pick spot instances instead of on-demand instances for your core and task nodes. When you use the EMR wizard it'll create the roles and security groups needed for your cluster nodes to communicate with one another.
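If you've been through the wizard before then the IAM roles should already exist. If not, and you'd rather stay on the command line, the CLI can create the default EMR roles; as far as I'm aware this creates the same EMR_DefaultRole and EMR_EC2_DefaultRole referenced in the command below.
$ aws emr create-default-roles --region eu-west-1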
Before you run the command below, generate a key pair called "emr". Download the emr.pem file and store it in the ~/.ssh/ folder on your system. This is the private key that lets you SSH into the Hadoop cluster nodes.
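The key pair can also be generated with the CLI rather than the EC2 console; something like the following should do it. The chmod keeps SSH from refusing to use a key with overly permissive permissions.
$ aws ec2 create-key-pair \
    --key-name emr \
    --query 'KeyMaterial' \
    --output text \
    --region eu-west-1 > ~/.ssh/emr.pem
$ chmod 600 ~/.ssh/emr.pem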
Below you will need to change a few items before executing the create-cluster command:
- If the key name you generated was called anything other than emr then change the KeyName attribute.
- The InstanceProfile is set to the name the wizard automatically generated. If you haven't used the wizard before then make sure this value matches your role.
- The --region parameter is set to eu-west-1 and the AvailabilityZone to eu-west-1a; change these if another region is more appropriate for you. The region should match the one where you've stored the denormalised data on S3.
- The log-uri parameter's bucket name needs to be changed to a bucket name of your own that hasn't already been taken.
- The BidPrice amounts are in US Dollars and were appropriate at the time I was bidding for the instances. Please check the current spot prices for m3.xlarge instances in your region and adjust these accordingly. There is a pricing history button on the EC2 Spot Requests page, and there's a CLI example after this list.
- The master node needs to stay up if the cluster is to survive, so it's an on-demand instance rather than a spot instance. It's also an m3.xlarge, which is overkill for the tasks it performs; consider using a less expensive and less powerful EC2 instance type.
- As of this writing, version 4.4.0 is the latest release of EMR. If a newer version has been released I suggest using it instead, as each release tends to bring a lot of improvements.
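Spot prices can also be checked from the command line. The following should return recent prices for m3.xlarge instances in the availability zone I'm launching into; adjust the zone and region to match your own.
$ aws ec2 describe-spot-price-history \
    --instance-types m3.xlarge \
    --product-descriptions "Linux/UNIX" \
    --availability-zone eu-west-1a \
    --max-results 5 \
    --region eu-west-1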
$ aws emr create-cluster \
--applications \
Name=Hadoop \
Name=Hive \
Name=Presto-Sandbox \
--ec2-attributes '{
"KeyName": "emr",
"InstanceProfile": "EMR_EC2_DefaultRole",
"AvailabilityZone": "eu-west-1a",
"EmrManagedSlaveSecurityGroup": "sg-89cd3eff",
"EmrManagedMasterSecurityGroup": "sg-d4cc3fa2"
}' \
--service-role EMR_DefaultRole \
--release-label emr-4.4.0 \
--log-uri 's3n://aws-logs-591231097547-eu-west-1/elasticmapreduce/' \
--name 'A Billion Taxi Trips' \
--instance-groups '[{
"InstanceCount": 2,
"BidPrice": "0.048",
"InstanceGroupType": "CORE",
"InstanceType": "m3.xlarge",
"Name": "Core instance group - 2"
}, {
"InstanceCount": 47,
"BidPrice": "0.048",
"InstanceGroupType": "TASK",
"InstanceType": "m3.xlarge",
"Name": "Task instance group - 3"
}, {
"InstanceCount": 1,
"InstanceGroupType": "MASTER",
"InstanceType": "m3.xlarge",
"Name": "Master instance group - 1"
}]' \
--region eu-west-1
The master and core nodes were provisioned and bootstrapped in a few minutes and within 25 minutes of issuing this command all the task nodes were up and running.
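If you want to follow the provisioning progress without the console, the cluster ID printed out by create-cluster can be polled for its state. The j-XXXXXXXXXXXXX below is a placeholder for whichever ID you get back.
$ aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.Status.State' \
    --region eu-west-1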
Note that, depending on which tools in the Hadoop stack you use, the core and task nodes will play varying roles. With Presto, however, the master node acts as the coordinator and all other nodes act as workers.
AWS Costs
I'm using 49 spot instances that cost at most $0.048 / hour each and an on-demand EC2 instance for my master node which runs $0.293 / hour, bringing my total cost to at most $2.645 / hour.
This is 5.45x the cost of the 5-node cluster in my original Presto on EMR blog post.
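For what it's worth, the arithmetic behind those figures is simple enough to sanity check. The $0.485 / hour figure for the original cluster assumes 4 spot instances at the same bid plus the same on-demand master, which is what the 5.45x ratio implies.
$ python -c 'print 49 * 0.048 + 0.293'
2.645
$ python -c 'print round(2.645 / 0.485, 2)'
5.45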
Accessing the Master Node
Change the EC2 hostname below to the hostname of your master node.
$ ssh -i ~/.ssh/emr.pem \
hadoop@ec2-54-78-244-152.eu-west-1.compute.amazonaws.com
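The master node's public DNS name can also be pulled out of describe-cluster if you don't want to copy it from the console. Again, the cluster ID is a placeholder.
$ aws emr describe-cluster \
    --cluster-id j-XXXXXXXXXXXXX \
    --query 'Cluster.MasterPublicDnsName' \
    --output text \
    --region eu-west-1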
Creating Tables in the Hive Metastore
Hive is used to store the metadata describing the data files on S3 and their schema. Presto relies on Hive's Metastore for this information.
I have the 1.1 billion records stored as ORC files on S3. If you are re-running these steps, change the LOCATION parameter at the end of the CREATE TABLE statement to point at your own bucket on S3.
$ hive
CREATE EXTERNAL TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc
LOCATION 's3://<s3_bucket>/orc/';
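Before dropping out of Hive it's worth a quick check that the table registered properly and the schema looks the way it should.
SHOW TABLES;
DESCRIBE trips_orc;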
Benchmarking Queries in Presto
The following queries ran on Presto 0.136, the version of Presto that EMR 4.4.0 ships with.
There are two key metrics I'm looking at in this benchmark: how much faster these queries run compared to the original cluster and how cost efficient they are. I'm hoping to see the queries run at least 5.45x faster so they're in line with the additional cost.
$ presto-cli \
--catalog hive \
--schema default
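Before running the benchmark queries I like to confirm that every worker has registered with the coordinator. Presto's system catalog exposes the node list; with this cluster I'd expect the coordinator and the 49 workers to show up.
SELECT count(*) FROM system.runtime.nodes;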
The following completed in 43 seconds (1.74x faster, 3x less cost efficient).
SELECT cab_type,
count(*)
FROM trips_orc
GROUP BY cab_type;
Query 20160408_102901_00003_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:43 [1.11B rows, 48.8GB] [26.1M rows/s, 1.14GB/s]
The following completed in 45 seconds (1.4x faster, 3.9x less cost efficient).
SELECT passenger_count,
avg(total_amount)
FROM trips_orc
GROUP BY passenger_count;
Query 20160408_103140_00005_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:45 [1.11B rows, 48.8GB] [24.6M rows/s, 1.08GB/s]
The following completed in 27 seconds (3x faster, 1.8x less cost efficient).
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime);
Query 20160408_103248_00006_hbwkj, RUNNING, 49 nodes
Splits: 753 total, 750 done (99.60%)
0:27 [1.11B rows, 48.8GB] [41.4M rows/s, 1.81GB/s]
The following completed in 44 seconds (2.4x faster, 2.2x less cost efficient).
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;
Query 20160408_103335_00007_hbwkj, FINISHED, 49 nodes
Splits: 753 total, 753 done (100.00%)
0:44 [1.11B rows, 48.8GB] [25.1M rows/s, 1.1GB/s]
As you can see there is a performance increase, but none of the queries ran more than 3x faster despite all the extra nodes. I'm investigating what can be done to bring performance in line with the cost increase.
If anyone has any insights I'd be grateful if you could drop me a line on Twitter or via email (my email address is in the top left corner of my CV).