A Billion Taxi Rides: AWS S3 versus HDFS

In 2014 Saggi Neumann published an excellent article comparing the merits of AWS S3 and HDFS. He pointed out that he'd seen better throughput with HDFS on ephemeral storage than with S3. In this article I'll see how much faster queries executed via Presto on a small EMR cluster are against HDFS-stored data versus S3-stored data.

AWS CLI Up and Running

All the following commands were run on a fresh install of Ubuntu 14.04.3.

To start, I'll install the AWS CLI tool and a few dependencies it needs to run.

$ sudo apt update
$ sudo apt install \
    python-pip \
    python-virtualenv

$ virtualenv amazon
$ source amazon/bin/activate
$ pip install awscli

I'll then enter my AWS credentials.

$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY

5-node EMR Cluster Up and Running

I'll be launching a 5-node cluster of m3.xlarge instances using the 4.5.0 AMI. This AMI comes with Presto 1.4.0 and Hive 1.0.0. For a detailed run down on the parameters provided below please see these notes.

$ aws emr create-cluster \
      --applications \
        Name=Hadoop \
        Name=Hive \
        Name=Presto-Sandbox \
      --ec2-attributes '{
          "KeyName": "emr",
          "InstanceProfile": "EMR_EC2_DefaultRole",
          "AvailabilityZone": "eu-west-1a",
          "EmrManagedSlaveSecurityGroup": "sg-89cd3eff",
          "EmrManagedMasterSecurityGroup": "sg-d4cc3fa2"
      }' \
      --service-role EMR_DefaultRole \
      --release-label emr-4.5.0 \
      --log-uri 's3n://aws-logs-591231097547-eu-west-1/elasticmapreduce/' \
      --name 'A Billion Taxi Trips' \
      --instance-groups '[{
          "InstanceCount": 2,
          "BidPrice": "0.048",
          "InstanceGroupType": "CORE",
          "InstanceType": "m3.xlarge",
          "Name": "Core instance group - 2"
      }, {
          "InstanceCount": 2,
          "BidPrice": "0.048",
          "InstanceGroupType": "TASK",
          "InstanceType": "m3.xlarge",
          "Name": "Task instance group - 3"
      }, {
          "InstanceCount": 1,
          "InstanceGroupType": "MASTER",
          "InstanceType": "m3.xlarge",
          "Name": "Master instance group - 1"
      }]' \
      --region eu-west-1

After 20 minutes the machines had all been provisioned, bootstrapped and I was able to SSH into the master node.

$ ssh -i ~/.ssh/emr.pem \
      hadoop@ec2-54-155-155-115.eu-west-1.compute.amazonaws.com

48 GB of Data on S3 and HDFS

I'll be using the 192 compressed ORC files I generated in my Billion Taxi Rides on Amazon EMR running Presto blog post. I keep these files stored on S3 and have used them for various benchmarks in the past.

$ aws s3 ls s3://<s3_bucket>/orc/

2016-03-14 13:54:41  398631347 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000000
2016-03-14 13:54:40  393489828 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000001
...
2016-03-14 14:00:23  265076102 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000190
2016-03-14 14:00:12  115110402 1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000191

I'll execute distcp to copy these files into HDFS.

$ hadoop distcp s3://<s3_bucket>/orc/ /

The above took 5 minutes and 16 seconds to complete. Not a bad overhead in my opinion.

The following are the counters reported.

File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=2803656
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=46506
        HDFS: Number of bytes written=52457486697
        HDFS: Number of read operations=1521
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=430
        S3: Number of bytes read=52457486697
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
Job Counters
        Launched map tasks=22
        Other local map tasks=22
        Total time spent by all maps in occupied slots (ms)=159285792
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=4977681
        Total vcore-milliseconds taken by all map tasks=4977681
        Total megabyte-milliseconds taken by all map tasks=5097145344
Map-Reduce Framework
        Map input records=193
        Map output records=0
        Input split bytes=3014
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=17274
        CPU time spent (ms)=1436840
        Physical memory (bytes) snapshot=8793264128
        Virtual memory (bytes) snapshot=45056073728
        Total committed heap usage (bytes)=6723469312
File Input Format Counters
        Bytes Read=43492
File Output Format Counters
        Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
        BYTESCOPIED=52457486697
        BYTESEXPECTED=52457486697
        COPY=193

After the files were copied to HDFS I made sure they were properly in place. I abbreviated the output to the file paths and their respective sizes.

$ hdfs dfs -ls /orc/

Found 192 items
...  398631347 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000000
...  393489828 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000001
...
...  265076102 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000190
...  115110402 ... /orc/1dcb5447-87d5-4418-b7fb-455e4e39e5f6-000191

Creating Tables in the Hive Metastore

I'm going to create two tables using Hive to represent both data locations. trips_orc_s3 will point to the data on S3 and trips_orc_hdfs will point to the data on HDFS.

$ hive

CREATE EXTERNAL TABLE trips_orc_s3 (
    trip_id                 INT,
    vendor_id               STRING,
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      STRING,
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            STRING,
    trip_type               SMALLINT,
    pickup                  STRING,
    dropoff                 STRING,

    cab_type                STRING,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          STRING,
    pickup_borocode         SMALLINT,
    pickup_boroname         STRING,
    pickup_ct2010           STRING,
    pickup_boroct2010       STRING,
    pickup_cdeligibil       STRING,
    pickup_ntacode          STRING,
    pickup_ntaname          STRING,
    pickup_puma             STRING,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         STRING,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        STRING,
    dropoff_ct2010          STRING,
    dropoff_boroct2010      STRING,
    dropoff_cdeligibil      STRING,
    dropoff_ntacode         STRING,
    dropoff_ntaname         STRING,
    dropoff_puma            STRING
) STORED AS orc
  LOCATION 's3://<s3_bucket>/orc/';

CREATE EXTERNAL TABLE trips_orc_hdfs (
    trip_id                 INT,
    vendor_id               STRING,
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      STRING,
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            STRING,
    trip_type               SMALLINT,
    pickup                  STRING,
    dropoff                 STRING,

    cab_type                STRING,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          STRING,
    pickup_borocode         SMALLINT,
    pickup_boroname         STRING,
    pickup_ct2010           STRING,
    pickup_boroct2010       STRING,
    pickup_cdeligibil       STRING,
    pickup_ntacode          STRING,
    pickup_ntaname          STRING,
    pickup_puma             STRING,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         STRING,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        STRING,
    dropoff_ct2010          STRING,
    dropoff_boroct2010      STRING,
    dropoff_cdeligibil      STRING,
    dropoff_ntacode         STRING,
    dropoff_ntaname         STRING,
    dropoff_puma            STRING
) STORED AS orc
  LOCATION '/orc/';

Benchmarking S3

I'll be using Presto 1.4.0 to benchmark both the S3 and the HDFS-based data. The following shows the queries times I saw for the S3-based data.

$ presto-cli \
    --catalog hive \
    --schema default

The following completed in 1 minute and 1 second.

SELECT cab_type,
       count(*)
FROM trips_orc_s3
GROUP BY cab_type;

Query 20160414_110943_00006_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:01 [1.11B rows, 48.8GB] [18.3M rows/s, 821MB/s]

The following completed in 58 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips_orc_s3
GROUP BY passenger_count;

Query 20160414_111100_00007_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
0:58 [1.11B rows, 48.8GB] [19.1M rows/s, 859MB/s]

The following completed in 1 minute and 43 seconds.

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips_orc_s3
GROUP BY passenger_count,
         year(pickup_datetime);

Query 20160414_111213_00008_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:43 [1.11B rows, 48.8GB] [10.8M rows/s, 484MB/s]

The following completed in 1 minute and 41 seconds.

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips_orc_s3
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;

Query 20160414_111407_00009_hzqcy, FINISHED, 4 nodes
Splits: 749 total, 749 done (100.00%)
1:41 [1.11B rows, 48.8GB] [11.1M rows/s, 497MB/s]

Benchmarking HDFS

The following shows the query times I saw for the HDFS-based data.

The following completed in 35 seconds (1.75x faster than S3).

SELECT cab_type,
       count(*)
FROM trips_orc_hdfs
GROUP BY cab_type;

Query 20160414_110514_00002_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
0:35 [1.11B rows, 48.8GB] [31.9M rows/s, 1.4GB/s]

The following completed in 39 seconds (1.5x faster than S3).

SELECT passenger_count,
       avg(total_amount)
FROM trips_orc_hdfs
GROUP BY passenger_count;

Query 20160414_110607_00003_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
0:39 [1.11B rows, 48.8GB] [28.2M rows/s, 1.24GB/s]

The following completed in 1 minute and 4 seconds (1.6x faster than S3).

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips_orc_hdfs
GROUP BY passenger_count,
         year(pickup_datetime);

Query 20160414_110658_00004_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
1:04 [1.11B rows, 48.8GB] [17.5M rows/s, 786MB/s]

The following completed in 1 minute and 21 seconds (1.25x faster than S3).

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips_orc_hdfs
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;

Query 20160414_110810_00005_hzqcy, FINISHED, 4 nodes
Splits: 793 total, 793 done (100.00%)
1:21 [1.11B rows, 48.8GB] [13.7M rows/s, 617MB/s]

Though the speed improvements using HDFS are considerable, S3 did perform pretty well. At worst there's a 1.75x overhead in exchange for virtually unlimited scalability, 11 9's of durability and no worrying about over/under-provisioning storage space.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.