50-node Presto Cluster on Google Cloud's Dataproc

In this blog post I'll take a look at launching a 50-node Dataproc cluster and see if I can achieve query times that approach those seen using Google's BigQuery.

50-Node Cluster Up & Running

To start I requested two quota increases for my Google Cloud account. The first was to be able to run 200 CPU cores and the second was for 50 'in use addresses'.

When launching the cluster there will be 50 machines in total used. These are broken down by: 1 master node, 2 primary workers which will act as data nodes, and 47 preemptible workers. Preemptible workers are discounted up to 70% off the regular instance price.

Note that the worker-boot-disk-size of 500 GB will only apply to the two primary worker nodes, the 47 preemptible secondary worker nodes will have 100 GB of space each.

$ gcloud dataproc clusters \
      create trips \
      --zone europe-west1-b \
      --master-machine-type n1-standard-4 \
      --master-boot-disk-size 500 \
      --num-preemptible-workers 47 \
      --worker-machine-type n1-standard-4 \
      --worker-boot-disk-size 500 \
      --scopes 'https://www.googleapis.com/auth/cloud-platform' \
      --project taxis-1273 \
      --initialization-actions 'gs://taxi-trips/presto.sh'

For notes on the bootstrap script and other parameters used please see my Billion Taxi Rides on Google's Dataproc running Presto blog post.

After executing the above command the cluster was up and running within two minutes. Once up and running I was able to SSH into the master node.

$ gcloud compute ssh \
    trips-m \
    --zone europe-west1-b

1.1 Billion Records on HDFS

I've taken the metadata of 1.1 billion taxi trips in New York City and converted them into denormalised, GZIP-compressed, CSV files. The steps involved can be found in my Billion Taxi Rides in Redshift blog post. In my 33x Faster Queries on Google Cloud's Dataproc blog post I then found a combination of file storage settings which offered extremely quick query performance on this dataset. This resulted in a set of files that are stored in ORC format using Snappy compression, 50K index strides and 512 MB index stripes. This dataset is stored on Google Cloud Storage. I'll run the following command to copy that dataset onto HDFS.

$ hadoop distcp \
    gs://taxi-trips/orc_snappy_50k_512mb/ /

The above took 13 minutes to complete.

I'll then create a table in Hive that will represent this data.

$ hive

CREATE EXTERNAL TABLE trips (
    trip_id                 INT,
    vendor_id               STRING,
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      STRING,
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            STRING,
    trip_type               SMALLINT,
    pickup                  STRING,
    dropoff                 STRING,

    cab_type                STRING,

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          STRING,
    pickup_borocode         SMALLINT,
    pickup_boroname         STRING,
    pickup_ct2010           STRING,
    pickup_boroct2010       STRING,
    pickup_cdeligibil       STRING,
    pickup_ntacode          STRING,
    pickup_ntaname          STRING,
    pickup_puma             STRING,

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         STRING,
    dropoff_borocode        SMALLINT,
    dropoff_boroname        STRING,
    dropoff_ct2010          STRING,
    dropoff_boroct2010      STRING,
    dropoff_cdeligibil      STRING,
    dropoff_ntacode         STRING,
    dropoff_ntaname         STRING,
    dropoff_puma            STRING
) STORED AS orc
  LOCATION '/orc_snappy_50k_512mb/';

Benchmarking a 50-Node Dataproc Cluster

For this benchmark I'm running Presto version 0.144.1.

$ presto \
    --catalog hive \
    --schema default

The following completed in 4 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

Query 20160505_062625_00004_gr6cv, FINISHED, 48 nodes
Splits: 2,053 total, 2,053 done (100.00%)
0:04 [1.13B rows, 19MB] [278M rows/s, 4.64MB/s]

The following completed in 4 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

Query 20160505_062909_00012_gr6cv, FINISHED, 49 nodes
Splits: 2,053 total, 2,053 done (100.00%)
0:04 [1.13B rows, 3.48GB] [286M rows/s, 900MB/s]

The following completed in 10 seconds.

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips
GROUP BY passenger_count,
         year(pickup_datetime);

Query 20160505_063109_00018_gr6cv, FINISHED, 49 nodes
Splits: 2,053 total, 2,053 done (100.00%)
0:10 [1.13B rows, 4.73GB] [111M rows/s, 474MB/s]

The following completed in 21 seconds.

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;

Query 20160505_063251_00022_gr6cv, FINISHED, 49 nodes
Splits: 2,053 total, 2,053 done (100.00%)
0:21 [1.13B rows, 8.12GB] [54.2M rows/s, 397MB/s]

BigQuery set the benchmark for query speeds, cost and simplicity of setup. Though this experiment set me back a little less than $2.50 it was pretty straightforward to setup and the 4 second query times are approaching the 2 second query times I saw with BigQuery. Presto, Dataproc and my Hadoop-related configurations continue to evolve so there could be more performance to be found yet.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.