In this blog post I'll take a look at launching a 50-node Dataproc cluster and see if I can achieve query times that approach those seen using Google's BigQuery.
50-Node Cluster Up & Running
To start I requested two quota increases for my Google Cloud account. The first was to be able to run 200 CPU cores and the second was for 50 'in use addresses'.
When launching the cluster there will be 50 machines in total used. These are broken down by: 1 master node, 2 primary workers which will act as data nodes, and 47 preemptible workers. Preemptible workers are discounted up to 70% off the regular instance price.
Note that the worker-boot-disk-size of 500 GB will only apply to the two primary worker nodes, the 47 preemptible secondary worker nodes will have 100 GB of space each.
$ gcloud dataproc clusters \ create trips \ --zone europe-west1-b \ --master-machine-type n1-standard-4 \ --master-boot-disk-size 500 \ --num-preemptible-workers 47 \ --worker-machine-type n1-standard-4 \ --worker-boot-disk-size 500 \ --scopes 'https://www.googleapis.com/auth/cloud-platform' \ --project taxis-1273 \ --initialization-actions 'gs://taxi-trips/presto.sh'
For notes on the bootstrap script and other parameters used please see my Billion Taxi Rides on Google's Dataproc running Presto blog post.
After executing the above command the cluster was up and running within two minutes. Once up and running I was able to SSH into the master node.
$ gcloud compute ssh \ trips-m \ --zone europe-west1-b
1.1 Billion Records on HDFS
I've taken the metadata of 1.1 billion taxi trips in New York City and converted them into denormalised, GZIP-compressed, CSV files. The steps involved can be found in my Billion Taxi Rides in Redshift blog post. In my 33x Faster Queries on Google Cloud's Dataproc blog post I then found a combination of file storage settings which offered extremely quick query performance on this dataset. This resulted in a set of files that are stored in ORC format using Snappy compression, 50K index strides and 512 MB index stripes. This dataset is stored on Google Cloud Storage. I'll run the following command to copy that dataset onto HDFS.
$ hadoop distcp \ gs://taxi-trips/orc_snappy_50k_512mb/ /
The above took 13 minutes to complete.
I'll then create a table in Hive that will represent this data.
CREATE EXTERNAL TABLE trips ( trip_id INT, vendor_id STRING, pickup_datetime TIMESTAMP, dropoff_datetime TIMESTAMP, store_and_fwd_flag STRING, rate_code_id SMALLINT, pickup_longitude DOUBLE, pickup_latitude DOUBLE, dropoff_longitude DOUBLE, dropoff_latitude DOUBLE, passenger_count SMALLINT, trip_distance DOUBLE, fare_amount DOUBLE, extra DOUBLE, mta_tax DOUBLE, tip_amount DOUBLE, tolls_amount DOUBLE, ehail_fee DOUBLE, improvement_surcharge DOUBLE, total_amount DOUBLE, payment_type STRING, trip_type SMALLINT, pickup STRING, dropoff STRING, cab_type STRING, precipitation SMALLINT, snow_depth SMALLINT, snowfall SMALLINT, max_temperature SMALLINT, min_temperature SMALLINT, average_wind_speed SMALLINT, pickup_nyct2010_gid SMALLINT, pickup_ctlabel STRING, pickup_borocode SMALLINT, pickup_boroname STRING, pickup_ct2010 STRING, pickup_boroct2010 STRING, pickup_cdeligibil STRING, pickup_ntacode STRING, pickup_ntaname STRING, pickup_puma STRING, dropoff_nyct2010_gid SMALLINT, dropoff_ctlabel STRING, dropoff_borocode SMALLINT, dropoff_boroname STRING, dropoff_ct2010 STRING, dropoff_boroct2010 STRING, dropoff_cdeligibil STRING, dropoff_ntacode STRING, dropoff_ntaname STRING, dropoff_puma STRING ) STORED AS orc LOCATION '/orc_snappy_50k_512mb/';
Benchmarking a 50-Node Dataproc Cluster
For this benchmark I'm running Presto version 0.144.1.
$ presto \ --catalog hive \ --schema default
The following completed in 4 seconds.
SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
Query 20160505_062625_00004_gr6cv, FINISHED, 48 nodes Splits: 2,053 total, 2,053 done (100.00%) 0:04 [1.13B rows, 19MB] [278M rows/s, 4.64MB/s]
The following completed in 4 seconds.
SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
Query 20160505_062909_00012_gr6cv, FINISHED, 49 nodes Splits: 2,053 total, 2,053 done (100.00%) 0:04 [1.13B rows, 3.48GB] [286M rows/s, 900MB/s]
The following completed in 10 seconds.
SELECT passenger_count, year(pickup_datetime), count(*) FROM trips GROUP BY passenger_count, year(pickup_datetime);
Query 20160505_063109_00018_gr6cv, FINISHED, 49 nodes Splits: 2,053 total, 2,053 done (100.00%) 0:10 [1.13B rows, 4.73GB] [111M rows/s, 474MB/s]
The following completed in 21 seconds.
SELECT passenger_count, year(pickup_datetime) trip_year, round(trip_distance), count(*) trips FROM trips GROUP BY passenger_count, year(pickup_datetime), round(trip_distance) ORDER BY trip_year, trips desc;
Query 20160505_063251_00022_gr6cv, FINISHED, 49 nodes Splits: 2,053 total, 2,053 done (100.00%) 0:21 [1.13B rows, 8.12GB] [54.2M rows/s, 397MB/s]
BigQuery set the benchmark for query speeds, cost and simplicity of setup. Though this experiment set me back a little less than $2.50 it was pretty straightforward to setup and the 4 second query times are approaching the 2 second query times I saw with BigQuery. Presto, Dataproc and my Hadoop-related configurations continue to evolve so there could be more performance to be found yet.