1.1 Billion Taxi Rides using Hydrolix on AWS

Hydrolix is a Cloud-based analytics platform. It's reasonably unique in that the system's components have been broken up into their own services that can be scaled up and down as needed. For example, you can scale up batch import peers to any number of EC2 instances and pick bespoke instance types solely for a large, one-off import. When the import is complete, you can shut down these peers while the rest of the cluster's services remain unaffected.

The same goes for streaming in real-time data or running large queries. The underlying core services cost around $30 / month to run but beyond that, you can set the price you're willing to pay ahead of time for elastic workloads. Being able to pick instance types means there is a degree of vertical scaling in addition to horizontal scaling.

The data you store in Hydrolix ends up living in an AWS S3 bucket that you own. They've put a lot of effort into bespoke compression techniques, predicate push-downs and micro-indexing which goes a long way to reducing S3 costs as well as speeding up queries. In using S3 efficiently, the offering goes a long way to removing the need for "hot" and "cold" data management techniques commonly found with fixed-capacity storage systems.

All the above means your capacity plans aren't set in stone and can be changed with a moment's notice.

Hydrolix was designed for immutable, append-only datasets and the SQL API is based on ClickHouse's flavour of SQL and binary wire protocol so Hydrolix is able to piggyback off of ClickHouse support already found in tools like Google's Looker, Superset and Grafana. Features like unlimited dimensions, unlimited cardinality, out-of-order data, and multi-schema tables are all supported.

The team behind Hydrolix is headquartered in Portland, Oregon and have previously worked for data- and networking-centric firms such as Cisco, Fastly, Akamai, Elastic, Splunk and New Relic to name a few.

In this post, I'm going to see how fast Hydrolix can run the 1.1 billion taxi rides benchmark. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, OmniSci, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

Hydrolix Service Types

A Hydrolix cluster is made up of nine different services. They are:

head is a query manager that distributes workloads amongst peers and gathers their results.
query-peer streams data partitions and executes queries.
merge-peer will merge smaller data partitions into larger ones which help improve query performance.
batch-peer ingests batch data from S3. This can happen as a one-off event or continuously via notifications.
stream-head handles ingestion of streaming data, distributing workloads across stream-peers.
stream-peer writes indexed partitions of streamed data.
ui runs a web interface offering a visual configuration interface and user management.
config is an HTTP API endpoint for managing configuration.
grafana is a dashboard service which comes pre-configured for Hydrolix.

An account on Hydrolix can have multiple clusters. The following is an example listing.

$ ./hdxctl clusters

CLIENT_ID        CLUSTER_ID    CREATED              HOST                 STATUS           WHO      REGION
---------------  ------------  -------------------  -------------------  ---------------  -------  ---------
hdxcli-ekmeho6e  hdx-zl72rqld  2021-01-05 08:50:14  try3.hydrolix.live.  CREATE_COMPLETE  jturner  us-east-2

Within any cluster you can see which services you're running, the EC2 instance type and node count. Here is an example output.

$ ./hdxctl scale $CLIENT_ID $CLUSTER_ID

SERVICE/POOL      COUNT  FAMILY    SIZE       DISK
--------------  -------  --------  -------  ------
batch-peer0           1  r5        2xlarge      30
config                1  t2        micro        30
grafana               1  t2        micro        30
head                  1  c5n       xlarge       30
merge-peer0           1  r5        large        30
query-peer0           3  c5n       9xlarge     100
stream-head           1  m5        xlarge       30
stream-peer0          1  m5        xlarge       30
ui                    1  t2        micro        30

A Hydrolix Cluster, Up & Running

I'll first download the Python-based client for Hydrolix and make sure it's executable.

$ wget -O hdxctl https://hdx-infrastructure.s3.amazonaws.com/hdxctl-v2.1.5
$ chmod +x hdxctl

I'll then get a license key from Hydrolix. My Hydrolix account has been configured with the credentials for an AWS sub-account that will be used for storing data and provisioning EC2 instances that run Hydrolix.

$ ./hdxctl get-license \
    –-account_id 123456789012 \
    –-region us-east-2 \
    –-admin-email mark@marksblogg.com \
    –-organization marksblogg \
    --host marksblogg

The above will populate the CLIENT_ID, CLUSTER_ID and ORG_ID environment variables which will be used later on.

In my AWS EC2 sub-account, I've subscribed to the Hydrolix AMI. On that page, there is a pricing matrix explaining the costs and configuration options. This AMI can also be used as apart of a free trial.

The following will launch a new cluster.

$ ./hdxctl create-cluster $CLIENT_ID

A Hydrolix cluster is made up of several services. Initially, their scale is set to zero. The following command will scale them up to the minimal amount needed for a working cluster.

$ ./hdxctl scale --minimal $CLIENT_ID $CLUSTER_ID

Follow the above launch, an email arrived with the initial cluster credentials.

Importing 1.1 Billion Trips Into Hydrolix

The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over six years. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

I'll grant Hydrolix access to an S3 bucket where I've stored the dataset for this benchmark. The AWS CLI has already been installed and configured with my credentials on my machine ahead of time.

$ ./hdxctl update $CLIENT_ID \
        --bucket-allowlist some-bucket

The config service is IPv4-restricted. I'll run the following to allow my IPv4 address access.

$ ./hdxctl update $CLIENT_ID \
        --ip-allowlist "123.456.789.12/32"

I'll then fetch an API token using the credentials I was emailed when I initially launched the cluster.

$ curl \
    --request POST \
    --url https://marksblogg.hydrolix.io/config/v1/login \
    --header 'content-type: application/json' \
    --data '{"username":"mark@marksblogg.com", "password":"<PASSWORD>"}'

The response should look something like the following:

{
  "uuid": "<USER_ID>",
  "orgs": [
    {
      "uuid": "<ORG_ID>",
      "name": "marksblogg",
      "type": "singletenant"
    }
  ],
  "auth_token": {
    "access_token": "<API_TOKEN>",
    "expires_in": 3600,
  }
}

Hydrolix tables are grouped by project name. I'll create a new project called "taxi".

$ curl \
    --request POST \
    --url https://marksblogg.hydrolix.io/config/v1/orgs/$ORG_ID/projects \
    --header 'content-type: application/json' \
    --header 'Authorization: Bearer $API_TOKEN' \
    --data '{"name": "taxi", "org": "$ORG_ID"}'

The response should look something like the following:

{
  "uuid": "<PROJECT_ID>",
  "name": "taxi",
  "description": "",
  "created": "2021-01-05T08:00:00.488083Z",
  "modified": "2021-01-05T08:00:00.488104Z",
  "org": "<ORG_ID>"
}

I'll create a table called "trips" within the "taxi" project.

$ curl \
    --request POST \
    --url https://marksblogg.hydrolix.io/config/v1/orgs/$ORG_ID/projects/$PROJECT_ID/tables \
    --header 'content-type: application/json' \
    --header 'Authorization: Bearer $API_TOKEN' \
    --data '{ "name": "trips", "project": "$PROJECT_ID"}'

The response should look something like the following:

{
  "uuid": "<TABLE_ID>",
  "name": "trips",
  "description": "",
  "created": "2021-01-05T08:00:00.334094Z",
  "modified": "2021-01-05T08:00:00.334126Z",
  "settings": {
    "stream": {
        "max_minutes_per_partition": 15,
        "max_rows_per_partition": 33554432
        }
    }
  "project": "$PROJECT_ID"
}

I'll now configure a transform for the trips table so that it knows it's ingesting GZIP-compressed, comma-delimited, CSV data.

$ curl \
    --request POST \
    --url https://marksblogg.hydrolix.io/config/v1/orgs/$ORG_ID/projects/$PROJECT_ID/tables/$TABLE_ID/transforms \
    --header 'content-type: application/json' \
    --header 'Authorization: Bearer $API_TOKEN' \
    --data '{
        "name": "taxi_transform",
        "type": "csv",
        "table": "$TABLE_ID",
        "settings": {
          "compression": "gzip",
          "is_default": true,
          "format_details": {
             "delimiter": ","
          },
          "output_columns": [
              {"name": "trip_id",               "type": "uint64", "treatment": "tag", "position": 0},
              {"name": "vendor_id",             "type": "string", "treatment": "tag", "position": 1},
              {"name": "pickup_datetime",       "type": "datetime", "format": "2006-01-02 15:04:05", "treatment": "primary", "position": 2},
              {"name": "dropoff_datetime",      "type": "datetime","format": "2006-01-02 15:04:05", "treatment": "tag", "position": 3},
              {"name": "store_and_fwd_flag",    "type": "string", "treatment": "tag", "position": 4},
              {"name": "rate_code_id",          "type": "uint64", "treatment": "tag", "position": 5},
              {"name": "pickpup_longitude",     "type": "double", "treatment": "metric", "position": 6},
              {"name": "pickpup_latitude",      "type": "double", "treatment": "metric", "position": 7},
              {"name": "dropoff_longitude",     "type": "double", "treatment": "metric", "position": 8},
              {"name": "dropoff_latitude",      "type": "double", "treatment": "metric", "position": 9},
              {"name": "passenger_count",       "type": "uint64", "treatment": "metric", "position": 10},
              {"name": "trip_distance",         "type": "double", "treatment": "metric", "position": 11},
              {"name": "fare_amount",           "type": "double", "treatment": "metric", "position": 12},
              {"name": "extra",                 "type": "double", "treatment": "metric", "position": 13},
              {"name": "mta_tax",               "type": "double", "treatment": "metric", "position": 14},
              {"name": "tip_amount",            "type": "double", "treatment": "metric", "position": 15},
              {"name": "tolls_amount",          "type": "double", "treatment": "metric", "position": 16},
              {"name": "ehail_fee",             "type": "double", "treatment": "metric", "position": 17},
              {"name": "improvement_surcharge", "type": "double", "treatment": "metric", "position": 18},
              {"name": "total_amount",          "type": "double", "treatment": "metric", "position": 19},
              {"name": "payment_type",          "type": "string", "treatment": "tag", "position": 20},
              {"name": "trip_type",             "type": "uint64", "treatment": "tag", "position": 21},
              {"name": "pickup",                "type": "string", "treatment": "tag", "position": 22},
              {"name": "dropoff",               "type": "string", "treatment": "tag", "position": 23},
              {"name": "cab_type",              "type": "string", "treatment": "tag", "position": 24},
              {"name": "precipitation",         "type": "uint64", "treatment": "metric", "position": 25},
              {"name": "snow_depth",            "type": "uint64", "treatment": "metric", "position": 26},
              {"name": "snowfall",              "type": "uint64", "treatment": "metric", "position": 27},
              {"name": "max_temperature",       "type": "uint64", "treatment": "metric", "position": 28},
              {"name": "min_temperature",       "type": "uint64", "treatment": "metric", "position": 29},
              {"name": "average_wind_speed",    "type": "uint64", "treatment": "metric", "position": 30},
              {"name": "pickup_nyct2010_grid",  "type": "uint64", "treatment": "tag", "position": 31},
              {"name": "pickup_ctlabel",        "type": "string", "treatment": "tag", "position": 32},
              {"name": "pickup_borocode",       "type": "uint64", "treatment": "tag", "position": 33},
              {"name": "pickup_boroname",       "type": "string", "treatment": "tag", "position": 34},
              {"name": "pickup_ct2010",         "type": "string", "treatment": "tag", "position": 35},
              {"name": "pickup_boroct2010",     "type": "string", "treatment": "tag", "position": 36},
              {"name": "pickup_cdeligibil",     "type": "string", "treatment": "tag", "position": 37},
              {"name": "pickup_ntacode",        "type": "string", "treatment": "tag", "position": 38},
              {"name": "pickup_ntaname",        "type": "string", "treatment": "tag", "position": 39},
              {"name": "pickup_puma",           "type": "string", "treatment": "tag", "position": 40},
              {"name": "dropoff_nyct2010_grid", "type": "uint64", "treatment": "tag", "position": 41},
              {"name": "dropoff_ctlabel",       "type": "string", "treatment": "tag", "position": 42},
              {"name": "dropoff_borocode",      "type": "uint64", "treatment": "tag", "position": 43},
              {"name": "dropoff_boroname",      "type": "string", "treatment": "tag", "position": 44},
              {"name": "dropoff_ct2010",        "type": "string", "treatment": "tag", "position": 45},
              {"name": "dropoff_boroct2010",    "type": "string", "treatment": "tag", "position": 46},
              {"name": "dropoff_cdeligibil",    "type": "string", "treatment": "tag", "position": 47},
              {"name": "dropoff_ntacode",       "type": "string", "treatment": "tag", "position": 48},
              {"name": "dropoff_ntaname",       "type": "string", "treatment": "tag", "position": 49},
              {"name": "dropoff_puma",          "type": "string", "treatment": "tag", "position": 50}
            ]
          }
        }'

I'll now scale up the batch-peer count to ten r5.xlarge EC2 instances in order to speed up the CSV import.

$ ./hdxctl scale \
        --batch-peer-count 10 \
        --batch-peer-instance-type r5.xlarge \
        $CLIENT_ID $CLUSTER_ID

The following will trigger a batch job that will conduct the import.

$ curl \
    --request POST \
    --url https://$HOSTNAME.hydrolix.io/config/v1/orgs/$ORG_ID/jobs \
    --header 'content-type: application/json' \
    --header 'Authorization: Bearer $API_TOKEN ' \
    --data '{"name": "taxi data import",
             "settings": {
                 "max_rows_per_partition": 1000000,
                 "max_minutes_per_partition": 14400,
                 "source":{
                   "table": "taxi.trips",
                   "type": "batch",
                   "subtype": "aws s3",
                   "transform": "taxi_trips_transform",
                   "settings":{
                     "url": "s3://some-bucket/taxi-trips/"
                   }
                 }
              }
           }'

The import job can be monitored with the following command.

$ curl \
    --request GET \
    --url https://$HOSTNAME.hydrolix.io/config/v1/orgs/$ORG_ID/jobs \
    --header 'content-type: application/json' \
    --header 'Authorization: Bearer $API_TOKEN' \
    --data '{"name": "taxi data import"}'

Once the import is complete I can scale the batch-peer node count down to zero.

$ ./hdxctl scale \
        --batch-peer-count 0 \
        $CLIENT_ID $CLUSTER_ID

The above import completed in ~30 minutes and resulted in 1,390 HDX partitions being stored on S3 with a total size of 83.6 GB.

Benchmarking Hydrolix

For the benchmark below, I'll scale the query-head count to one and the query-peer count to three. Both will be using EC2's c5n.9xlarge instance type.

$ ./hdxctl scale \
    --query-head-count 1 \
    --query-peer-count 3 \
    --query-peer-instance-type c5n.9xlarge \
    $CLIENT_ID $CLUSTER_ID

The following were executed via the Hydrolix SQL interface that is exposed via its REST-based HTTP API.

The following completed in 0.466 seconds.

SELECT cab_type,
       COUNT(*)
FROM taxi.trips
GROUP BY cab_type;

The following completed in 1.094 seconds.

SELECT passenger_count,
       AVG(total_amount)
FROM taxi.trips
GROUP BY passenger_count;

The following completed in 0.742 seconds.

SELECT passenger_count,
       toYear(pickup_datetime) AS year,
       COUNT(*)
FROM taxi.trips
GROUP BY passenger_count,
         year;

The following completed in 1.412 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       ROUND(trip_distance) AS distance,
       COUNT(*)
FROM taxi.trips
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         COUNT(*) DESC;

With the benchmark complete I'll scale down the query-head and query-peer nodes to zero. This will save me from having to pay for any ongoing EC2 fees for those machines.

$ ./hdxctl scale \
    --query-head-count 0 \
    --query-peer-count 0 \
    $CLIENT_ID $CLUSTER_ID

Final Thoughts

Query #4's benchmark time managed to out-perform the last ClickHouse benchmark. The query does a fair amount of transforming and has traditionally been the slowest query to compute regardless of the database I've run it on. It's good to see optimisations in this space.

It's also good to see these new breeds of analytics engines are treating AWS S3 as commodity storage. It's very inexpensive to keep data stored there and I haven't found another service that has their level of proven durability. The security options available with S3 are akin to a Swiss Army Knife.

A few years ago, I rented an AirBNB in Southern Europe from an Oracle Engineer working for a telecoms firm. He told me the staff at his firm are forbidden from doing anything on their Data Warehouse during the last ten days of any month as they needed to make sure every customers' bill generated on time. With something like Hydrolix, his employer could get their credit card out, scale up for the job and scale back down afterwards.

Having the ability to turn parts of a distributed database on and off is fantastic. Countless times in my career I've had to import a legacy dataset into a new database and often this process is limited by the hardware the cluster was designed to run 24/7. In some cases, I've manually launched extra EC2 instances to handle the import and transformations but this can take up a lot of expensive data engineering time.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.