1.1 Billion Taxi Rides in ClickHouse on DoubleCloud

DoubleCloud is a newly-launched managed database provider. With a headcount of 50, they have development and sales offices in Berlin, Prague and Tel Aviv. Further office openings in the US, UK and France are expected in the next few months. DoubleCloud lets you run ClickHouse and Kafka on your own Cloud account. They provide a UI to set up and scale while software and operating system updates are handled automatically.

DoubleCloud also offers consulting services to help migrate existing workloads to their offering. Earlier this year, I teamed up with them to help Paris-based Enterprise SEO provider Botify understand what it would take to move their 5 PB data warehouse from BigQuery to DoubleCloud.

The key to being able to improve Botify's workload performance while saving ~5x on storage costs was the S3 storage engine for ClickHouse. This allows storage to scale automatically, minimises maintenance and typically only incurs a 2-3x performance penalty over DoubleCloud's EBS-based storage option.

The S3 support in DoubleCloud includes a Hybrid storage engine. When you launch a cluster, an S3 bucket is set up automatically. When your cluster detects it's running low on local storage space it'll automatically transfer your data off to S3. This migration won't require any existing queries to be rewritten. Communication with S3 is optimised with a caching layer which can speed up queries by 4.5x in some cases once the cache has been warmed up.

DoubleCloud's support has 16 hours a day of coverage year-round. The firm also hosts multi-day training workshops. Active contributors to the ClickHouse codebase staff their 3rd line of support. These engineers are experts in both C++ and database and operating system architecture.

DoubleCloud support per-second billing and its pricing model sets a ceiling on spend rather than a floor. In the past, I've seen clients hit with sticker shock after simple scripts hit their pay-per-query database providers more often than expected. With a price ceiling, predicting spend is much easier.

In this post, I'm going to see how fast DoubleCloud can run the 1.1 billion taxi rides benchmark. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, Elasticsearch, EMR, Hydrolix, kdb+/q, OmniSci, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

DoubleCloud, Up & Running

DoubleCloud offers a UI for launching ClickHouse. For this post, I'll launch an s1-c32-m128 instance on AWS' eu-central-1 region in Frankfurt, Germany. This instance has 32 cores, 128 GB of RAM, 512 GB of storage and runs version 22.8 of DoubleCloud's ClickHouse release. If this cluster is left running for a month it would cost $2,516.35 before any sales taxes.

I'll install ClickHouse's client locally on my system and use it to connect to the newly launched instance.

$ curl https://clickhouse.com/ | sh
$ ./clickhouse client \
    --host ****.at.double.cloud \
    --port 9440 \
    --secure \
    --user admin \
    --password ****

Importing 1.1 Billion Taxi Trips

The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over six years. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

I'll first create a new row-based table.

CREATE TABLE ebs.trips (
    trip_id                 UInt32,
    vendor_id               String,
    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),
    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;

I'll then import the dataset from S3 using 56 threads.

INSERT INTO ebs.trips
SELECT *
FROM s3('https://s3.eu-central-1.amazonaws.com/<bucket>/csv/trips*.csv.gz',
        '<access-key>',
        '<secret-access-key>',
        'CSV',
        '
  trip_id                 UInt32,
  vendor_id               String,
  pickup_datetime         DateTime,
  dropoff_datetime        Nullable(DateTime),
  store_and_fwd_flag      Nullable(FixedString(1)),
  rate_code_id            Nullable(UInt8),
  pickup_longitude        Nullable(Float64),
  pickup_latitude         Nullable(Float64),
  dropoff_longitude       Nullable(Float64),
  dropoff_latitude        Nullable(Float64),
  passenger_count         Nullable(UInt8),
  trip_distance           Nullable(Float64),
  fare_amount             Nullable(Float32),
  extra                   Nullable(Float32),
  mta_tax                 Nullable(Float32),
  tip_amount              Nullable(Float32),
  tolls_amount            Nullable(Float32),
  ehail_fee               Nullable(Float32),
  improvement_surcharge   Nullable(Float32),
  total_amount            Nullable(Float32),
  payment_type            Nullable(String),
  trip_type               Nullable(UInt8),
  pickup                  Nullable(String),
  dropoff                 Nullable(String),
  cab_type                Nullable(String),
  precipitation           Nullable(Int8),
  snow_depth              Nullable(Int8),
  snowfall                Nullable(Int8),
  max_temperature         Nullable(Int8),
  min_temperature         Nullable(Int8),
  average_wind_speed      Nullable(Int8),
  pickup_nyct2010_gid     Nullable(Int8),
  pickup_ctlabel          Nullable(String),
  pickup_borocode         Nullable(Int8),
  pickup_boroname         Nullable(String),
  pickup_ct2010           Nullable(String),
  pickup_boroct2010       Nullable(String),
  pickup_cdeligibil       Nullable(FixedString(1)),
  pickup_ntacode          Nullable(String),
  pickup_ntaname          Nullable(String),
  pickup_puma             Nullable(String),
  dropoff_nyct2010_gid    Nullable(UInt8),
  dropoff_ctlabel         Nullable(String),
  dropoff_borocode        Nullable(UInt8),
  dropoff_boroname        Nullable(String),
  dropoff_ct2010          Nullable(String),
  dropoff_boroct2010      Nullable(String),
  dropoff_cdeligibil      Nullable(String),
  dropoff_ntacode         Nullable(String),
  dropoff_ntaname         Nullable(String),
  dropoff_puma            Nullable(String)',
  'gzip')
SETTINGS max_threads=56,
         max_insert_threads=56;

The above finished in 29 minutes and 25 seconds. The resulting table is stored on the instance's EBS volume and is 144.88 GB in size. Below I'll convert the data into an optimised column-based table.

SET max_insert_threads=2;

CREATE TABLE ebs.trips_mergetree
    ENGINE = MergeTree()
        PARTITION BY toYYYYMM(pickup_date)
        ORDER BY (pickup_datetime)
    AS SELECT
        trip_id,
        CAST(vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14)) AS vendor_id,
        toDate(pickup_datetime)                 AS pickup_date,
        ifNull(pickup_datetime, toDateTime(0))  AS pickup_datetime,
        toDate(dropoff_datetime)                AS dropoff_date,
        ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
        assumeNotNull(store_and_fwd_flag)       AS store_and_fwd_flag,
        assumeNotNull(rate_code_id)             AS rate_code_id,

        assumeNotNull(pickup_longitude)         AS pickup_longitude,
        assumeNotNull(pickup_latitude)          AS pickup_latitude,
        assumeNotNull(dropoff_longitude)        AS dropoff_longitude,
        assumeNotNull(dropoff_latitude)         AS dropoff_latitude,
        assumeNotNull(passenger_count)          AS passenger_count,
        assumeNotNull(trip_distance)            AS trip_distance,
        assumeNotNull(fare_amount)              AS fare_amount,
        assumeNotNull(extra)                    AS extra,
        assumeNotNull(mta_tax)                  AS mta_tax,
        assumeNotNull(tip_amount)               AS tip_amount,
        assumeNotNull(tolls_amount)             AS tolls_amount,
        assumeNotNull(ehail_fee)                AS ehail_fee,
        assumeNotNull(improvement_surcharge)    AS improvement_surcharge,
        assumeNotNull(total_amount)             AS total_amount,
        assumeNotNull(payment_type)             AS payment_type_,
        assumeNotNull(trip_type)                AS trip_type,

        pickup AS pickup,
        pickup AS dropoff,

        CAST(assumeNotNull(cab_type)
            AS Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

        precipitation           AS precipitation,
        snow_depth              AS snow_depth,
        snowfall                AS snowfall,
        max_temperature         AS max_temperature,
        min_temperature         AS min_temperature,
        average_wind_speed      AS average_wind_speed,

        pickup_nyct2010_gid     AS pickup_nyct2010_gid,
        pickup_ctlabel          AS pickup_ctlabel,
        pickup_borocode         AS pickup_borocode,
        pickup_boroname         AS pickup_boroname,
        pickup_ct2010           AS pickup_ct2010,
        pickup_boroct2010       AS pickup_boroct2010,
        pickup_cdeligibil       AS pickup_cdeligibil,
        pickup_ntacode          AS pickup_ntacode,
        pickup_ntaname          AS pickup_ntaname,
        pickup_puma             AS pickup_puma,

        dropoff_nyct2010_gid    AS dropoff_nyct2010_gid,
        dropoff_ctlabel         AS dropoff_ctlabel,
        dropoff_borocode        AS dropoff_borocode,
        dropoff_boroname        AS dropoff_boroname,
        dropoff_ct2010          AS dropoff_ct2010,
        dropoff_boroct2010      AS dropoff_boroct2010,
        dropoff_cdeligibil      AS dropoff_cdeligibil,
        dropoff_ntacode         AS dropoff_ntacode,
        dropoff_ntaname         AS dropoff_ntaname,
        dropoff_puma            AS dropoff_puma
    FROM ebs.trips;

The above finished in 50 minutes and 43 seconds and takes up 98.3 GB of EBS storage.

The following is an optional step but it saved 593 MB of space and improved query times by 7-10%. It took 51 minutes and 50 seconds to complete.

OPTIMIZE TABLE ebs.trips_mergetree FINAL;

Benchmarking DoubleCloud

The following were the fastest times I saw after running each query multiple times.

The following finished in 0.347 seconds.

SELECT cab_type, count(*)
FROM ebs.trips_mergetree
GROUP BY cab_type;

The following finished in 1.1 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM ebs.trips_mergetree
GROUP BY passenger_count;

The following finished in 1.389 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
         year;

The following finished in 2.935 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM ebs.trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Closing Thoughts

The first query's time is the fastest I've seen among any of the Cloud-exclusive offerings I've run benchmarks on. It's great to see these times without the need for any sort of complex tuning or extensive hardware planning.

Hybrid storage is a real game changer. Often clients want to keep an entire dataset, if not several, accessible via a single database. These datasets can stretch back years but often only the past month or two is being actively queried. Provisioning 512 GB of local storage for a 5 TB dataset means you can keep the EBS costs low, let DoubleCloud figure out which records should live on S3 and which should be cached while enjoying ClickHouse's incredible performance.

The per-second billing is also handy. I have clients that have modest datasets but occasionally bring in a large dataset for a one-off analysis piece. Being able to spin up a cluster for a few hours if not minutes and get the answers out of the data while not paying for a second more than needed is fantastic.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.