1.1 Billion Taxi Rides on ClickHouse & an Intel Core i5

ClickHouse is an open-source, columnar-oriented database that's been developed primarily by engineers at Yandex. Yandex is one of Europe's largest Internet-focused businesses. There search engine alone was receiving 150 million searches a day in 2012. ClickHouse has been deployed among a number of their businesses including their Metrica offering which is the world's second-largest web analytics platform. Outside of Yandex, ClickHouse has also been deployed at CERN where it was used to analyse events from the Large Hadron Collider.

ClickHouse is very feature-rich. It supports 16 different table engines, its CLI includes animated progress indicators and syntax highlighting and its performance puts it into its own class among open source offerings.

That being said, the software isn't exactly a drop-in replacement for something like PostgreSQL. It lacks transactions, full-fledged UPDATE and DELETE statements and calls like DROP TABLE won't work on tables above 54 GB without first changing the server's configuration. Nonetheless, it's a very respectable data store.

Alexey Milovidov, a senior software engineer at Yandex and a core contributor to ClickHouse, put together a guide to benchmarking the database using a variation of the 1.1 billion taxi ride dataset I use in my benchmarks.

In this blog post, I'll be running a benchmark on ClickHouse using the exact same set I've used to benchmark Amazon Athena, BigQuery, Elasticsearch, kdb+/q, MapD, PostgreSQL, Presto, Redshift, Spark and Vertica. I've compiled a single-page summary of these benchmarks.

The Hardware

For this benchmark I'll be running a fresh installation of Ubuntu 14.04.3 LTS on an Intel Core i5 4670K clocked at 3.4 GHz, 16 GB of DDR3 RAM and a SanDisk SDSSDHII960G 960 GB SSD drive.

ClickHouse Up & Running

I'll first run a few commands so that I can install ClickHouse from pre-made packages.

$ echo "deb http://repo.yandex.ru/clickhouse/trusty stable main" | \
    sudo tee -a /etc/apt/sources.list
$ sudo apt-key adv \
    --keyserver keyserver.ubuntu.com \
    --recv E0C56BD4
$ sudo apt update

I'll then install version 1.1.54140 of the client and server software.

$ sudo apt install \
    clickhouse-client \
    clickhouse-server-common

I'll then launch the server. I haven't set up any bespoke configuration so all the settings are what you'd get right out of the box.

$ sudo service clickhouse-server start

Loading 1.1 Billion Trips into ClickHouse

The dataset I'm working with is broken up into 56 gzip-compressed, CSV files. I'll first import them into a "trips" table which is backed by the "Log" engine.

$ clickhouse-client

CREATE TABLE trips (
    trip_id                 UInt32,
    vendor_id               String,

    pickup_datetime         DateTime,
    dropoff_datetime        Nullable(DateTime),

    store_and_fwd_flag      Nullable(FixedString(1)),
    rate_code_id            Nullable(UInt8),
    pickup_longitude        Nullable(Float64),
    pickup_latitude         Nullable(Float64),
    dropoff_longitude       Nullable(Float64),
    dropoff_latitude        Nullable(Float64),
    passenger_count         Nullable(UInt8),
    trip_distance           Nullable(Float64),
    fare_amount             Nullable(Float32),
    extra                   Nullable(Float32),
    mta_tax                 Nullable(Float32),
    tip_amount              Nullable(Float32),
    tolls_amount            Nullable(Float32),
    ehail_fee               Nullable(Float32),
    improvement_surcharge   Nullable(Float32),
    total_amount            Nullable(Float32),
    payment_type            Nullable(String),
    trip_type               Nullable(UInt8),
    pickup                  Nullable(String),
    dropoff                 Nullable(String),

    cab_type                Nullable(String),

    precipitation           Nullable(Int8),
    snow_depth              Nullable(Int8),
    snowfall                Nullable(Int8),
    max_temperature         Nullable(Int8),
    min_temperature         Nullable(Int8),
    average_wind_speed      Nullable(Int8),

    pickup_nyct2010_gid     Nullable(Int8),
    pickup_ctlabel          Nullable(String),
    pickup_borocode         Nullable(Int8),
    pickup_boroname         Nullable(String),
    pickup_ct2010           Nullable(String),
    pickup_boroct2010       Nullable(String),
    pickup_cdeligibil       Nullable(FixedString(1)),
    pickup_ntacode          Nullable(String),
    pickup_ntaname          Nullable(String),
    pickup_puma             Nullable(String),

    dropoff_nyct2010_gid    Nullable(UInt8),
    dropoff_ctlabel         Nullable(String),
    dropoff_borocode        Nullable(UInt8),
    dropoff_boroname        Nullable(String),
    dropoff_ct2010          Nullable(String),
    dropoff_boroct2010      Nullable(String),
    dropoff_cdeligibil      Nullable(String),
    dropoff_ntacode         Nullable(String),
    dropoff_ntaname         Nullable(String),
    dropoff_puma            Nullable(String)
) ENGINE = Log;

The dataset itself uses commas for delimiting fields. None of the contents of the data contains any commas themselves so there are no quotations used to aid escaping data. NULL values are defined by the simple absence of any content between the comma delimiters. Normally this isn't an issue but with ClickHouse empty fields won't be treated as NULLs in order to avoid ambiguity with empty strings. For this reason, I need to pipe the data through a transformation script that will replace all empty values with \N.

Below is the transformation script.

$ cat trans.py

import sys


for line in sys.stdin:
    print ','.join([item if len(item.strip()) else '\N'
                    for item in line.strip().split(',')])

This is the bash command I used to import the 1.1 billion records into the trips table.

$ time (for filename in /home/mark/trips/trips_x*.csv.gz; do
            gunzip -c $filename | \
                python trans.py | \
                clickhouse-client \
                    --query="INSERT INTO trips FORMAT CSV"
        done)

The above completed in 3 hours 27 minutes and 35 seconds. There is little doubt that if I didn't use a Python script for transforming the contents that this time would have been significantly reduced. The following shows the top command during the import and you can see the Python script is eating up a lot of resources.

... %CPU %MEM     TIME+ COMMAND
... 99.2  0.1   0:32.76 python trans.py
... 45.3 12.3   0:17.14 clickhouse-client --query=INSERT INTO trips FORMAT CSV
... 28.3  0.0   0:09.89 gzip -d -c /home/mark/trips/trips_xac.csv.gz

With the log engine-backed table in place, I'll create a new table that'll use the MergeTree engine. This table will be populated with a transformation of the trips table's contents.

In Milovidov's guide, a number of the fields for this table were converted from strings into enumerate columns. Some of these columns didn't support every value seen in my version of the dataset so I needed to remove the casting process for these fields. Below is my version of the MergeTree-backed table.

The following completed in 39 minutes and 57 seconds.

CREATE TABLE trips_mergetree
ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
AS SELECT
    trip_id,
    CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id,
    toDate(pickup_datetime) AS pickup_date,
    ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime,
    toDate(dropoff_datetime) AS dropoff_date,
    ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
    assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag,
    assumeNotNull(rate_code_id) AS rate_code_id,
    assumeNotNull(pickup_longitude) AS pickup_longitude,
    assumeNotNull(pickup_latitude) AS pickup_latitude,
    assumeNotNull(dropoff_longitude) AS dropoff_longitude,
    assumeNotNull(dropoff_latitude) AS dropoff_latitude,
    assumeNotNull(passenger_count) AS passenger_count,
    assumeNotNull(trip_distance) AS trip_distance,
    assumeNotNull(fare_amount) AS fare_amount,
    assumeNotNull(extra) AS extra,
    assumeNotNull(mta_tax) AS mta_tax,
    assumeNotNull(tip_amount) AS tip_amount,
    assumeNotNull(tolls_amount) AS tolls_amount,
    assumeNotNull(ehail_fee) AS ehail_fee,
    assumeNotNull(improvement_surcharge) AS improvement_surcharge,
    assumeNotNull(total_amount) AS total_amount,
    CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_,
    assumeNotNull(trip_type) AS trip_type,
    ifNull(toFixedString(unhex(pickup), 25), toFixedString('', 25)) AS pickup,
    ifNull(toFixedString(unhex(dropoff), 25), toFixedString('', 25)) AS dropoff,
    CAST(assumeNotNull(cab_type) AS Enum8('yellow' = 1, 'green' = 2, 'uber' = 3)) AS cab_type,
    assumeNotNull(pickup_nyct2010_gid) AS pickup_nyct2010_gid,
    toFloat32(ifNull(pickup_ctlabel, '0')) AS pickup_ctlabel,
    assumeNotNull(pickup_borocode) AS pickup_borocode,
    assumeNotNull(pickup_boroname) AS pickup_ct2010,
    toFixedString(ifNull(pickup_boroct2010, '0000000'), 7) AS pickup_boroct2010,
    assumeNotNull(ifNull(pickup_cdeligibil, ' ')) AS pickup_cdeligibil,
    toFixedString(ifNull(pickup_ntacode, '0000'), 4) AS pickup_ntacode,
    assumeNotNull(pickup_ntaname) AS pickup_ntaname,
    toUInt16(ifNull(pickup_puma, '0')) AS pickup_puma,
    assumeNotNull(dropoff_nyct2010_gid) AS dropoff_nyct2010_gid,
    toFloat32(ifNull(dropoff_ctlabel, '0')) AS dropoff_ctlabel,
    assumeNotNull(dropoff_borocode) AS dropoff_borocode,
    assumeNotNull(dropoff_boroname) AS dropoff_ct2010,
    toFixedString(ifNull(dropoff_boroct2010, '0000000'), 7) AS dropoff_boroct2010,
    assumeNotNull(ifNull(dropoff_cdeligibil, ' ')) AS dropoff_cdeligibil,
    toFixedString(ifNull(dropoff_ntacode, '0000'), 4) AS dropoff_ntacode,
    assumeNotNull(dropoff_ntaname) AS dropoff_ntaname,
    toUInt16(ifNull(dropoff_puma, '0')) AS dropoff_puma
FROM trips

Benchmarking ClickHouse

For this benchmark, I'll run the SQL commands via the ClickHouse CLI. By default, the CLI prints timing telemetry with millisecond granularity after each SELECT statement.

The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".

$ clickhouse-client

The following completed in 1.034 seconds.

SELECT cab_type, count(*)
FROM trips_mergetree
GROUP BY cab_type

The following completed in 3.058 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips_mergetree
GROUP BY passenger_count

The following completed in 5.354 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year

The following completed in 12.748 seconds.

SELECT passenger_count,
       toYear(pickup_date) AS year,
       round(trip_distance) AS distance,
       count(*)
FROM trips_mergetree
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC

These results put ClickHouse in the #6 position on my benchmarks recap board. The performance seen, given the fact the hardware behind this benchmark wasn't anything to write home about, is impressive.

It's a shame this software isn't more widely popular as it makes excellent use of commodity hardware. Fingers crossed this blog post goes some way to encourage those in the data community to give ClickHouse a try.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.