ClickHouse is an open-source, columnar-oriented database that's been developed primarily by engineers at Yandex. Yandex is one of Europe's largest Internet-focused businesses. There search engine alone was receiving 150 million searches a day in 2012. ClickHouse has been deployed among a number of their businesses including their Metrica offering which is the world's second-largest web analytics platform. Outside of Yandex, ClickHouse has also been deployed at CERN where it was used to analyse events from the Large Hadron Collider.
ClickHouse is very feature-rich. It supports 16 different table engines, its CLI includes animated progress indicators and syntax highlighting and its performance puts it into its own class among open source offerings.
That being said, the software isn't exactly a drop-in replacement for something like PostgreSQL. It lacks transactions, full-fledged UPDATE and DELETE statements and calls like DROP TABLE won't work on tables above 54 GB without first changing the server's configuration. Nonetheless, it's a very respectable data store.
Alexey Milovidov, a senior software engineer at Yandex and a core contributor to ClickHouse, put together a guide to benchmarking the database using a variation of the 1.1 billion taxi ride dataset I use in my benchmarks.
In this blog post, I'll be running a benchmark on ClickHouse using the exact same set I've used to benchmark Amazon Athena, BigQuery, Elasticsearch, kdb+/q, MapD, PostgreSQL, Presto, Redshift, Spark and Vertica. I've compiled a single-page summary of these benchmarks.
For this benchmark I'll be running a fresh installation of Ubuntu 14.04.3 LTS on an Intel Core i5 4670K clocked at 3.4 GHz, 16 GB of DDR3 RAM and a SanDisk SDSSDHII960G 960 GB SSD drive.
ClickHouse Up & Running
I'll first run a few commands so that I can install ClickHouse from pre-made packages.
$ echo "deb http://repo.yandex.ru/clickhouse/trusty stable main" | \ sudo tee -a /etc/apt/sources.list $ sudo apt-key adv \ --keyserver keyserver.ubuntu.com \ --recv E0C56BD4 $ sudo apt update
I'll then install version 1.1.54140 of the client and server software.
$ sudo apt install \ clickhouse-client \ clickhouse-server-common
I'll then launch the server. I haven't set up any bespoke configuration so all the settings are what you'd get right out of the box.
$ sudo service clickhouse-server start
Loading 1.1 Billion Trips into ClickHouse
The dataset I'm working with is broken up into 56 gzip-compressed, CSV files. I'll first import them into a "trips" table which is backed by the "Log" engine.
CREATE TABLE trips ( trip_id UInt32, vendor_id String, pickup_datetime DateTime, dropoff_datetime Nullable(DateTime), store_and_fwd_flag Nullable(FixedString(1)), rate_code_id Nullable(UInt8), pickup_longitude Nullable(Float64), pickup_latitude Nullable(Float64), dropoff_longitude Nullable(Float64), dropoff_latitude Nullable(Float64), passenger_count Nullable(UInt8), trip_distance Nullable(Float64), fare_amount Nullable(Float32), extra Nullable(Float32), mta_tax Nullable(Float32), tip_amount Nullable(Float32), tolls_amount Nullable(Float32), ehail_fee Nullable(Float32), improvement_surcharge Nullable(Float32), total_amount Nullable(Float32), payment_type Nullable(String), trip_type Nullable(UInt8), pickup Nullable(String), dropoff Nullable(String), cab_type Nullable(String), precipitation Nullable(Int8), snow_depth Nullable(Int8), snowfall Nullable(Int8), max_temperature Nullable(Int8), min_temperature Nullable(Int8), average_wind_speed Nullable(Int8), pickup_nyct2010_gid Nullable(Int8), pickup_ctlabel Nullable(String), pickup_borocode Nullable(Int8), pickup_boroname Nullable(String), pickup_ct2010 Nullable(String), pickup_boroct2010 Nullable(String), pickup_cdeligibil Nullable(FixedString(1)), pickup_ntacode Nullable(String), pickup_ntaname Nullable(String), pickup_puma Nullable(String), dropoff_nyct2010_gid Nullable(UInt8), dropoff_ctlabel Nullable(String), dropoff_borocode Nullable(UInt8), dropoff_boroname Nullable(String), dropoff_ct2010 Nullable(String), dropoff_boroct2010 Nullable(String), dropoff_cdeligibil Nullable(String), dropoff_ntacode Nullable(String), dropoff_ntaname Nullable(String), dropoff_puma Nullable(String) ) ENGINE = Log;
The dataset itself uses commas for delimiting fields. None of the contents of the data contains any commas themselves so there are no quotations used to aid escaping data. NULL values are defined by the simple absence of any content between the comma delimiters. Normally this isn't an issue but with ClickHouse empty fields won't be treated as NULLs in order to avoid ambiguity with empty strings. For this reason, I need to pipe the data through a transformation script that will replace all empty values with \N.
Below is the transformation script.
$ cat trans.py
import sys for line in sys.stdin: print ','.join([item if len(item.strip()) else '\N' for item in line.strip().split(',')])
This is the bash command I used to import the 1.1 billion records into the trips table.
$ time (for filename in /home/mark/trips/trips_x*.csv.gz; do gunzip -c $filename | \ python trans.py | \ clickhouse-client \ --query="INSERT INTO trips FORMAT CSV" done)
The above completed in 3 hours 27 minutes and 35 seconds. There is little doubt that if I didn't use a Python script for transforming the contents that this time would have been significantly reduced. The following shows the top command during the import and you can see the Python script is eating up a lot of resources.
... %CPU %MEM TIME+ COMMAND ... 99.2 0.1 0:32.76 python trans.py ... 45.3 12.3 0:17.14 clickhouse-client --query=INSERT INTO trips FORMAT CSV ... 28.3 0.0 0:09.89 gzip -d -c /home/mark/trips/trips_xac.csv.gz
With the log engine-backed table in place, I'll create a new table that'll use the MergeTree engine. This table will be populated with a transformation of the trips table's contents.
In Milovidov's guide, a number of the fields for this table were converted from strings into enumerate columns. Some of these columns didn't support every value seen in my version of the dataset so I needed to remove the casting process for these fields. Below is my version of the MergeTree-backed table.
The following completed in 39 minutes and 57 seconds.
CREATE TABLE trips_mergetree ENGINE = MergeTree(pickup_date, pickup_datetime, 8192) AS SELECT trip_id, CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id, toDate(pickup_datetime) AS pickup_date, ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime, toDate(dropoff_datetime) AS dropoff_date, ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime, assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag, assumeNotNull(rate_code_id) AS rate_code_id, assumeNotNull(pickup_longitude) AS pickup_longitude, assumeNotNull(pickup_latitude) AS pickup_latitude, assumeNotNull(dropoff_longitude) AS dropoff_longitude, assumeNotNull(dropoff_latitude) AS dropoff_latitude, assumeNotNull(passenger_count) AS passenger_count, assumeNotNull(trip_distance) AS trip_distance, assumeNotNull(fare_amount) AS fare_amount, assumeNotNull(extra) AS extra, assumeNotNull(mta_tax) AS mta_tax, assumeNotNull(tip_amount) AS tip_amount, assumeNotNull(tolls_amount) AS tolls_amount, assumeNotNull(ehail_fee) AS ehail_fee, assumeNotNull(improvement_surcharge) AS improvement_surcharge, assumeNotNull(total_amount) AS total_amount, CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_, assumeNotNull(trip_type) AS trip_type, ifNull(toFixedString(unhex(pickup), 25), toFixedString('', 25)) AS pickup, ifNull(toFixedString(unhex(dropoff), 25), toFixedString('', 25)) AS dropoff, CAST(assumeNotNull(cab_type) AS Enum8('yellow' = 1, 'green' = 2, 'uber' = 3)) AS cab_type, assumeNotNull(pickup_nyct2010_gid) AS pickup_nyct2010_gid, toFloat32(ifNull(pickup_ctlabel, '0')) AS pickup_ctlabel, assumeNotNull(pickup_borocode) AS pickup_borocode, assumeNotNull(pickup_boroname) AS pickup_ct2010, toFixedString(ifNull(pickup_boroct2010, '0000000'), 7) AS pickup_boroct2010, assumeNotNull(ifNull(pickup_cdeligibil, ' ')) AS pickup_cdeligibil, toFixedString(ifNull(pickup_ntacode, '0000'), 4) AS pickup_ntacode, assumeNotNull(pickup_ntaname) AS pickup_ntaname, toUInt16(ifNull(pickup_puma, '0')) AS pickup_puma, assumeNotNull(dropoff_nyct2010_gid) AS dropoff_nyct2010_gid, toFloat32(ifNull(dropoff_ctlabel, '0')) AS dropoff_ctlabel, assumeNotNull(dropoff_borocode) AS dropoff_borocode, assumeNotNull(dropoff_boroname) AS dropoff_ct2010, toFixedString(ifNull(dropoff_boroct2010, '0000000'), 7) AS dropoff_boroct2010, assumeNotNull(ifNull(dropoff_cdeligibil, ' ')) AS dropoff_cdeligibil, toFixedString(ifNull(dropoff_ntacode, '0000'), 4) AS dropoff_ntacode, assumeNotNull(dropoff_ntaname) AS dropoff_ntaname, toUInt16(ifNull(dropoff_puma, '0')) AS dropoff_puma FROM trips
For this benchmark, I'll run the SQL commands via the ClickHouse CLI. By default, the CLI prints timing telemetry with millisecond granularity after each SELECT statement.
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
The following completed in 1.034 seconds.
SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type
The following completed in 3.058 seconds.
SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count
The following completed in 5.354 seconds.
SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetree GROUP BY passenger_count, year
The following completed in 12.748 seconds.
SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*) FROM trips_mergetree GROUP BY passenger_count, year, distance ORDER BY year, count(*) DESC
These results put ClickHouse in the #6 position on my benchmarks recap board. The performance seen, given the fact the hardware behind this benchmark wasn't anything to write home about, is impressive.
It's a shame this software isn't more widely popular as it makes excellent use of commodity hardware. Fingers crossed this blog post goes some way to encourage those in the data community to give ClickHouse a try.