Home | Benchmarks | Categories | Atom Feed

Posted on Tue 19 March 2024 under Databases

1.1 Billion Taxi Rides using ClickHouse on Intel's Core i9-14900K

In this post, I'm going to see how fast ClickHouse can run the 1.1 billion taxi rides benchmark. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, DuckDB, Elasticsearch, EMR, Hydrolix, kdb+/q, MapD / OmniSci / HEAVY.AI, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

My Workstation

For this benchmark, I'm using a 6 GHz Intel Core i9-14900K CPU. It has 8 performance cores and 16 efficiency cores with a total of 32 threads and 32 MB of L2 cache. It has a liquid cooler attached and is housed in a spacious, full-sized, Cooler Master HAF 700 computer case. I've come across videos on YouTube where people have managed to overclock the i9-14900K to 9.1 GHz.

The system has 48 GB of DDR5 RAM clocked at 5,200 MHz and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

There is also a 2 TB SSD connected via a SATA interface which contains the original taxi rides dataset. This drive peaks around 5-600 MB/s. This is my system's D drive.

The system is powered by a 1,200-watt, fully modular, Corsair Power Supply and is sat on an ASRock Z790 Pro RS Motherboard.

I'm running Ubuntu 22 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and I use ArcGIS Pro from time to time which only supports Windows natively.

ClickHouse Up & Running

Two years ago, ClickHouse began distributing as a single binary that can be downloaded with a single command. The following downloaded version to my home folder.

$ cd ~
$ curl https://clickhouse.com/ | sh

Importing 1.1 Billion Trips

The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over six years. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

I'll read the source data from /mnt/d/taxi which is on the SATA-connected SSD.

ClickHouse can both host data as a server and/or run in stand-alone 'local' mode. When run in local mode, will create a temporary working folder on the primary drive, the 5th-generation NVMe drive in this case, to store its data.

Below I'll launch ClickHouse in local mode and import the taxi ride dataset off of the SATA-backed SSD.

$ cd /mnt/d/taxi
$ ~/clickhouse
    trip_id                 INT,
    vendor_id               VARCHAR(3),
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DECIMAL(18,14),
    pickup_latitude         DECIMAL(18,14),
    dropoff_longitude       DECIMAL(18,14),
    dropoff_latitude        DECIMAL(18,14),
    passenger_count         SMALLINT,
    trip_distance           DECIMAL(18,6),
    fare_amount             DECIMAL(18,6),
    extra                   DECIMAL(18,6),
    mta_tax                 DECIMAL(18,6),
    tip_amount              DECIMAL(18,6),
    tolls_amount            DECIMAL(18,6),
    ehail_fee               DECIMAL(18,6),
    improvement_surcharge   DECIMAL(18,6),
    total_amount            DECIMAL(18,6),
    payment_type            VARCHAR(3),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),

    cab_type                VARCHAR(6),

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) ENGINE = Log;

    SELECT *
    FROM file('trips_*.csv.gz', CSV);

The above completed in 1 hour, 45 minutes and 58 seconds. The CPU sat at less than 10% utilisation during the import. RAM consumption started out at 13.5 GB and grew to 22 GB by the time the import was 20% complete. The SATA-backed SSD read at rates of 10-25 MB/s and there were occasional bursts of writes to the 5th-generation SSD at 500 MB/s.

Below are a few of the temporary files ClickHouse used to store the dataset in its internal format.

$ sudo lsof -OnP \
    | grep -o '/tmp/clickhouse-local.*' \
    | head

The dataset needs to be imported initially in row-oriented form. Once imported, a second, columnar-oriented table, which is more suited for analytics, can be built. The storage engine behind this table is called 'MergeTree'.

CREATE TABLE trips_mergetree
    ENGINE MergeTree()
    ORDER BY (pickup_date, pickup_datetime)
    SETTINGS index_granularity=8192
        CAST(vendor_id AS Enum8('1' = 1,
                                '2' = 2,
                                'CMT' = 3,
                                'VTS' = 4,
                                'DDS' = 5,
                                'B02512' = 10,
                                'B02598' = 11,
                                'B02617' = 12,
                                'B02682' = 13,
                                'B02764' = 14)) AS vendor_id,
        toDate(pickup_datetime)                 AS pickup_date,
        ifNull(pickup_datetime, toDateTime(0))  AS pickup_datetime,
        toDate(dropoff_datetime)                AS dropoff_date,
        ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
        assumeNotNull(store_and_fwd_flag)       AS store_and_fwd_flag,
        assumeNotNull(rate_code_id)             AS rate_code_id,

        assumeNotNull(pickup_longitude)         AS pickup_longitude,
        assumeNotNull(pickup_latitude)          AS pickup_latitude,
        assumeNotNull(dropoff_longitude)        AS dropoff_longitude,
        assumeNotNull(dropoff_latitude)         AS dropoff_latitude,
        assumeNotNull(passenger_count)          AS passenger_count,
        assumeNotNull(trip_distance)            AS trip_distance,
        assumeNotNull(fare_amount)              AS fare_amount,
        assumeNotNull(extra)                    AS extra,
        assumeNotNull(mta_tax)                  AS mta_tax,
        assumeNotNull(tip_amount)               AS tip_amount,
        assumeNotNull(tolls_amount)             AS tolls_amount,
        assumeNotNull(ehail_fee)                AS ehail_fee,
        assumeNotNull(improvement_surcharge)    AS improvement_surcharge,
        assumeNotNull(total_amount)             AS total_amount,
        assumeNotNull(payment_type)             AS payment_type_,
        assumeNotNull(trip_type)                AS trip_type,

            AS Enum8('yellow' = 1, 'green' = 2))
                                AS cab_type,

        pickup AS pickup,
        pickup AS dropoff,

        precipitation           AS precipitation,
        snow_depth              AS snow_depth,
        snowfall                AS snowfall,
        max_temperature         AS max_temperature,
        min_temperature         AS min_temperature,
        average_wind_speed      AS average_wind_speed,

        pickup_nyct2010_gid     AS pickup_nyct2010_gid,
        pickup_ctlabel          AS pickup_ctlabel,
        pickup_borocode         AS pickup_borocode,
        pickup_boroname         AS pickup_boroname,
        pickup_ct2010           AS pickup_ct2010,
        pickup_boroct2010       AS pickup_boroct2010,
        pickup_cdeligibil       AS pickup_cdeligibil,
        pickup_ntacode          AS pickup_ntacode,
        pickup_ntaname          AS pickup_ntaname,
        pickup_puma             AS pickup_puma,

        dropoff_nyct2010_gid    AS dropoff_nyct2010_gid,
        dropoff_ctlabel         AS dropoff_ctlabel,
        dropoff_borocode        AS dropoff_borocode,
        dropoff_boroname        AS dropoff_boroname,
        dropoff_ct2010          AS dropoff_ct2010,
        dropoff_boroct2010      AS dropoff_boroct2010,
        dropoff_cdeligibil      AS dropoff_cdeligibil,
        dropoff_ntacode         AS dropoff_ntacode,
        dropoff_ntaname         AS dropoff_ntaname,
        dropoff_puma            AS dropoff_puma
    FROM trips;

The MergeTree table was built in 27 minutes and 8 seconds. RAM consumption started out at 21.3 GB and topped out at just under 24 GB. The CPU utilisation was under 25% during most of the import with ~10 cores intermittently coming under heavy load. Below is a screenshot of the Windows' Task Manager during this operation.

Windows' Task Manager during ClickHouse Import

The 5th-generation SSD read at 80-200 MB/s with write bursts every 20-30 seconds of 600 MB/s - 1 GB/s. Below is a screenshot from the Windows' Task Manager during this operation.

Windows' Task Manager during ClickHouse Import


The following were the fastest times I saw after running each query multiple times on the MergeTree table.

The following completed in 0.088 seconds.

SELECT   cab_type,
FROM     trips_mergetree
GROUP BY cab_type;

The following completed in 0.51 seconds.

SELECT   passenger_count,
FROM     trips_mergetree
GROUP BY passenger_count;

The following completed in 0.424 seconds.

SELECT   passenger_count,
         toYear(pickup_date) AS year,
FROM     trips_mergetree
GROUP BY passenger_count,

The following completed in 1.257 seconds.

SELECT   passenger_count,
         toYear(pickup_date) AS year,
         round(trip_distance) AS distance,
FROM     trips_mergetree
GROUP BY passenger_count,
ORDER BY year,
         count(*) DESC;
Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2024 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.