
Posted on Fri 15 March 2024 under Databases

1.1 Billion Taxi Rides using DuckDB

DuckDB is an in-process database. Rather than relying on a server of its own, it runs inside the client application. That client can work with data in memory, in DuckDB's internal file format, in database servers from other software vendors and in cloud storage services such as AWS S3.

This choice not to centralise DuckDB's data within its own server, paired with its distribution as a single binary, makes installing and working with DuckDB much less complex than, say, standing up a Hadoop cluster.

The project isn't aimed at very large datasets. Despite this, its ergonomics are enticing enough and it does so much to reduce engineering time that workarounds are worth considering. The rising popularity of analysis-ready, cloud-optimised Parquet files is removing the need for substantial hardware when dealing with datasets in the 100s of GBs or larger.

DuckDB is primarily the work of Mark Raasveldt and Hannes Mühleisen. It's made up of a million lines of C++ and runs as a stand-alone binary. Development is very active with the commit count on its GitHub repo doubling nearly every year since it began in 2018. DuckDB uses PostgreSQL's SQL parser, Google’s RE2 regular expression engine and SQLite's shell.

SQLite supports five data types: NULL, INTEGER, REAL, TEXT and BLOB. This always frustrated me: working with time required transforms in every SELECT statement, and not being able to declare a field as a boolean meant analysis software couldn't automatically recognise such fields and offer appropriate UI controls and visualisations for them.

Thankfully, DuckDB supports 25 data types out of the box and more can be added via extensions.
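
As a quick, hypothetical illustration (not part of the benchmark), the sketch below declares timestamps and booleans directly and queries them without any casting.

CREATE TABLE example_trips (
    pickup_datetime  TIMESTAMP, -- native timestamp, no per-query transforms needed
    shared_ride      BOOLEAN    -- analysis tools can recognise this and offer a checkbox filter
);

INSERT INTO example_trips VALUES
    ('2015-06-01 08:30:00', TRUE),
    ('2015-06-01 09:10:00', FALSE);

SELECT DATE_PART('hour', pickup_datetime) AS hour,
       COUNT(*) FILTER (WHERE shared_ride) AS shared_rides
FROM   example_trips
GROUP BY hour;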

A min-max index is created for every column segment in DuckDB. This index type is how most OLAP database engines answer aggregate queries so quickly. Parquet and JSON extensions are shipped in the official build and their usage is well documented. Both Snappy and ZStandard compression are supported for Parquet files.
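
To sketch how those min-max entries (also known as zonemaps) help: in a range-filtered aggregate like the hypothetical query below, run against the trips table built later in this post, DuckDB can compare each column segment's stored minimum and maximum with the predicate and skip any segment whose pickup_datetime range falls entirely outside 2014 without reading its rows.

SELECT COUNT(*),
       AVG(total_amount)
FROM   trips
WHERE  pickup_datetime >= '2014-01-01'
AND    pickup_datetime <  '2015-01-01';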

DuckDB's documentation is well-organised and refreshingly terse with examples next to most descriptions.

In this post, I'm going to see how fast DuckDB can run the 1.1 billion taxi rides benchmark. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, Hydrolix, kdb+/q, MapD / OmniSci / HEAVY.AI, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

My Workstation

For this benchmark, I'm using a 6 GHz Intel Core i9-14900K CPU. It has 8 performance cores and 16 efficiency cores with a total of 32 threads and 32 MB of L2 cache. It has a liquid cooler attached and is housed in a spacious, full-sized, Cooler Master HAF 700 computer case. I've come across videos on YouTube where people have managed to overclock the i9-14900KF to 9.1 GHz.

The system has 48 GB of DDR5 RAM clocked at 5,200 MHz and a 5th-generation, Crucial T700 4 TB NVMe M.2 SSD which can read at speeds up to 12,400 MB/s. There is a heatsink on the SSD to help keep its temperature down. This is my system's C drive.

There is also a 2 TB SSD connected via a SATA interface which contains the original taxi rides dataset. This drive peaks at around 500-600 MB/s. This is my system's D drive.

The system is powered by a 1,200-watt, fully modular, Corsair Power Supply and is sat on an ASRock Z790 Pro RS Motherboard.

I'm running Ubuntu 22 LTS via Microsoft's Ubuntu for Windows on Windows 11 Pro. In case you're wondering why I don't run a Linux-based desktop as my primary work environment, I'm still using an Nvidia GTX 1080 GPU which has better driver support on Windows and I use ArcGIS Pro from time to time which only supports Windows natively.

DuckDB Up & Running

I'll first install some dependencies that will be used in this post.

$ sudo apt update
$ sudo apt install \
    build-essential \
    cmake \
    pigz \
    python3-virtualenv \
    zip

I'll then download and install the official binary for DuckDB v0.10.0 below.

$ cd ~
$ wget -c https://github.com/duckdb/duckdb/releases/download/v0.10.0/duckdb_cli-linux-amd64.zip
$ unzip -j duckdb_cli-linux-amd64.zip
$ chmod +x duckdb
$ ~/duckdb
INSTALL parquet;
$ vi ~/.duckdbrc
.timer on
.width 180
LOAD parquet;

Importing 1.1 Billion Trips into DuckDB

The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over six years. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.

I'll read the source data from /mnt/d/taxi which is on the SATA-connected SSD and create a DuckDB file /mnt/c/taxi/taxi.duckdb on the 5th-generation NVMe drive.

$ cd /mnt/d/taxi/
$ vi create.sql

Rather than rely on DuckDB's type inference, I'll create a table for the data ahead of time. This will ensure I have the right granularity of data types for each column.

CREATE OR REPLACE TABLE trips (
    trip_id                 BIGINT,
    vendor_id               VARCHAR,

    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,

    store_and_fwd_flag      VARCHAR,
    rate_code_id            BIGINT,
    pickup_longitude        DOUBLE,
    pickup_latitude         DOUBLE,
    dropoff_longitude       DOUBLE,
    dropoff_latitude        DOUBLE,
    passenger_count         BIGINT,
    trip_distance           DOUBLE,
    fare_amount             DOUBLE,
    extra                   DOUBLE,
    mta_tax                 DOUBLE,
    tip_amount              DOUBLE,
    tolls_amount            DOUBLE,
    ehail_fee               DOUBLE,
    improvement_surcharge   DOUBLE,
    total_amount            DOUBLE,
    payment_type            VARCHAR,
    trip_type               VARCHAR,
    pickup                  VARCHAR,
    dropoff                 VARCHAR,

    cab_type                VARCHAR,

    precipitation           BIGINT,
    snow_depth              BIGINT,
    snowfall                BIGINT,
    max_temperature         BIGINT,
    min_temperature         BIGINT,
    average_wind_speed      BIGINT,

    pickup_nyct2010_gid     BIGINT,
    pickup_ctlabel          VARCHAR,
    pickup_borocode         BIGINT,
    pickup_boroname         VARCHAR,
    pickup_ct2010           VARCHAR,
    pickup_boroct2010       BIGINT,
    pickup_cdeligibil       VARCHAR,
    pickup_ntacode          VARCHAR,
    pickup_ntaname          VARCHAR,
    pickup_puma             VARCHAR,

    dropoff_nyct2010_gid    BIGINT,
    dropoff_ctlabel         VARCHAR,
    dropoff_borocode        BIGINT,
    dropoff_boroname        VARCHAR,
    dropoff_ct2010          VARCHAR,
    dropoff_boroct2010      VARCHAR,
    dropoff_cdeligibil      VARCHAR,
    dropoff_ntacode         VARCHAR,
    dropoff_ntaname         VARCHAR,
    dropoff_puma            VARCHAR);
$ ~/duckdb /mnt/c/taxi/taxi.duckdb < create.sql
$ ~/duckdb /mnt/c/taxi/taxi.duckdb
INSERT INTO trips
    SELECT *
    FROM read_csv('trips_x*.csv.gz');

Reads peaked at 60 MB/s but mostly sat at ~40 MB/s, and writes sat at ~20 MB/s on the SATA drive. CPU usage sat around 25% and DuckDB's RAM consumption quickly grew to around 22 GB. At one point, I could see I was about to run out of RAM, so I cancelled the job.

One CSV At A Time

My second attempt at importing the CSVs into DuckDB was to import one CSV file at a time.

$ for FILENAME in trips_x*.csv.gz; do
    echo $FILENAME
    ~/duckdb -c "INSERT INTO trips
                    SELECT *
                    FROM   READ_CSV('$FILENAME');" \
            /mnt/c/taxi/taxi.duckdb
  done

Initially, this worked well. Reads peaked at 60 MB/s but mostly sat at ~40 MB/s, and writes sat at ~20 MB/s on the SATA drive. DuckDB's RAM consumption peaked at around 10 GB during each import. CPU usage initially sat around 15-25% but eventually hit 100% as more and more CSVs were imported into DuckDB.

When CPU consumption hit 100%, I noticed the overall clock rate dropped to 3.5 GHz and then again to 3.4 GHz before recovering into the 4 GHz range.

Windows' Task Manager during DuckDB Import

Below is a screenshot from Speccy from around the same time.

Speccy during DuckDB Import

The import progressed well until the 36th CSV file but from there, progress halted with the CPU sat at 100%. The DuckDB file was around 70 GB at this point. After 45 minutes of very little disk activity and the CPU running at full throttle, I cancelled the job.

Building Parquet Files

My third attempt at benchmarking this dataset was to just convert each CSV file into a Parquet file and then query those.

SSDs are consumables and, given enough usage, can be expected to fail at some point. The 5th-generation SSD cost me €616 whereas a new SATA-based SSD with 2 TB of capacity runs around €100. For this reason, I'll run the benchmarks on the faster drive but build the Parquet files on the cheaper drive.

Normally I sort the data by longitude as it results in smaller Parquet files but, to keep RAM consumption to a minimum, I won't ask DuckDB for any particular order this time.
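
For reference, the sorted variant of the export would look something like the sketch below, with a hypothetical output filename. The conversion loop that follows leaves the ORDER BY out.

COPY (
    SELECT *
    FROM   trips
    ORDER BY pickup_longitude
) TO 'trips_sorted.pq' (FORMAT 'PARQUET',
                        CODEC  'ZSTD',
                        ROW_GROUP_SIZE 15000);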

$ for FILENAME in trips_x*.csv.gz; do
    echo $FILENAME

    OUT=`echo $FILENAME | sed 's/.csv.gz//g'`

    touch working.duckdb
    rm working.duckdb

    ~/duckdb working.duckdb < create.sql
    ~/duckdb -c "INSERT INTO trips
                    SELECT *
                    FROM read_csv('$FILENAME');

                 SET preserve_insertion_order=false;
                 SET memory_limit='8GB';

                 COPY(
                     SELECT *
                     FROM   trips
                 ) TO '$OUT.pq' (FORMAT 'PARQUET',
                                 CODEC  'ZSTD',
                                 ROW_GROUP_SIZE 15000);" \
            working.duckdb
  done

The above peaked at around 20 GB of RAM consumption. The CSVs were read at anywhere between 40 and 120 MB/s and writes peaked at around 120 MB/s. CPU consumption barely broke 20% during some operations and peaked at 100% across all cores during others.

Most of the GZIP-compressed CSV files are 1.9 GB in size and the resulting PQ files are around 1.5 GB.

One Problematic CSV

Unfortunately, DuckDB got stuck on trips_xbj.csv.gz. The CPU was maxed out with the working DuckDB file stuck at 868 MB of the ~2 GB that would normally be generated. I suspected the CSV was too large, so I tried splitting it up into several 500K-line CSV files.

$ pigz -dc trips_xbj.csv.gz \
    | split --lines=500000 \
            --filter="pigz > trips_xbj_\$FILE.csv.gz"

$ for FILENAME in trips_xbj_x*.csv.gz; do
    echo $FILENAME

    OUT=`echo $FILENAME | sed 's/.csv.gz//g'`

    touch working.duckdb
    rm working.duckdb

    ~/duckdb working.duckdb < create.sql
    ~/duckdb -c "INSERT INTO trips
                    SELECT *
                    FROM read_csv('$FILENAME');

                 SET preserve_insertion_order=false;
                 SET memory_limit='8GB';

                 COPY(
                     SELECT *
                     FROM   trips
                 ) TO '$OUT.pq' (FORMAT 'PARQUET',
                                 CODEC  'ZSTD',
                                 ROW_GROUP_SIZE 15000);" \
            working.duckdb
  done

When I ran the above, DuckDB ended up getting stuck on trips_xbj_xas.csv.gz with the same symptoms as before.

Compiling DuckDB

I decided to compile the main branch of DuckDB and see if it could process the above CSV without issue.

$ git clone https://github.com/duckdb/duckdb.git ~/duckdb_source
$ cd ~/duckdb_source

$ mkdir -p build/release

$ cmake \
    ./CMakeLists.txt \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DEXTENSION_STATIC_BUILD=1 \
    -DBUILD_PARQUET_EXTENSION=1 \
    -B build/release

$ CMAKE_BUILD_PARALLEL_LEVEL=$(nproc) \
    cmake --build build/release

I re-installed the Parquet extension for this distinct DuckDB build.

$ ~/duckdb_source/build/release/duckdb
INSTALL parquet;

I then re-ran the CSV to Parquet conversion process on the problematic CSV.

$ cd /mnt/d/taxi

$ for FILENAME in trips_xbj_xas.csv.gz; do
    echo $FILENAME

    OUT=`echo $FILENAME | sed 's/.csv.gz//g'`

    touch working.duckdb
    rm working.duckdb

    ~/duckdb_source/build/release/duckdb working.duckdb < create.sql
    ~/duckdb_source/build/release/duckdb \
        -c "INSERT INTO trips
                SELECT *
                FROM read_csv('$FILENAME');

            SET preserve_insertion_order=false;
            SET memory_limit='8GB';

            COPY(
                SELECT *
                FROM   trips
            ) TO '$OUT.pq' (FORMAT 'PARQUET',
                            CODEC  'ZSTD',
                            ROW_GROUP_SIZE 15000);" \
        working.duckdb
  done

I was hoping that a CSV parsing bug had been fixed in the weeks since the 0.10.0 release and that it would resolve the issue. Instead, I got an error message that I hadn't seen with the official binary release.

Conversion Error: CSV Error on Line: 220725
Error when converting column "column20".
Could not convert string "Cash" to 'BIGINT'

  file=trips_xbj_xas.csv.gz
  delimiter = , (Auto-Detected)
  quote = \0 (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = false (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=0
  all_varchar=0

The auto-generated column names start from column00, so column20 is the 21st field. In create.sql, the 21st field is payment_type, which is a VARCHAR.

I've removed the readability spacing between the fields in create.sql. The first line holds the CREATE TABLE statement, so the 22nd line is the 21st field's name and type.

$ grep -n -B1 -A1 payment create.sql
21-         total_amount            DOUBLE,
22:         payment_type            VARCHAR,
23-         trip_type               VARCHAR,

The field is a VARCHAR, so I'm confused as to why DuckDB would want to cast it to a BIGINT.
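
Had I spotted it at the time, one possible workaround would have been to override the sniffer's guess for that one column and pass an explicit type to read_csv. The sketch below is untested against this file and the types parameter, keyed by the auto-generated column name, is my assumption of how the override would be spelled.

INSERT INTO trips
    SELECT *
    FROM read_csv('trips_xbj_xas.csv.gz',
                  -- untested: force the 21st auto-named column to VARCHAR
                  -- rather than trusting the sampled type detection
                  types={'column20': 'VARCHAR'});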

Ignoring Errors

I'll run the CSV-to-Parquet conversion again with ignore_errors=true to see if I can get past this record and gauge how widespread the issue is.

$ for FILENAME in trips_xbj_xas.csv.gz; do
    echo $FILENAME

    OUT=`echo $FILENAME | sed 's/.csv.gz//g'`

    touch working.duckdb
    rm working.duckdb

    ~/duckdb_source/build/release/duckdb working.duckdb < create.sql
    ~/duckdb_source/build/release/duckdb \
        -c "INSERT INTO trips
                SELECT *
                FROM read_csv('$FILENAME',
                              ignore_errors=true);

            SET preserve_insertion_order=false;
            SET memory_limit='8GB';

            COPY(
                SELECT *
                FROM   trips
            ) TO '$OUT.pq' (FORMAT 'PARQUET',
                            CODEC  'ZSTD',
                            ROW_GROUP_SIZE 15000);" \
        working.duckdb
  done

There should be 500K records in the resulting Parquet file but there aren't even 280K.

SELECT COUNT(*)
FROM READ_PARQUET('trips_xbj_xas.pq');
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       279904 │
└──────────────┘

I also found that trips_xbs.pq and trips_xbv.pq were collectively missing around 2M records.

$ for FILENAME in trips_x*.pq; do
    echo $FILENAME, `~/duckdb -csv -c "SELECT COUNT(*) FROM READ_PARQUET('$FILENAME')"`
  done
...
trips_xbs.pq, .. count_star() 19999999
trips_xbt.pq, .. count_star() 20000000
trips_xbu.pq, .. count_star() 20000000
trips_xbv.pq, .. count_star() 17874858
...

Troublesome Records

This is the portion of the record DuckDB was complaining about.

$ gunzip -c trips_xbj_xas.csv.gz \
    | head -n220725 \
    | tail -n1 \
    > unparsable.csv
$ grep -o '.*Cash' unparsable.csv
26948130,CMT,2009-01-10 11:24:50,2009-01-10 11:28:39,,,-73.980964,40.779387999999997,-73.97072,40.784225999999997,1,0.90000000000000002,4.5,0,,0,0,,,4.5,Cash

Below I'll use DuckDB to read the record in and see which field the 'Cash' value lands on.

$ ~/duckdb
.mode line

SELECT COLUMNS(c -> c LIKE 'column0%' OR
                    c LIKE 'column1%' OR
                    c LIKE 'column20')
FROM READ_CSV('unparsable.csv');
column00 = 26948130
column01 = CMT
column02 = 2009-01-10 11:24:50
column03 = 2009-01-10 11:28:39
column04 =
column05 =
column06 = -73.980964
column07 = 40.779388
column08 = -73.97072
column09 = 40.784226
column10 = 1
column11 = 0.9
column12 = 4.5
column13 = 0
column14 =
column15 = 0
column16 = 0
column17 =
column18 =
column19 = 4.5
column20 = Cash

The 'Cash' value is on the 21st column so this doesn't look like an alignment issue.

The CSV file can be split on commas alone as none of the cell values contain one. Below, I'll number each field of the record captured above as a double check that 'Cash' lands in the 21st column.

$ python3
s = '''26948130,CMT,2009-01-10 11:24:50,2009-01-10 11:28:39,,,-73.980964,40.779387999999997,-73.97072,40.784225999999997,1,0.90000000000000002,4.5,0,,0,0,,,4.5,Cash'''

for n, v in enumerate(s.split(','), start=1):
    print(n, v)
1 26948130
2 CMT
3 2009-01-10 11:24:50
4 2009-01-10 11:28:39
5
6
7 -73.980964
8 40.779387999999997
9 -73.97072
10 40.784225999999997
11 1
12 0.90000000000000002
13 4.5
14 0
15
16 0
17 0
18
19
20 4.5
21 Cash

Converting with ClickHouse

I'm not sure how to resolve the issue with DuckDB in a timely manner. As a workaround, I'll use ClickHouse to produce the remaining Parquet files.

$ cd ~
$ curl https://clickhouse.com/ | sh
$ cd /mnt/d/taxi
$ ~/clickhouse local
CREATE TABLE trips (
    trip_id                 INT,
    vendor_id               VARCHAR(3),
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DECIMAL(18,14),
    pickup_latitude         DECIMAL(18,14),
    dropoff_longitude       DECIMAL(18,14),
    dropoff_latitude        DECIMAL(18,14),
    passenger_count         SMALLINT,
    trip_distance           DECIMAL(18,6),
    fare_amount             DECIMAL(18,6),
    extra                   DECIMAL(18,6),
    mta_tax                 DECIMAL(18,6),
    tip_amount              DECIMAL(18,6),
    tolls_amount            DECIMAL(18,6),
    ehail_fee               DECIMAL(18,6),
    improvement_surcharge   DECIMAL(18,6),
    total_amount            DECIMAL(18,6),
    payment_type            VARCHAR(3),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),

    cab_type                VARCHAR(6),

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) ENGINE=Log;

INSERT INTO trips
SELECT *
FROM file('trips_xbj_xas.csv.gz', CSV);

SELECT *
FROM trips
INTO OUTFILE 'trips_xbj_xas.pq'
FORMAT Parquet;
500000 rows in set. Elapsed: 2.427 sec. Processed 500.00 thousand rows, 290.10 MB (205.99 thousand rows/s., 119.51 MB/s.)
Peak memory usage: 418.85 MiB.

I tried to get DuckDB to read the above Parquet file but it wasn't able to.

$ ~/duckdb
SELECT   cab_type,
         COUNT(*)
FROM     READ_PARQUET('trips_xbj_xas.pq')
GROUP BY cab_type;
Error: Invalid Error: Unsupported compression codec "7". Supported options are uncompressed, gzip, snappy or zstd

Examining ClickHouse's Parquet File

I've been working on a Parquet debugging tool. I'll use it to examine the compression scheme ClickHouse is using.

$ git clone https://github.com/marklit/pqview \
    ~/pqview

$ virtualenv ~/.pqview
$ source ~/.pqview/bin/activate
$ python3 -m pip install \
          -r ~/pqview/requirements.txt
$ python3 ~/pqview/main.py \
           most-compressed \
           trips_xbj_xas.pq \
    | grep compression
compression: LZ4

ClickHouse changed its default Parquet compression scheme from Snappy to LZ4 at some point last year. LZ4 is great but DuckDB doesn't support it. I'll rebuild the Parquet file using a ClickHouse compatibility setting that'll revert to using Snappy compression instead.

SELECT *
FROM trips
INTO OUTFILE 'trips_xbj_xas.pq'
FORMAT Parquet
SETTINGS compatibility='23.2';
$ python3 ~/pqview/main.py \
          most-compressed \
          trips_xbj_xas.pq \
    | grep compression
compression: SNAPPY

With ClickHouse able to produce the Parquet files using Snappy, I also re-processed both trips_xbs.pq and trips_xbv.pq. I had issues importing them into ClickHouse, with trips_xbv stalling at the 15.73M-row point. I had to break each of them into 4 x 5M-line CSV files to get them imported without issue.
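
The splitting followed the same pattern as earlier; for each of the two files it looked roughly like this:

$ pigz -dc trips_xbs.csv.gz \
    | split --lines=5000000 \
            --filter="pigz > trips_xbs_\$FILE.csv.gz"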

ClickHouse exports timestamp fields to Parquet as uint64s rather than as Parquet timestamps. I used DuckDB to convert these fields back into timestamps. Please excuse the lack of a for loop; a looped version is sketched after the three commands below.

$ ~/duckdb_source/build/release/duckdb working.duckdb
SET preserve_insertion_order=false;
SET memory_limit='8GB';

CREATE OR REPLACE TABLE trips AS
    SELECT * EXCLUDE (pickup_datetime, dropoff_datetime),
           MAKE_TIMESTAMP(pickup_datetime * 1000000) AS pickup_datetime,
           MAKE_TIMESTAMP(dropoff_datetime * 1000000) AS dropoff_datetime,
    FROM READ_PARQUET('trips_xbj_xas.pq');

COPY(
    SELECT *
    FROM trips
) TO 'trips_xbj_xas_ts.pq' (FORMAT 'PARQUET',
                            CODEC  'ZSTD',
                            ROW_GROUP_SIZE 15000);

CREATE OR REPLACE TABLE trips AS
    SELECT * EXCLUDE (pickup_datetime, dropoff_datetime),
           MAKE_TIMESTAMP(pickup_datetime * 1000000) AS pickup_datetime,
           MAKE_TIMESTAMP(dropoff_datetime * 1000000) AS dropoff_datetime,
    FROM READ_PARQUET('trips_xbs.pq');

COPY(
    SELECT *
    FROM trips
) TO 'trips_xbs_ts.pq' (FORMAT 'PARQUET',
                        CODEC  'ZSTD',
                        ROW_GROUP_SIZE 15000);

CREATE OR REPLACE TABLE trips AS
    SELECT * EXCLUDE (pickup_datetime, dropoff_datetime),
           MAKE_TIMESTAMP(pickup_datetime * 1000000) AS pickup_datetime,
           MAKE_TIMESTAMP(dropoff_datetime * 1000000) AS dropoff_datetime,
    FROM READ_PARQUET('trips_xbv.pq');

COPY(
    SELECT *
    FROM trips
) TO 'trips_xbv_ts.pq' (FORMAT 'PARQUET',
                        CODEC  'ZSTD',
                        ROW_GROUP_SIZE 15000);
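
For what it's worth, the three conversions above could have been wrapped up in a loop along the lines of the untested sketch below, which also collapses the intermediate trips table into a single COPY.

$ for STEM in trips_xbj_xas trips_xbs trips_xbv; do
    ~/duckdb_source/build/release/duckdb \
        -c "SET preserve_insertion_order=false;
            SET memory_limit='8GB';

            COPY(
                SELECT * EXCLUDE (pickup_datetime, dropoff_datetime),
                       MAKE_TIMESTAMP(pickup_datetime * 1000000) AS pickup_datetime,
                       MAKE_TIMESTAMP(dropoff_datetime * 1000000) AS dropoff_datetime
                FROM READ_PARQUET('$STEM.pq')
            ) TO '${STEM}_ts.pq' (FORMAT 'PARQUET',
                                  CODEC  'ZSTD',
                                  ROW_GROUP_SIZE 15000);"
  done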

Double Checking

There should be 1,086,709,191 yellow cab records and 26,943,827 green cab records in a complete 1.1 billion taxi rides dataset. I ran the following to check that this was the case with these Parquet files. Mark Raasveldt fixed a progress bar bug a few days ago so I'll use the newly compiled version of DuckDB from here on.

$ ~/duckdb_source/build/release/duckdb
SELECT   cab_type,
         COUNT(*)
FROM     READ_PARQUET('trips_x*.pq')
GROUP BY cab_type;
┌──────────┬──────────────┐
│ cab_type │ count_star() │
│ varchar  │    int64     │
├──────────┼──────────────┤
│ yellow   │   1086709191 │
│ green    │     26943827 │
└──────────┴──────────────┘

The Parquet Benchmark

I copied the Parquet files onto the 5th-generation SSD before running this benchmark. The following were the fastest times I saw after running each query multiple times on the Parquet files.

The following completed in 36.942 seconds.

SELECT   cab_type,
         COUNT(*)
FROM     READ_PARQUET('trips_x*.pq')
GROUP BY cab_type;

The following completed in 51.085 seconds.

SELECT   passenger_count,
         AVG(total_amount)
FROM     READ_PARQUET('trips_x*.pq')
GROUP BY passenger_count;

The following completed in 60.656 seconds.

SELECT   passenger_count,
         DATE_PART('year', pickup_datetime) AS year,
         COUNT(*)
FROM     READ_PARQUET('trips_x*.pq')
GROUP BY passenger_count,
         year;

The following completed in 92.124 seconds.

SELECT   passenger_count,
         DATE_PART('year', pickup_datetime) AS year,
         ROUND(trip_distance) AS distance,
         COUNT(*)
FROM     READ_PARQUET('trips_x*.pq')
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

An Alternative Code Path

After the initial publication of this post, Mark Raasveldt reached out and let me know that the COPY command uses an alternative code path to READ_CSV() and that loading the dataset into DuckDB's internal format via this method would be both more reliable and result in faster benchmark numbers.

I removed any .csv.gz files from the SATA-connected SSD that weren't in the original dataset and made sure only the 56 original files remained to avoid record duplication. I then ran the following.

$ touch working.duckdb
$ rm working.duckdb

$ ~/duckdb_source/build/release/duckdb \
    working.duckdb \
    < create.sql

$ ~/duckdb_source/build/release/duckdb \
    working.duckdb \
    -c "COPY trips FROM 'trips*.csv.gz';"

RAM consumption quickly grew to 22 GB before levelling off. The SSD was writing at 50-150 MB/s and usually reading at 5-10 MB/s while the CPU was almost idle at 5-10%. The job ran for ~3-4 hours, reached 91% and was then abruptly killed. I'm not sure why this happened as there was ample RAM available on the system.

During those ~3.5 hours there were sustained writes of ~70 MB/s on average, which works out to roughly 860 GB written to the SSD (~70 MB/s x ~12,600 seconds) despite the DuckDB file only growing to ~90 GB. Given that SSDs wear out with writes, this is something to take into consideration if you're using your own hardware.

I made a second attempt at using the COPY command. This time, I imported files one at a time. This job ended up finishing in 75 minutes. The RAM consumption only peaked around 6 GB and the CPU hovered between 15-30% during this time.

$ touch working.duckdb
$ rm working.duckdb

$ ~/duckdb_source/build/release/duckdb \
    working.duckdb \
    < create.sql

$ for FILENAME in trips_*.csv.gz; do
    echo `date`, $FILENAME
    ~/duckdb_source/build/release/duckdb \
        -c "COPY trips FROM '$FILENAME';" \
        working.duckdb
  done

The resulting DuckDB file was 113 GB. I checked the record count before copying it over to the 5th-generation SSD.

SELECT   cab_type,
         COUNT(*)
FROM     trips
GROUP BY cab_type;
┌──────────┬──────────────┐
│ cab_type │ count_star() │
│ varchar  │    int64     │
├──────────┼──────────────┤
│ green    │     26943827 │
│ yellow   │   1086709191 │
└──────────┴──────────────┘

One interesting aspect of DuckDB is its wide variety of compression schemes. Some of these, like Adaptive Lossless Floating-Point Compression (ALP), are unique to DuckDB. Any one field can have several of these schemes applied across its row groups; DuckDB automatically decides which scheme to use on a per-row-group basis.

Below are the row-group counts for each compression scheme for each of the fields in the trips table.

WITH pivot_alias AS (
    PIVOT    PRAGMA_STORAGE_INFO('trips')
    ON       compression
    USING    COUNT(*)
    GROUP BY column_name,
             column_id
    ORDER BY column_id
)
SELECT * EXCLUDE(column_id)
FROM pivot_alias;
┌───────────────────────┬───────┬───────┬────────────┬──────────┬────────────┬───────┬───────┬──────────────┐
│      column_name      │  ALP  │ ALPRD │ BitPacking │ Constant │ Dictionary │ FSST  │  RLE  │ Uncompressed │
│        varchar        │ int64 │ int64 │   int64    │  int64   │   int64    │ int64 │ int64 │    int64     │
├───────────────────────┼───────┼───────┼────────────┼──────────┼────────────┼───────┼───────┼──────────────┤
│ trip_id               │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ vendor_id             │     0 │     0 │          0 │     9426 │       9425 │     1 │     0 │            0 │
│ pickup_datetime       │     0 │     0 │      28218 │     9426 │          0 │     0 │     0 │            0 │
│ dropoff_datetime      │     0 │     0 │      28218 │     9426 │          0 │     0 │     0 │            0 │
│ store_and_fwd_flag    │     0 │     0 │          0 │      888 │       9283 │   143 │     0 │         8538 │
│ rate_code_id          │     0 │     0 │        439 │    10965 │          0 │     0 │  7445 │            3 │
│ pickup_longitude      │ 18636 │     0 │          0 │        0 │          0 │     0 │     0 │         9426 │
│ pickup_latitude       │ 15872 │  2663 │          0 │        0 │          0 │     0 │     0 │         9426 │
│ dropoff_longitude     │ 19353 │     0 │          0 │        0 │          0 │     0 │     0 │         9426 │
│ dropoff_latitude      │ 16789 │  2663 │          0 │        0 │          0 │     0 │     0 │         9426 │
│ passenger_count       │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ trip_distance         │  9426 │     0 │          0 │     9426 │          0 │     0 │     0 │            0 │
│ fare_amount           │  9426 │     0 │          0 │     9426 │          0 │     0 │     0 │            0 │
│ extra                 │  9426 │     0 │          0 │     9426 │          0 │     0 │     0 │            0 │
│ mta_tax               │   197 │     0 │          0 │    10085 │          0 │     0 │  8180 │          390 │
│ tip_amount            │  9426 │     0 │          0 │     9426 │          0 │     0 │     0 │            0 │
│ tolls_amount          │  5907 │     0 │          0 │     9426 │          0 │     0 │  3519 │            0 │
│ ehail_fee             │     0 │     0 │          0 │    18852 │          0 │     0 │     0 │            0 │
│ improvement_surcharge │   192 │     0 │          0 │    18102 │          0 │     0 │   549 │            9 │
│ total_amount          │  9427 │     0 │          0 │     9426 │          0 │     0 │     0 │            0 │
│ payment_type          │     0 │     0 │          0 │     9426 │       9425 │     1 │     0 │            0 │
│ trip_type             │     0 │     0 │          0 │     9340 │       9416 │    10 │     0 │           86 │
│ pickup                │     0 │     0 │          0 │        0 │       8798 │   628 │     0 │         9426 │
│ dropoff               │     0 │     0 │          0 │        0 │       8311 │  1115 │     0 │         9426 │
│ cab_type              │     0 │     0 │          0 │     9426 │       9426 │     0 │     0 │            0 │
│ precipitation         │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ snow_depth            │     0 │     0 │       2362 │    16249 │          0 │     0 │   241 │            0 │
│ snowfall              │     0 │     0 │       2262 │    15614 │          0 │     0 │   976 │            0 │
│ max_temperature       │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ min_temperature       │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ average_wind_speed    │     0 │     0 │       9426 │     9426 │          0 │     0 │     0 │            0 │
│ pickup_nyct2010_gid   │     0 │     0 │       9426 │        0 │          0 │     0 │     0 │         9426 │
│ pickup_ctlabel        │     0 │     0 │          0 │        0 │       7685 │  1741 │     0 │         9426 │
│ pickup_borocode       │     0 │     0 │       9426 │        0 │          0 │     0 │     0 │         9426 │
│ pickup_boroname       │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ pickup_ct2010         │     0 │     0 │          0 │        0 │       9323 │   187 │     0 │         9426 │
│ pickup_boroct2010     │     0 │     0 │      18354 │        0 │          0 │     0 │   235 │         9426 │
│ pickup_cdeligibil     │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ pickup_ntacode        │     0 │     0 │          0 │        0 │       9403 │    40 │     0 │         9426 │
│ pickup_ntaname        │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ pickup_puma           │     0 │     0 │          0 │        0 │       9424 │     3 │     0 │         9426 │
│ dropoff_nyct2010_gid  │     0 │     0 │        647 │        0 │          0 │     0 │  8779 │         9426 │
│ dropoff_ctlabel       │     0 │     0 │          0 │        0 │       6843 │  2583 │     0 │         9426 │
│ dropoff_borocode      │     0 │     0 │       9426 │        0 │          0 │     0 │     0 │         9426 │
│ dropoff_boroname      │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ dropoff_ct2010        │     0 │     0 │          0 │        0 │       9221 │   369 │     0 │         9426 │
│ dropoff_boroct2010    │     0 │     0 │          0 │        0 │       9301 │   247 │     0 │         9426 │
│ dropoff_cdeligibil    │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ dropoff_ntacode       │     0 │     0 │          0 │        0 │       9407 │    31 │     0 │         9426 │
│ dropoff_ntaname       │     0 │     0 │          0 │        0 │       9426 │     0 │     0 │         9426 │
│ dropoff_puma          │     0 │     0 │          0 │        0 │       9421 │     8 │     0 │         9426 │
├───────────────────────┴───────┴───────┴────────────┴──────────┴────────────┴───────┴───────┴──────────────┤
│ 51 rows                                                                                         9 columns │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘

The Internal Format Benchmark

The following were the fastest times I saw after running each query multiple times on the DuckDB table.

The following completed in 0.498 seconds.

SELECT   cab_type,
         COUNT(*)
FROM     trips
GROUP BY cab_type;

The following completed in 0.234 seconds.

SELECT   passenger_count,
         AVG(total_amount)
FROM     trips
GROUP BY passenger_count;

The following completed in 0.734 seconds.

SELECT   passenger_count,
         DATE_PART('year', pickup_datetime) AS year,
         COUNT(*)
FROM     trips
GROUP BY passenger_count,
         year;

The following completed in 1.334 seconds.

SELECT   passenger_count,
         DATE_PART('year', pickup_datetime) AS year,
         ROUND(trip_distance) AS distance,
         COUNT(*)
FROM     trips
GROUP BY passenger_count,
         year,
         distance
ORDER BY year,
         count(*) DESC;

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.
