Over the past few months I've been benchmarking a dataset of 1.1 billion taxi journeys made in New York City over a six-year period on a number of data stores and cloud services. Among these has been AWS Redshift. Up until now I've been using their free tier and single-compute-node clusters. While these make data warehousing very inexpensive, they aren't representative of the query speeds that larger Redshift clusters can achieve.
In this post I'll be looking at how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records from the dataset I've put together.
Optimising Source Data
I'll be using the 56 GZIP'ed CSV files I originally put together for my Billion Taxi Rides on Redshift blog post a few months back. The cluster will have 6 compute nodes with 16 slices each, so 96 files (or a multiple of 96) will ensure every slice in the cluster is put to work during the load. To that end I'll re-package the 56 GZIP files as 96 smaller ones.
In the ~/taxi-trips folder on my system I have the original 56 files.
$ ls -lh ~/taxi-trips
-rw-rw-r-- 1 mark mark 1.9G Jun 1 10:23 trips_xaa.csv.gz
-rw-rw-r-- 1 mark mark 1.9G Jun 1 10:23 trips_xab.csv.gz
...
-rw-rw-r-- 1 mark mark 1.3G Jun 1 10:48 trips_xcd.csv.gz
The 56 GZIP files contain around 20 million lines of data each, roughly 1.1 billion lines in total. Splitting at 11.7 million lines per file will spread the data fairly evenly across 96 files.
$ cd ~/taxi-trips
$ mkdir 96
$ gunzip -c trips_x*.csv.gz | \
split -l 11700000 \
--filter="gzip > 96/trips_96_\$FILE.csv.gz"
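Before uploading anything it's worth a quick sanity check that the split behaved itself. Something along these lines should show 96 files of roughly even size:
$ ls ~/taxi-trips/96/ | wc -l     # expecting 96 files
$ ls -lh ~/taxi-trips/96/ | head  # sizes should be roughly uniform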
Bandwidth-Saturating S3 Uploads
Once those files were generated I uploaded them to an S3 bucket. Amazon were kind enough to point out a change to my AWS CLI configuration that would help saturate my upload bandwidth. The following will allow for 100 concurrent requests to run when interacting with S3.
$ aws configure set \
default.s3.max_concurrent_requests \
100
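If you want to confirm the setting stuck, the S3 transfer options live in the AWS CLI's configuration file under the default profile's s3 section.
$ cat ~/.aws/config   # max_concurrent_requests should now read 100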
The following will sync the 96 folder with the csv folder on S3.
$ aws s3 sync \
~/taxi-trips/96/ \
s3://<s3_bucket>/csv/
I monitored the transmission speed on my main Ethernet interface during the transfer and it sent at a solid 63 Mbit/s, with at most 2 Mbit/s of deviation either way, for the entirety of the upload. I thought my Internet package entitled me to 50 Mbit/s up so this surprised me. The upload duration, however, was 39 seconds shy of five hours flat, so going by wall clock time the effective upload rate was ~46 Mbit/s. That's still a huge improvement over the 17.38 Mbit/s upload speed I saw a few weeks ago but it's a shame to see so much overhead.
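For reference, the ~46 Mbit/s figure is just the payload divided by the wall clock time. Assuming the ~104 GB of GZIP'ed CSVs mentioned in the costs section below and an upload time of 4 hours, 59 minutes and 21 seconds, the arithmetic looks like this:
$ echo "scale=1; 104 * 8 * 1000 / (5 * 3600 - 39)" | bc   # Mbit/s
46.3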
For the record I was using version 1.10.34 of the AWS CLI tool which was released 5 days ago as of this writing.
$ virtualenv aws
$ source aws/bin/activate
$ pip install --upgrade awscli
$ aws --version
aws-cli/1.10.34 Python/2.7.6 Linux/3.19.0-25-generic botocore/1.4.24
I also made sure the SSD drive the data was stored on was performing well enough not to be a bottleneck by running a byte count on a 1.9 GB file on the drive.
$ time (cat trips_xaa.csv.gz | wc -c)
2024092233
real 0m1.672s
user 0m0.000s
sys 0m1.668s
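That's 2,024,092,233 bytes read in 1.672 seconds, or roughly 1.2 GB/s of sequential reads, far beyond what my uplink could consume. The same sum via bc:
$ echo "scale=2; 2024092233 / 1.672 / 10^9" | bc   # GB/s
1.21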
Building a Manifest File
Redshift will need a manifest file stating where all 96 GZIP files are being kept.
$ vi trips.manifest
{
"entries": [
{"url": "s3://<s3_bucket>/csv/trips_96_xaa.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xab.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xac.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xad.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xae.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xag.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xah.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xai.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xak.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xal.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xam.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xan.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xao.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xap.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xar.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xas.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xat.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xau.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xav.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xax.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xay.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xba.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbe.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbt.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xby.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xca.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xce.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xch.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xci.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xck.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xco.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xct.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcy.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xda.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xde.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdr.csv.gz", "mandatory": true}
]
}
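For what it's worth, the 96 entries don't need to be typed out by hand. Here's a rough sketch that builds the same manifest from the local file listing (assuming the files are still sitting in ~/taxi-trips/96/ and the bucket placeholder is filled in):
$ (echo '{"entries": ['
   ls ~/taxi-trips/96/*.csv.gz | \
       xargs -n1 basename | \
       sed 's|.*|  {"url": "s3://<s3_bucket>/csv/&", "mandatory": true},|' | \
       sed '$ s/,$//'
   echo ']}') > trips.manifest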
The following will upload the manifest file to the S3 bucket.
$ aws s3 cp \
trips.manifest \
s3://<s3_bucket>/csv/
Redshift Up & Running
For this benchmark I will be using a 6-node ds2.8xlarge Redshift cluster. The 6 nodes refer to compute nodes and don't include the leader node in that count. Combined, the cluster's compute resources come to 216 vCPUs, 1,464 GB of RAM, 96 TB of mechanical storage and 19.8 GB/s of I/O throughput.
It took 20 minutes for the cluster to launch in the eu-west-1b zone. This was conducted towards the end of the European work day.
It's important to note that when you launch the cluster you do so within a VPC. VPC environments have serious networking improvements over the classic EC2 environment.
Loading Data into Redshift
I'll set environment variables containing the username and password for the cluster that I defined when I launched it.
$ read MASTER_USERNAME
$ read MASTER_PASSWORD
$ export MASTER_USERNAME
$ export MASTER_PASSWORD
I'll then use PostgreSQL's CLI tool to connect to the cluster.
$ PGPASSWORD=$MASTER_PASSWORD \
psql -h trips.c6rssls4aefs.eu-west-1.redshift.amazonaws.com \
-p 5439 \
-U $MASTER_USERNAME trips
The following is the schema for the trips table. The encodings used were suggested by Redshift's ANALYZE COMPRESSION command. The data types on each column were picked to be as granular as the underlying data would allow. The original dataset does have some data quality issues so many of the DECIMAL fields allow for a much wider range of values than would be expected in a well-groomed dataset. The sort key is left unencoded (RAW), which is in line with Amazon's guidance.
CREATE TABLE trips (
trip_id INTEGER NOT NULL DISTKEY ENCODE LZO,
vendor_id VARCHAR(3) ENCODE LZO,
-- Sort keys shouldn't be encoded (compressed)
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO,
store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH,
rate_code_id SMALLINT ENCODE LZO,
pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8,
pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8,
passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO,
trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
payment_type VARCHAR(3) ENCODE RUNLENGTH,
trip_type SMALLINT ENCODE LZO,
pickup VARCHAR(50) ENCODE LZO,
dropoff VARCHAR(50) ENCODE LZO,
cab_type VARCHAR(6) NOT NULL ENCODE LZO,
precipitation SMALLINT DEFAULT '0' ENCODE LZO,
snow_depth SMALLINT DEFAULT '0' ENCODE LZO,
snowfall SMALLINT DEFAULT '0' ENCODE LZO,
max_temperature SMALLINT DEFAULT '0' ENCODE LZO,
min_temperature SMALLINT DEFAULT '0' ENCODE LZO,
average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO,
pickup_nyct2010_gid SMALLINT ENCODE LZO,
pickup_ctlabel VARCHAR(10) ENCODE LZO,
pickup_borocode SMALLINT ENCODE LZO,
pickup_boroname VARCHAR(13) ENCODE LZO,
pickup_ct2010 VARCHAR(6) ENCODE LZO,
pickup_boroct2010 VARCHAR(7) ENCODE LZO,
pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
pickup_ntacode VARCHAR(4) ENCODE LZO,
pickup_ntaname VARCHAR(56) ENCODE LZO,
pickup_puma VARCHAR(4) ENCODE LZO,
dropoff_nyct2010_gid SMALLINT ENCODE LZO,
dropoff_ctlabel VARCHAR(10) ENCODE LZO,
dropoff_borocode SMALLINT ENCODE LZO,
dropoff_boroname VARCHAR(13) ENCODE LZO,
dropoff_ct2010 VARCHAR(6) ENCODE LZO,
dropoff_boroct2010 VARCHAR(7) ENCODE LZO,
dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
dropoff_ntacode VARCHAR(4) ENCODE LZO,
dropoff_ntaname VARCHAR(56) ENCODE LZO,
dropoff_puma VARCHAR(4) ENCODE LZO,
primary key(trip_id)
) sortkey(pickup_datetime);
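If you want to double-check the encodings, Redshift will suggest one per column itself once the table holds data; after the COPY further down has finished, the following reports a recommended encoding for every column of the trips table.
ANALYZE COMPRESSION trips;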
The default WLM queue has five query slots, so the following lets my session use all five of them, giving the COPY below access to the queue's entire memory allocation rather than a single slot's share.
set wlm_query_slot_count to 5;
The following will tell Redshift to load all the data from the files on S3 into the trips table.
COPY trips
FROM 's3://<s3_bucket>/csv/trips.manifest'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 100000
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;
The above took 11 minutes and 13 seconds to complete.
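Before benchmarking it's worth confirming the load was clean. MAXERROR is set to 100,000 above so COPY will skip problem rows rather than fail; a row count plus a look at Redshift's stl_load_errors system table shows whether anything significant was dropped.
SELECT count(*) FROM trips;

SELECT line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;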
Benchmarking Redshift
I ran each of the queries below multiple times in a row. The times reported were the fastest achieved.
\timing on
set wlm_query_slot_count to 5;
The following completed in 1.56 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY 1;
The following completed in 1.25 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY 1;
The following completed in 2.25 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
count(*)
FROM trips
GROUP BY 1,
2;
The following completed in 2.97 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
round(trip_distance),
count(*)
FROM trips
GROUP BY 1,
2,
3
ORDER BY 2,
4 desc;
Costs
Storing 104 GB of data on S3 would cost $3.12 if I were to leave the CSV data there for a month. In this case the data was only there for a few hours so the cost was substantially less.
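That monthly figure works out to a storage rate of $0.03 per GB:
$ echo "scale=2; 104 * 0.03" | bc
3.12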
The 6-node ds2.8xlarge Redshift cluster in eu-west-1 runs at $45.60 / hour as of this writing. Interestingly I got this message when I launched the cluster:
Applicable charges: The on-demand hourly rate for this cluster will be $45.60, or $7.60 / node.
The Redshift console reports 1 leader node (complete with a public and private IP address) and 6 compute nodes. I guess the leader node was on the house; Amazon doesn't bill for the leader node on multi-node clusters.
The hourly rate is based on the hardware you're allocating, so if you have a lot of analysts running ad-hoc queries, or a large number of distinct queries in your reporting jobs, your per-query costs should be pretty low.
The other thing to consider is that the cluster took 20 minutes to launch and the data took just over 11 minutes to load into Redshift. You could automate the setup, loading and tear-down of the cluster with a tool like Airflow. In that sort of scenario a job could launch the cluster at 8:30am each weekday, run the reporting jobs and then leave the cluster up for 6 to 8 hours for any ad-hoc queries. That works out to about 40 hours of usage a week instead of 168, shaving roughly 76% off your Redshift bill. Redshift supports snapshots as well if you don't want to use S3 as your only warm data store.
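To put the 40-versus-168-hour comparison into dollars: at $45.60 / hour a 40-hour week comes to $1,824 versus $7,660.80 for running around the clock.
$ echo "scale=2; 40 * 45.60" | bc                # weekly cost at 40 hours
1824.00
$ echo "scale=1; 100 - 100 * 40 / 168" | bc      # percentage saved versus 24/7
76.2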