Over the past few months I've been benchmarking a dataset of 1.1 billion taxi journeys made in New York City over a six-year period on a number of data stores and cloud services. Among these has been AWS Redshift. Up until now I've been using their free tier and single-compute-node clusters. While these make data warehousing very inexpensive, they aren't representative of the query speeds that larger Redshift clusters can achieve.
In this post I'll be looking at how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records from the dataset I've put together.
Optimising Source Data
I'll be using the 56 GZIP'ed CSV files I originally put together for my Billion Taxi Rides on Redshift blog post a few months back. The cluster will have 6 compute nodes with 16 slices each, so 96 files (or a multiple of 96) will ensure every slice in the cluster is put to work during the load. To that end I'll re-package the 56 GZIP files as 96 smaller ones.
In the ~/taxi-trips folder on my system I have the original 56 files.
$ ls -lh ~/taxi-trips
-rw-rw-r-- 1 mark mark 1.9G Jun 1 10:23 trips_xaa.csv.gz
-rw-rw-r-- 1 mark mark 1.9G Jun 1 10:23 trips_xab.csv.gz
...
-rw-rw-r-- 1 mark mark 1.3G Jun 1 10:48 trips_xcd.csv.gz
The 56 GZIP files contain around 20 million lines of data each, roughly 1.1 billion lines in total. Splitting at 11.7 million lines per file will spread the data fairly evenly across 96 files.
$ cd ~/taxi-trips
$ mkdir 96
$ gunzip -c trips_x*.csv.gz | \
split -l 11700000 \
--filter="gzip > 96/trips_96_\$FILE.csv.gz"
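Before uploading anything it's worth a quick sanity check that the split behaved itself. Something along these lines should show 96 files of roughly even size:
$ ls ~/taxi-trips/96/ | wc -l     # expecting 96 files
$ ls -lh ~/taxi-trips/96/ | head  # sizes should be roughly uniform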
Bandwidth-Saturating S3 Uploads
Once those files were generated I uploaded them to an S3 bucket. Amazon were kind enough to point out a change to my AWS CLI configuration that would help saturate my upload bandwidth. The following will allow for 100 concurrent requests to run when interacting with S3.
$ aws configure set \
default.s3.max_concurrent_requests \
100
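If you want to confirm the setting stuck, the S3 transfer options live in the AWS CLI's configuration file under the default profile's s3 section.
$ cat ~/.aws/config   # max_concurrent_requests should now read 100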
The following will sync the 96 folder with the csv folder on S3.
$ aws s3 sync \
~/taxi-trips/96/ \
s3://<s3_bucket>/csv/
I monitored the transmission speed on my main Ethernet interface during the transfer and it sent at a solid 63 Mbit/s, with at most 2 Mbit/s of deviation either way, for the entirety of the upload. I thought my Internet package entitled me to 50 Mbit/s up so this surprised me. The upload duration, however, was 39 seconds shy of five hours flat, so going by wall clock time the effective upload rate was ~46 Mbit/s. That's still a huge improvement over the 17.38 Mbit/s upload speed I saw a few weeks ago but it's a shame to see so much overhead.
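For reference, the ~46 Mbit/s figure is just the payload divided by the wall clock time. Assuming the ~104 GB of GZIP'ed CSVs mentioned in the costs section below and an upload time of 4 hours, 59 minutes and 21 seconds, the arithmetic looks like this:
$ echo "scale=1; 104 * 8 * 1000 / (5 * 3600 - 39)" | bc   # Mbit/s
46.3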
For the record I was using version 1.10.34 of the AWS CLI tool which was released 5 days ago as of this writing.
$ virtualenv aws
$ source aws/bin/activate
$ pip install --upgrade awscli
$ aws --version
aws-cli/1.10.34 Python/2.7.6 Linux/3.19.0-25-generic botocore/1.4.24
I also made sure the SSD drive the data was stored on was performing well enough not to be a bottleneck by running a byte count on a 1.9 GB file on the drive.
$ time (cat trips_xaa.csv.gz | wc -c)
2024092233
real 0m1.672s
user 0m0.000s
sys 0m1.668s
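That's 2,024,092,233 bytes read in 1.672 seconds, or roughly 1.2 GB/s of sequential reads, far beyond what my uplink could consume. The same sum via bc:
$ echo "scale=2; 2024092233 / 1.672 / 10^9" | bc   # GB/s
1.21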
Building a Manifest File
Redshift will need a manifest file stating where all 96 GZIP files are being kept.
$ vi trips.manifest
{
"entries": [
{"url": "s3://<s3_bucket>/csv/trips_96_xaa.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xab.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xac.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xad.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xae.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xag.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xah.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xai.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xak.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xal.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xam.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xan.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xao.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xap.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xar.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xas.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xat.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xau.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xav.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xax.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xay.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xaz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xba.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbe.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbt.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xby.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xbz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xca.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xce.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xch.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xci.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xck.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xco.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcr.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcs.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xct.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcu.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcv.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcw.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcx.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcy.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xcz.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xda.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdb.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdc.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdd.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xde.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdf.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdg.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdh.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdi.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdj.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdk.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdl.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdm.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdn.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdo.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdp.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdq.csv.gz", "mandatory": true},
{"url": "s3://<s3_bucket>/csv/trips_96_xdr.csv.gz", "mandatory": true}
]
}
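For what it's worth, the 96 entries don't need to be typed out by hand. Here's a rough sketch that builds the same manifest from the local file listing (assuming the files are still sitting in ~/taxi-trips/96/ and the bucket placeholder is filled in):
$ (echo '{"entries": ['
   ls ~/taxi-trips/96/*.csv.gz | \
       xargs -n1 basename | \
       sed 's|.*|  {"url": "s3://<s3_bucket>/csv/&", "mandatory": true},|' | \
       sed '$ s/,$//'
   echo ']}') > trips.manifest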
The following will upload the manifest file to the S3 bucket.
$ aws s3 cp \
trips.manifest \
s3://<s3_bucket>/csv/
Redshift Up & Running
For this benchmark I will be using a 6-node ds2.8xlarge Redshift cluster. The 6 nodes refer to compute nodes and don't include the leader node in that count. Combined, the cluster's compute resources come to 216 vCPUs, 1,464 GB of RAM, 96 TB of mechanical storage and 19.8 GB/s of I/O throughput.
It took 20 minutes for the cluster to launch in the eu-west-1b zone. This was conducted towards the end of the European work day.
It's important to note that when you launch the cluster you do so within a VPC. VPC environments have serious networking improvements over the classic EC2 environment.
Loading Data into Redshift
I'll set environment variables containing the username and password for the cluster that I defined when I launched it.
$ read MASTER_USERNAME
$ read MASTER_PASSWORD
$ export MASTER_USERNAME
$ export MASTER_PASSWORD
I'll then use PostgreSQL's CLI tool to connect to the cluster.
$ PGPASSWORD=$MASTER_PASSWORD \
psql -h trips.c6rssls4aefs.eu-west-1.redshift.amazonaws.com \
-p 5439 \
-U $MASTER_USERNAME trips
The following is the schema for the trips table. The encodings used were suggested by Redshift's ANALYZE COMPRESSION command. The data types on each column were picked to be as granular as the underlying data would allow. The original dataset does have some data quality issues so many of the DECIMAL fields allow for a much wider range of values than would be expected in a well-groomed dataset. The sort key is left unencoded (RAW), which is in line with Amazon's guidance.
CREATE TABLE trips (
trip_id INTEGER NOT NULL DISTKEY ENCODE LZO,
vendor_id VARCHAR(3) ENCODE LZO,
-- Sort keys shouldn't be encoded (compressed)
pickup_datetime TIMESTAMP NOT NULL,
dropoff_datetime TIMESTAMP NOT NULL ENCODE LZO,
store_and_fwd_flag VARCHAR(1) ENCODE RUNLENGTH,
rate_code_id SMALLINT ENCODE LZO,
pickup_longitude DECIMAL(18,14) ENCODE MOSTLY8,
pickup_latitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_longitude DECIMAL(18,14) ENCODE MOSTLY8,
dropoff_latitude DECIMAL(18,14) ENCODE MOSTLY8,
passenger_count SMALLINT NOT NULL DEFAULT '0' ENCODE LZO,
trip_distance DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
fare_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
extra DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
mta_tax DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tip_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
tolls_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
ehail_fee DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
improvement_surcharge DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
total_amount DECIMAL(18,6) DEFAULT '0.0' ENCODE MOSTLY8,
payment_type VARCHAR(3) ENCODE RUNLENGTH,
trip_type SMALLINT ENCODE LZO,
pickup VARCHAR(50) ENCODE LZO,
dropoff VARCHAR(50) ENCODE LZO,
cab_type VARCHAR(6) NOT NULL ENCODE LZO,
precipitation SMALLINT DEFAULT '0' ENCODE LZO,
snow_depth SMALLINT DEFAULT '0' ENCODE LZO,
snowfall SMALLINT DEFAULT '0' ENCODE LZO,
max_temperature SMALLINT DEFAULT '0' ENCODE LZO,
min_temperature SMALLINT DEFAULT '0' ENCODE LZO,
average_wind_speed SMALLINT DEFAULT '0' ENCODE LZO,
pickup_nyct2010_gid SMALLINT ENCODE LZO,
pickup_ctlabel VARCHAR(10) ENCODE LZO,
pickup_borocode SMALLINT ENCODE LZO,
pickup_boroname VARCHAR(13) ENCODE LZO,
pickup_ct2010 VARCHAR(6) ENCODE LZO,
pickup_boroct2010 VARCHAR(7) ENCODE LZO,
pickup_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
pickup_ntacode VARCHAR(4) ENCODE LZO,
pickup_ntaname VARCHAR(56) ENCODE LZO,
pickup_puma VARCHAR(4) ENCODE LZO,
dropoff_nyct2010_gid SMALLINT ENCODE LZO,
dropoff_ctlabel VARCHAR(10) ENCODE LZO,
dropoff_borocode SMALLINT ENCODE LZO,
dropoff_boroname VARCHAR(13) ENCODE LZO,
dropoff_ct2010 VARCHAR(6) ENCODE LZO,
dropoff_boroct2010 VARCHAR(7) ENCODE LZO,
dropoff_cdeligibil VARCHAR(1) ENCODE RUNLENGTH,
dropoff_ntacode VARCHAR(4) ENCODE LZO,
dropoff_ntaname VARCHAR(56) ENCODE LZO,
dropoff_puma VARCHAR(4) ENCODE LZO,
primary key(trip_id)
) sortkey(pickup_datetime);
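If you want to double-check the encodings, Redshift will suggest one per column itself once the table holds data; after the COPY further down has finished, the following reports a recommended encoding for every column of the trips table.
ANALYZE COMPRESSION trips;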
The default WLM queue has five query slots, so the following lets my session use all five of them, giving the COPY below access to the queue's entire memory allocation rather than a single slot's share.
set wlm_query_slot_count to 5;
The following will tell Redshift to load all the data from the files on S3 into the trips table.
COPY trips
FROM 's3://<s3_bucket>/csv/trips.manifest'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
EMPTYASNULL
ESCAPE
GZIP
MANIFEST
MAXERROR 100000
REMOVEQUOTES
TRIMBLANKS
TRUNCATECOLUMNS;
The above took 11 minutes and 13 seconds to complete.
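Before benchmarking it's worth confirming the load was clean. MAXERROR is set to 100,000 above so COPY will skip problem rows rather than fail; a row count plus a look at Redshift's stl_load_errors system table shows whether anything significant was dropped.
SELECT count(*) FROM trips;

SELECT line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;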
Benchmarking Redshift
I ran each of the queries below multiple times in a row. The times reported were the fastest achieved.
\timing on
set wlm_query_slot_count to 5;
The following completed in 1.56 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY 1;
The following completed in 1.25 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY 1;
The following completed in 2.25 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
count(*)
FROM trips
GROUP BY 1,
2;
The following completed in 2.97 seconds.
SELECT passenger_count,
extract(year from pickup_datetime),
round(trip_distance),
count(*)
FROM trips
GROUP BY 1,
2,
3
ORDER BY 2,
4 desc;
Costs
Storing 104 GB of data on S3 would cost $3.12 if I were to leave the CSV data there for a month. In this case the data was only there for a few hours so the cost was substantially less.
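That monthly figure works out to a storage rate of $0.03 per GB:
$ echo "scale=2; 104 * 0.03" | bc
3.12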
The 6-node ds2.8xlarge Redshift cluster in eu-west-1 runs at $45.60 / hour as of this writing. Interestingly I got this message when I launched the cluster:
Applicable charges: The on-demand hourly rate for this cluster will be $45.60, or $7.60 / node.
The Redshift console reports 1 leader node (complete with a public and private IP address) and 6 compute nodes. I guess the leader node was on the house; Amazon doesn't bill for the leader node on multi-node clusters.
The hourly rate is based on the hardware you're allocating, so if you have a lot of analysts running ad-hoc queries, or a large number of distinct queries in your reporting jobs, your per-query costs should be pretty low.
The other thing to consider is that the cluster took 20 minutes to launch and the data took just over 11 minutes to load into Redshift. You could automate the setup, loading and tear-down of the cluster with a tool like Airflow. In that sort of scenario a job could launch the cluster at 8:30am each weekday, run the reporting jobs and then leave the cluster up for 6 to 8 hours for any ad-hoc queries. That works out to about 40 hours of usage a week instead of 168, shaving roughly 76% off your Redshift bill. Redshift supports snapshots as well if you don't want to use S3 as your only warm data store.
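To put the 40-versus-168-hour comparison into dollars: at $45.60 / hour a 40-hour week comes to $1,824 versus $7,660.80 for running around the clock.
$ echo "scale=2; 40 * 45.60" | bc                # weekly cost at 40 hours
1824.00
$ echo "scale=1; 100 - 100 * 40 / 168" | bc      # percentage saved versus 24/7
76.2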