
Posted on Wed 22 June 2016 under Databases

1.2 Billion Taxi Rides on AWS RDS running PostgreSQL

On November 17th, 2015, Todd Schneider published a blog post titled Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance in which he analysed the metadata of 1.1 billion taxi journeys made in New York City between 2009 and 2015. Included with this work was a link to a GitHub repository where he published the SQL, Shell and R files he used, along with instructions on how to get everything up and running. The R files also create a few additional charts that were used in follow-up posts.

In this blog post I'll launch 4 different types of AWS RDS instances running PostgreSQL 9.5.2 and benchmark how long it takes each to produce the same graphs Todd Schneider created in his analysis.

PostgreSQL on AWS RDS Up & Running

I'll launch 4 RDS instances. They will all be running PostgreSQL 9.5.2 in eu-west-1a. Each will have 400 GB of General Purpose SSD capacity and will not be replicated across additional availability zones. Their maintenance windows have been set so they will not run during this benchmark. They will each have their own VPC.

The instance types are as follows:

  • DB1 is a db.t2.large which costs $5.50 / day inc. storage costs for RDS alone.
  • DB2 is a db.r3.large which costs $8.52 / day inc. storage costs for RDS alone.
  • DB3 is a db.m4.large which costs $6.60 / day inc. storage costs for RDS alone.
  • DB4 is a db.m4.xlarge which costs $11.53 / day inc. storage costs for RDS alone.

The instances each took 5 minutes to launch.
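
For reference, an RDS instance like DB4 could be launched with the AWS CLI along the following lines. This is a minimal sketch; the instance identifier, master username and password below are placeholders.

$ aws rds create-db-instance \
      --db-instance-identifier trips-db4 \
      --db-instance-class db.m4.xlarge \
      --engine postgres \
      --engine-version 9.5.2 \
      --allocated-storage 400 \
      --storage-type gp2 \
      --db-name trips \
      --master-username trips \
      --master-user-password changeme123 \
      --availability-zone eu-west-1a \
      --no-multi-az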

Each RDS instance is paired with a t2.large EC2 instance running Ubuntu 14.04.3 LTS that also sits in eu-west-1a. These instances will run the data import and R-based analysis scripts. Each of these instances has 500 GB of General Purpose SSD (GP2) capacity and 1,500 IOPS at its disposal. The instances are rated for low to moderate network performance. Each of these instances costs $4.57 / day including storage.
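
An equivalent EC2 launch could look something like the following sketch; the AMI ID stands in for an Ubuntu 14.04.3 LTS image in eu-west-1 and the key pair name is a placeholder.

$ aws ec2 run-instances \
      --image-id ami-XXXXXXXX \
      --count 1 \
      --instance-type t2.large \
      --key-name my-key-pair \
      --placement AvailabilityZone=eu-west-1a \
      --block-device-mappings \
          '[{"DeviceName": "/dev/sda1",
             "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"}}]'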

Downloading 1.2 Billion Taxi Trips

In this benchmark I'll be downloading all the data fresh and importing it into each RDS instance individually from its paired EC2 instance. The original dataset of 1.1 billion taxi journeys has grown with the release of the second half of 2015's data and now contains a little over 1.2 billion journeys.

The following are the commands I ran to bootstrap each of the EC2 instances and download the 300+ GB of data.

$ sudo apt update
$ sudo apt install \
      build-essential \
      git \
      postgis \
      postgresql-client \
      ruby \
      ruby-dev \
      unzip \
      zlib1g-dev

$ sudo gem install \
      activesupport \
      roo \
      rest-client
$ git clone https://github.com/toddwschneider/nyc-taxi-data.git
$ cd nyc-taxi-data
$ cat raw_uber_data_urls.txt \
      raw_data_urls.txt | \
      xargs -n 1 -P 6 \
          wget -P data/
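
Downloading 300+ GB takes a while so, before kicking off the import, it's worth checking that every URL produced a file and that the total size is in the right ballpark. Assuming data/ started out empty, the first two counts below should match.

$ cat raw_uber_data_urls.txt \
      raw_data_urls.txt | \
      wc -l
$ ls data/ | wc -l
$ du -hs data/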

Importing 1.2 Billion Taxi Trips

In PostgreSQL, nyc-taxi-data is a perfectly fine database name but on RDS database names can only contain letters, numbers and underscores. For this reason I named each database trips. To get the Shell and Ruby scripts working with this new database name I patched them with the following:

$ sed -i 's/psql nyc-taxi-data/psql trips/' \
      {*.sh,tlc_statistics/import_statistics_data.rb}
$ sed -i 's/psql -d nyc-taxi-data/psql -d trips/' *.sh
$ sed -i 's/createdb/#createdb/' *.sh

I then created environment variables on each EC2 instance for their respective RDS pairings.

$ read PGHOST
$ read PGUSER
$ read PGPASSWORD

$ export PGHOST
$ export PGUSER
$ export PGPASSWORD
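
With those variables exported, psql picks up the host and credentials automatically. Note that, since the createdb calls were commented out above, the trips database needs to exist already; it can be created via the DB name setting when launching each RDS instance. A quick check confirms connectivity:

$ psql -d trips -c "SELECT version();"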

I then ran the database initialisation, population and statistics generation scripts.

$ time (
      ./initialize_database.sh;
      ./import_trip_data.sh;
      ./import_uber_trip_data.sh;
      cat analysis/prepare_analysis.sql \
            tlc_statistics/create_statistics_tables.sql | \
            psql trips;
      cd tlc_statistics;
      ruby import_statistics_data.rb
  )
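
With 1.2 billion records an exact count(*) is itself a lengthy query, so the planner's row estimate from pg_class makes for a quicker sanity check that the import completed. This assumes trips is the main table created by the import scripts.

$ psql trips -c \
      "SELECT reltuples::bigint AS estimated_rows
       FROM pg_class
       WHERE relname = 'trips';"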

The following were the durations I observed:

  • The db.t2.large instance took 60 hours, 38 minutes and 42 seconds costing $13.98 excluding EC2 costs.
  • The db.r3.large instance took 71 hours, 29 minutes and 27 seconds costing $25.56 excluding EC2 costs.
  • The db.m4.large instance took 62 hours, 13 minutes and 47 seconds costing $17.05 excluding EC2 costs.
  • The db.m4.xlarge instance took 51 hours, 43 minutes and 56 seconds costing $24.98 excluding EC2 costs.

R Up & Running

The following was run to install R and various other dependencies for the reports Todd Schneider wrote.

$ echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | \
    sudo tee -a /etc/apt/sources.list
$ gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
$ gpg -a --export E084DAB9 | sudo apt-key add -
$ sudo apt update

$ sudo apt install \
    git \
    libgdal-dev \
    libpq-dev \
    libproj-dev \
    r-base \
    r-base-dev
$ mkdir -p $HOME/.R_libs
$ export R_LIBS="$HOME/.R_libs"
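
The R_LIBS variable only lives in the current shell session. If you disconnect between installing the packages and running the reports it needs to be re-exported, or persisted:

$ echo 'export R_LIBS="$HOME/.R_libs"' >> ~/.bashrc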

$ echo 'requirements = c("ggplot2",
                         "ggmap",
                         "dplyr",
                         "reshape2",
                         "zoo",
                         "scales",
                         "extrafont",
                         "grid",
                         "RPostgreSQL",
                         "rgdal",
                         "maptools",
                         "gpclib")

        sapply(requirements,
               function(x) {
                    if (!x %in% installed.packages()[,"Package"])
                        install.packages(x, repos="http://cran.r-project.org")})' | \
  R --no-save
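
Package installs can fail quietly if a system library is missing, so it's worth loading a few of the heavier packages once before launching anything long-running; R will stop with an error if any of them didn't build.

$ echo 'library(RPostgreSQL)
        library(rgdal)
        library(ggmap)' | \
    R --no-save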

Before running the analysis.R file I needed to patch the database connector to use the environment variables I set earlier rather than the hard-coded localhost setup.

$ cd ~/nyc-taxi-data/analysis/
$ vi helpers.R

The following line in helpers.R:

con = dbConnect(dbDriver("PostgreSQL"), dbname = "nyc-taxi-data", host = "localhost")

was replaced with the following:

con = dbConnect(dbDriver("PostgreSQL"),
                dbname = "trips",
                host = Sys.getenv('PGHOST'),
                user = Sys.getenv('PGUSER'),
                password = Sys.getenv('PGPASSWORD'))
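
Before launching the full analysis it's worth confirming that R can reach the RDS instance. The following is a minimal sketch using the same connection parameters as the patched connector:

$ echo 'library(RPostgreSQL)
        con = dbConnect(dbDriver("PostgreSQL"),
                        dbname = "trips",
                        host = Sys.getenv("PGHOST"),
                        user = Sys.getenv("PGUSER"),
                        password = Sys.getenv("PGPASSWORD"))
        print(dbGetQuery(con, "SELECT 1 AS connected"))' | \
    R --no-save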

Benchmarking RDS

I ran the analysis.R script three times on each EC2 instance against its paired RDS instance. Each result has been rounded to the nearest second.

$ time (cat analysis.R | R --no-save)

The db.t2.large instance reported the following times:

  • Run 1: 5 minutes 21 seconds
  • Run 2: 5 minutes 25 seconds
  • Run 3: 5 minutes 39 seconds

The db.r3.large instance reported the following times:

  • Run 1: 6 minutes
  • Run 2: 5 minutes 30 seconds
  • Run 3: 5 minutes 34 seconds

The db.m4.large instance reported the following times:

  • Run 1: 5 minutes 43 seconds
  • Run 2: 5 minutes 34 seconds
  • Run 3: 5 minutes 33 seconds

The db.m4.xlarge instance reported the following times:

  • Run 1: 5 minutes 34 seconds
  • Run 2: 5 minutes 26 seconds
  • Run 3: 5 minutes 41 seconds

For this workload the reporting speeds don't line up well with the price differences between the RDS instances. I suspect this workload is biased towards R's CPU consumption when generating PNGs rather than RDS' performance when returning aggregate results. Each RDS instance also has the same number of IOPS at its disposal, which might erase any other performance advantage one could have over another.

As for the money spent importing the data into RDS, I suspect scaling up is more helpful when you have a number of concurrent users rather than a single, large job to execute.

