1.1 Billion Taxi Rides on Vertica & an Intel Core i5

Vertica is a distributed, column-oriented analytics database. It started life out in 2005 being developed by, among others, Michael Stonebraker. Stonebraker has a long history of developing database software including Ingres and PostgreSQL. In 2011, Vertica Systems was acquired by HP. Not long after a free community version of Vertica was released.

Vertica has had some famous users including Facebook, Groupon and Zynga. Maxime Beauchemin's LinkedIn profile describes part of his time at Facebook was spent on pushing the limits of Vertica to satisfy the firm's needs.

Vertica will be best suited on a cluster of machines but for this benchmark I'm interested in how fast it can work off of a single machine.

The dataset being used is the same one I've used to benchmark Amazon Athena, BigQuery, Elasticsearch, kdb+/q, MapD, PostgreSQL, Presto, Redshift and Spark. I've compiled a single-page summary of these benchmarks.

The Hardware

For this benchmark I'll be running a fresh installation of Ubuntu 14.04.3 LTS on an Intel Core i5 4670K clocked at 3.4 GHz, 16 GB of DDR3 RAM and a SanDisk SDSSDHII960G 960 GB SSD drive.

Vertica Up & Running

I'll first install a number of prerequisites.

$ sudo apt update
$ sudo apt install \
    dialog \
    libsensors4 \
    mcelog \
    ntp \
    pstack \
    sysstat

I'll then set the time zone on my system and make sure the system clock is synchronised.

$ sudo dpkg-reconfigure tzdata
$ sudo service ntp stop
$ sudo ntpd -gq
$ sudo service ntp start

For this benchmark I've downloaded version 8.0.1 of the Community Edition of Vertica to my home folder. The following will deploy Vertica's installation files.

$ sudo dpkg -i vertica_8.0.1-0_amd64.deb

Selecting previously unselected package vertica.
(Reading database ... 56852 files and directories currently installed.)
Preparing to unpack vertica_8.0.1-0_amd64.deb ...
Unpacking vertica (8.0.1-0) ...
Setting up vertica (8.0.1-0) ...

Vertica Analytic Database V8.0.1-0 successfully installed on host ubuntu

To complete your NEW installation and configure the cluster, run:
 /opt/vertica/sbin/install_vertica

To complete your Vertica UPGRADE, run:
 /opt/vertica/sbin/update_vertica

----------------------------------------------------------------------------------
Important
----------------------------------------------------------------------------------
Before upgrading Vertica, you must backup your database.  After you restart your
database after upgrading, you cannot revert to a previous Vertica software version.
----------------------------------------------------------------------------------

View the latest Vertica documentation at http://my.vertica.com/docs/

Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...

With those files in place I'll launch Vertica's installer.

$ sudo /opt/vertica/sbin/install_vertica \
    --hosts=127.0.0.1 \
    --failure-threshold=NONE

The installer will create a dbadmin user that I'll used to operate the database for this exercise.

Vertica Analytic Database 8.0.1-0 Installation Tool


>> Validating options...


Mapping hostnames in --hosts (-s) to addresses...

>> Starting installation tasks.
>> Getting system information for cluster (this may take a while)...

Default shell on nodes:
127.0.0.1 /bin/bash

>> Validating software versions (rpm or deb)...


>> Beginning new cluster creation...

successfully backed up admintools.conf on 127.0.0.1

>> Creating or validating DB Admin user/group...

Password for new dbadmin user (empty = disabled)
Successful on hosts (1): 127.0.0.1
    Provided DB Admin account details: user = dbadmin, group = verticadba, home = /home/dbadmin
    Creating group... Adding group
    Validating group... Okay
    Creating user... Adding user, Setting credentials
    Validating user... Okay


>> Validating node and cluster prerequisites...

Prerequisites not fully met during local (OS) configuration for
verify-127.0.0.1.xml:
    HINT (S0151): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0151
        These disks do not have known IO schedulers: '/dev/mapper/ubuntu--vg-
        root' ('') = ''
    HINT (S0305): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0305
        TZ is unset for dbadmin. Consider updating .profile or .bashrc
    WARN (S0170): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0170
        lsblk (LVM utility) indicates LVM on the data directory.
    FAIL (S0020): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0020
        Readahead size of  (/dev/mapper/ubuntu--vg-root) is too low for typical
        systems: 256 < 2048
    FAIL (S0310): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0310
        Transparent hugepages is set to 'always'. Must be 'never' or 'madvise'.

System prerequisites passed.  Threshold = NONE


>> Establishing DB Admin SSH connectivity...

Installing/Repairing SSH keys for dbadmin


>> Setting up each node and modifying cluster...

Creating Vertica Data Directory...

Updating agent...
Creating node node0001 definition for host 127.0.0.1
... Done

>> Sending new cluster configuration to all nodes...

Starting agent...

>> Completing installation...

Running upgrade logic
No spread upgrade required: /opt/vertica/config/vspread.conf not found on any node
Installation complete.

Please evaluate your hardware using Vertica's validation tools:
    https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=VALSCRIPT

To create a database:
  1. Logout and login as dbadmin. (see note below)
  2. Run /opt/vertica/bin/adminTools as dbadmin
  3. Select Create Database from the Configuration Menu

  Note: Installation may have made configuration changes to dbadmin
  that do not take effect until the next session (logout and login).

To add or remove hosts, select Cluster Management from the Advanced Menu.

The installer complained the dbadmin user doesn't have a time zone set for its newly created account so I'll set it manually before switching into the user.

$ sudo sh -c "echo 'export TZ=Europe/Tallinn' >> \
    /home/dbadmin/.bash_profile"
$ sudo su - dbadmin

I'll launch Vertica's Administration Tools to create a database.

$ /opt/vertica/bin/adminTools

This process is conducted through a series of text-based menus. I took the following steps through them.

Skip past the licence file.
Accept the EULA.
Select option #6 "Configuration Menu".
Select option #1 "Create Database".
Create a database named "trips".
Create a password for the new database.
Select the "127.0.0.1" host for the database.
Set the catalogue and data pathnames to /home/dbadmin.
After the database is created, go to the main menu and select "exit".

Loading 1.1 Billion Trips into Vertica

I'll create an environment variable to store the password for the database.

$ read VERTICA_PASS
$ export VERTICA_PASS

I'll then launch the command line client for Vertica.

$ /opt/vertica/bin/vsql \
    -U dbadmin \
    -w $VERTICA_PASS

The interface feels a lot like PostgreSQL. Running the \? command will output a list of available commands just like psql does.

dbadmin=> \?

See the Vertica Programmer's Guide for information on available commands.

General
  \c[onnect] [DBNAME|- [USER]]
                 connect to new database (currently "dbadmin")
  \cd [DIR]      change the current working directory
  \q             quit vsql
  \set [NAME [VALUE]]
                 set internal variable, or list all if no parameters
  \timing        toggle timing of commands (currently off)
  \unset NAME    unset (delete) internal variable
  \! [COMMAND]   execute command in shell or start interactive shell
  \password [USER]
                 change user's password

Query Buffer
  \e [FILE]      edit the query buffer (or file) with external editor
  \g             send query buffer to server
  \g FILE        send query buffer to server and results to file
  \g | COMMAND   send query buffer to server and pipe results to command
  \p             show the contents of the query buffer
  \r             reset (clear) the query buffer
  \s [FILE]      display history or save it to file
  \w FILE        write query buffer to file

Input/Output
  \echo [STRING] write string to standard output
  \i FILE        execute commands from file
  \o FILE        send all query results to file
  \o | COMMAND   pipe all query results to command
  \o             close query-results file or pipe
  \qecho [STRING]
                 write string to query output stream (see \o)

Informational
  \d [PATTERN]   describe tables (list tables if no argument is supplied)
                 PATTERN may include system schema name, e.g. v_catalog.*
  \df [PATTERN]  list functions
  \dj [PATTERN]  list projections
  \dn [PATTERN]  list schemas
  \dp [PATTERN]  list table access privileges
  \ds [PATTERN]  list sequences
  \dS [PATTERN]  list system tables. PATTERN may include system schema name
                 such as v_catalog, v_monitor, or v_internal.
                 Example: v_catalog.a*
  \dt [PATTERN]  list tables
  \dtv [PATTERN] list tables and views
  \dT [PATTERN]  list data types
  \du [PATTERN]  list users
  \dv [PATTERN]  list views
  \l             list all databases
  \z [PATTERN]   list table access privileges (same as \dp)

Formatting
  \a             toggle between unaligned and aligned output mode
  \b             toggle beep on command completion
  \C [STRING]    set table title, or unset if none
  \f [STRING]    show or set field separator for unaligned query output
  \H             toggle HTML output mode (currently off)
  \pset NAME [VALUE]
                 set table output option
                 (NAME := {format|border|expanded|fieldsep|footer|null|
                 recordsep|trailingrecordsep|tuples_only|title|tableattr|pager})
  \t             show only rows (currently off)
  \T [STRING]    set HTML <table> tag attributes, or unset if none
  \x             toggle expanded output (currently off)

I'll create a table that will store the 1.1 billion taxi trips dataset.

CREATE TABLE trips (
    trip_id                 INTEGER,
    vendor_id               VARCHAR(3),
    pickup_datetime         DATETIME,
    dropoff_datetime        DATETIME,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DECIMAL(18,14),
    pickup_latitude         DECIMAL(18,14),
    dropoff_longitude       DECIMAL(18,14),
    dropoff_latitude        DECIMAL(18,14),
    passenger_count         SMALLINT,
    trip_distance           DECIMAL(6,3),
    fare_amount             DECIMAL(6,2),
    extra                   DECIMAL(6,2),
    mta_tax                 DECIMAL(6,2),
    tip_amount              DECIMAL(6,2),
    tolls_amount            DECIMAL(6,2),
    ehail_fee               DECIMAL(6,2),
    improvement_surcharge   DECIMAL(6,2),
    total_amount            DECIMAL(6,2),
    payment_type            VARCHAR(3),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),

    cab_type                VARCHAR(6),

    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,

    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),

    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) ORDER BY pickup_datetime, dropoff_datetime;

I'll then exit to the command line and execute the following to load the dataset in.

The /home/mark/trips/ folder on my system has had it and its contents set to be owned by dbadmin. There are 56 gzip-compressed CSV files that make up the 1.1-billion-record dataset.

$ time (echo "COPY trips FROM '/home/mark/trips/trips_x*.csv.gz'
              GZIP DELIMITER ',' DIRECT;" | \
      /opt/vertica/bin/vsql \
          -U dbadmin \
          -w $VERTICA_PASS)

The above took 3 hours 56 minutes and 43 seconds to complete.

The dataset uses 153 GB of disk capacity when stored using Vertica's internal storage format.

$ du -hs /home/dbadmin/trips/v_trips_node0001_data/

153G    /home/dbadmin/trips/v_trips_node0001_data/

Benchmarking Vertica

I'll execute each query using the vsql command line tool.

$ /opt/vertica/bin/vsql \
    -U dbadmin \
    -w $VERTICA_PASS

To time the queries I'll switch on Vertica's timing mechanism using the \timing command.

\timing

The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use lowest query time as a way of indicating "top speed".

The following completed in 14.389 seconds.

SELECT cab_type,
       count(*)
FROM trips
GROUP BY cab_type;

The following completed in 32.148 seconds.

SELECT passenger_count,
       avg(total_amount)
FROM trips
GROUP BY passenger_count;

The following completed in 33.448 seconds.

SELECT passenger_count,
       year(pickup_datetime),
       count(*)
FROM trips
GROUP BY passenger_count,
         year(pickup_datetime);

The following completed in 67.312 seconds.

SELECT passenger_count,
       year(pickup_datetime) trip_year,
       round(trip_distance),
       count(*) trips
FROM trips
GROUP BY passenger_count,
         year(pickup_datetime),
         round(trip_distance)
ORDER BY trip_year,
         trips desc;

While I was running these queries I could see all 4 cores of my CPU being utilised but none were spiking much beyond 60% usage. The disk was reading at 34 MB/s even though it is capable of reaching over 500 MB/s.

I think these times are good considering the software is free of charge, behaves a lot like PostgreSQL and requires very little work to get up and running. That being said it would be interesting to see what sort of performance could be found with a cluster of machines, each with a lot more cores and perhaps some more tuning of Linux's environment.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.