Vertica is a distributed, column-oriented analytics database. Development began in 2005, with Michael Stonebraker among its creators; Stonebraker has a long history of building database software, including Ingres and PostgreSQL. In 2011, Vertica Systems was acquired by HP and, not long after, a free community edition of Vertica was released.
Vertica has had some well-known users, including Facebook, Groupon and Zynga. Maxime Beauchemin's LinkedIn profile describes how part of his time at Facebook was spent pushing the limits of Vertica to satisfy the firm's needs.
Vertica is best suited to running on a cluster of machines, but for this benchmark I'm interested in how fast it can work on a single machine.
The dataset being used is the same one I've used to benchmark Amazon Athena, BigQuery, Elasticsearch, kdb+/q, MapD, PostgreSQL, Presto, Redshift and Spark. I've compiled a single-page summary of these benchmarks.
The Hardware
For this benchmark I'll be running a fresh installation of Ubuntu 14.04.3 LTS on an Intel Core i5 4670K CPU clocked at 3.4 GHz, 16 GB of DDR3 RAM and a SanDisk SDSSDHII960G 960 GB SSD.
Vertica Up & Running
I'll first install a number of prerequisites.
$ sudo apt update
$ sudo apt install \
dialog \
libsensors4 \
mcelog \
ntp \
pstack \
sysstat
I'll then set the time zone on my system and make sure the system clock is synchronised.
$ sudo dpkg-reconfigure tzdata
$ sudo service ntp stop
$ sudo ntpd -gq
$ sudo service ntp start
For this benchmark I've downloaded version 8.0.1 of the Community Edition of Vertica to my home folder. The following will deploy Vertica's installation files.
$ sudo dpkg -i vertica_8.0.1-0_amd64.deb
Selecting previously unselected package vertica.
(Reading database ... 56852 files and directories currently installed.)
Preparing to unpack vertica_8.0.1-0_amd64.deb ...
Unpacking vertica (8.0.1-0) ...
Setting up vertica (8.0.1-0) ...
Vertica Analytic Database V8.0.1-0 successfully installed on host ubuntu
To complete your NEW installation and configure the cluster, run:
/opt/vertica/sbin/install_vertica
To complete your Vertica UPGRADE, run:
/opt/vertica/sbin/update_vertica
----------------------------------------------------------------------------------
Important
----------------------------------------------------------------------------------
Before upgrading Vertica, you must backup your database. After you restart your
database after upgrading, you cannot revert to a previous Vertica software version.
----------------------------------------------------------------------------------
View the latest Vertica documentation at http://my.vertica.com/docs/
Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
Processing triggers for libc-bin (2.19-0ubuntu6.6) ...
With those files in place I'll launch Vertica's installer.
$ sudo /opt/vertica/sbin/install_vertica \
--hosts=127.0.0.1 \
--failure-threshold=NONE
The installer will create a dbadmin user that I'll use to operate the database for this exercise.
Vertica Analytic Database 8.0.1-0 Installation Tool
>> Validating options...
Mapping hostnames in --hosts (-s) to addresses...
>> Starting installation tasks.
>> Getting system information for cluster (this may take a while)...
Default shell on nodes:
127.0.0.1 /bin/bash
>> Validating software versions (rpm or deb)...
>> Beginning new cluster creation...
successfully backed up admintools.conf on 127.0.0.1
>> Creating or validating DB Admin user/group...
Password for new dbadmin user (empty = disabled)
Successful on hosts (1): 127.0.0.1
Provided DB Admin account details: user = dbadmin, group = verticadba, home = /home/dbadmin
Creating group... Adding group
Validating group... Okay
Creating user... Adding user, Setting credentials
Validating user... Okay
>> Validating node and cluster prerequisites...
Prerequisites not fully met during local (OS) configuration for
verify-127.0.0.1.xml:
HINT (S0151): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0151
These disks do not have known IO schedulers: '/dev/mapper/ubuntu--vg-
root' ('') = ''
HINT (S0305): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0305
TZ is unset for dbadmin. Consider updating .profile or .bashrc
WARN (S0170): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0170
lsblk (LVM utility) indicates LVM on the data directory.
FAIL (S0020): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0020
Readahead size of (/dev/mapper/ubuntu--vg-root) is too low for typical
systems: 256 < 2048
FAIL (S0310): https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=S0310
Transparent hugepages is set to 'always'. Must be 'never' or 'madvise'.
System prerequisites passed. Threshold = NONE
>> Establishing DB Admin SSH connectivity...
Installing/Repairing SSH keys for dbadmin
>> Setting up each node and modifying cluster...
Creating Vertica Data Directory...
Updating agent...
Creating node node0001 definition for host 127.0.0.1
... Done
>> Sending new cluster configuration to all nodes...
Starting agent...
>> Completing installation...
Running upgrade logic
No spread upgrade required: /opt/vertica/config/vspread.conf not found on any node
Installation complete.
Please evaluate your hardware using Vertica's validation tools:
https://my.vertica.com/docs/8.0.x/HTML/index.htm#cshid=VALSCRIPT
To create a database:
1. Logout and login as dbadmin. (see note below)
2. Run /opt/vertica/bin/adminTools as dbadmin
3. Select Create Database from the Configuration Menu
Note: Installation may have made configuration changes to dbadmin
that do not take effect until the next session (logout and login).
To add or remove hosts, select Cluster Management from the Advanced Menu.
The installer complained that the newly created dbadmin account doesn't have a time zone set, so I'll set one manually before switching to that user.
$ sudo sh -c "echo 'export TZ=Europe/Tallinn' >> \
/home/dbadmin/.bash_profile"
$ sudo su - dbadmin
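Because I passed --failure-threshold=NONE, the installer tolerated the readahead and transparent hugepage failures. If you wanted to clear those two complaints, something along these lines should work; the device path is the one the installer reported, and neither setting survives a reboot unless persisted (via /etc/rc.local, for example).

```shell
# Raise the readahead on the data volume to the 2048 sectors Vertica expects.
sudo blockdev --setra 2048 /dev/mapper/ubuntu--vg-root

# Set transparent hugepages to 'never' ('madvise' is also accepted).
sudo sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
```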
I'll launch Vertica's Administration Tools to create a database.
$ /opt/vertica/bin/adminTools
This process is conducted through a series of text-based menus. I took the following steps through them.
- Skip past the licence file.
- Accept the EULA.
- Select option #6 "Configuration Menu".
- Select option #1 "Create Database".
- Create a database named "trips".
- Create a password for the new database.
- Select the "127.0.0.1" host for the database.
- Set the catalogue and data pathnames to /home/dbadmin.
- After the database is created, go to the main menu and select "exit".
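As an aside, adminTools also has a non-interactive command-line mode that could script the menu steps above. The following is a sketch rather than something I ran for this benchmark; check /opt/vertica/bin/adminTools -t create_db --help for the exact flag names on your version.

```shell
# Non-interactive sketch of the menu steps above:
# -d database name, -p database password (placeholder below),
# -s host list, -c catalogue path, -D data path.
/opt/vertica/bin/adminTools -t create_db \
    -d trips \
    -p 'database-password-here' \
    -s 127.0.0.1 \
    -c /home/dbadmin \
    -D /home/dbadmin
```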
Loading 1.1 Billion Trips into Vertica
I'll create an environment variable to store the password for the database.
$ read VERTICA_PASS
$ export VERTICA_PASS
I'll then launch the command line client for Vertica.
$ /opt/vertica/bin/vsql \
-U dbadmin \
-w $VERTICA_PASS
The interface feels a lot like PostgreSQL. Running the \? command will output a list of available commands just like psql does.
dbadmin=> \?
See the Vertica Programmer's Guide for information on available commands.
General
\c[onnect] [DBNAME|- [USER]]
connect to new database (currently "dbadmin")
\cd [DIR] change the current working directory
\q quit vsql
\set [NAME [VALUE]]
set internal variable, or list all if no parameters
\timing toggle timing of commands (currently off)
\unset NAME unset (delete) internal variable
\! [COMMAND] execute command in shell or start interactive shell
\password [USER]
change user's password
Query Buffer
\e [FILE] edit the query buffer (or file) with external editor
\g send query buffer to server
\g FILE send query buffer to server and results to file
\g | COMMAND send query buffer to server and pipe results to command
\p show the contents of the query buffer
\r reset (clear) the query buffer
\s [FILE] display history or save it to file
\w FILE write query buffer to file
Input/Output
\echo [STRING] write string to standard output
\i FILE execute commands from file
\o FILE send all query results to file
\o | COMMAND pipe all query results to command
\o close query-results file or pipe
\qecho [STRING]
write string to query output stream (see \o)
Informational
\d [PATTERN] describe tables (list tables if no argument is supplied)
PATTERN may include system schema name, e.g. v_catalog.*
\df [PATTERN] list functions
\dj [PATTERN] list projections
\dn [PATTERN] list schemas
\dp [PATTERN] list table access privileges
\ds [PATTERN] list sequences
\dS [PATTERN] list system tables. PATTERN may include system schema name
such as v_catalog, v_monitor, or v_internal.
Example: v_catalog.a*
\dt [PATTERN] list tables
\dtv [PATTERN] list tables and views
\dT [PATTERN] list data types
\du [PATTERN] list users
\dv [PATTERN] list views
\l list all databases
\z [PATTERN] list table access privileges (same as \dp)
Formatting
\a toggle between unaligned and aligned output mode
\b toggle beep on command completion
\C [STRING] set table title, or unset if none
\f [STRING] show or set field separator for unaligned query output
\H toggle HTML output mode (currently off)
\pset NAME [VALUE]
set table output option
(NAME := {format|border|expanded|fieldsep|footer|null|
recordsep|trailingrecordsep|tuples_only|title|tableattr|pager})
\t show only rows (currently off)
\T [STRING] set HTML <table> tag attributes, or unset if none
\x toggle expanded output (currently off)
I'll create a table that will store the 1.1 billion taxi trips dataset.
CREATE TABLE trips (
trip_id INTEGER,
vendor_id VARCHAR(3),
pickup_datetime DATETIME,
dropoff_datetime DATETIME,
store_and_fwd_flag VARCHAR(1),
rate_code_id SMALLINT,
pickup_longitude DECIMAL(18,14),
pickup_latitude DECIMAL(18,14),
dropoff_longitude DECIMAL(18,14),
dropoff_latitude DECIMAL(18,14),
passenger_count SMALLINT,
trip_distance DECIMAL(6,3),
fare_amount DECIMAL(6,2),
extra DECIMAL(6,2),
mta_tax DECIMAL(6,2),
tip_amount DECIMAL(6,2),
tolls_amount DECIMAL(6,2),
ehail_fee DECIMAL(6,2),
improvement_surcharge DECIMAL(6,2),
total_amount DECIMAL(6,2),
payment_type VARCHAR(3),
trip_type SMALLINT,
pickup VARCHAR(50),
dropoff VARCHAR(50),
cab_type VARCHAR(6),
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10),
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13),
pickup_ct2010 VARCHAR(6),
pickup_boroct2010 VARCHAR(7),
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR(4),
pickup_ntaname VARCHAR(56),
pickup_puma VARCHAR(4),
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10),
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13),
dropoff_ct2010 VARCHAR(6),
dropoff_boroct2010 VARCHAR(7),
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR(4),
dropoff_ntaname VARCHAR(56),
dropoff_puma VARCHAR(4)
) ORDER BY pickup_datetime, dropoff_datetime;
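The ORDER BY clause isn't part of standard CREATE TABLE syntax: in Vertica it sets the sort order of the superprojection that physically stores the table's data. Once the table exists, I'd expect the projection to be visible in the v_catalog.projections system table, along these lines:

```sql
SELECT projection_name,
       anchor_table_name
FROM v_catalog.projections
WHERE anchor_table_name = 'trips';
```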
I'll then exit to the command line and execute the following to load the dataset in.
The /home/mark/trips/ folder on my system, along with its contents, is owned by dbadmin. There are 56 gzip-compressed CSV files that make up the 1.1-billion-record dataset.
$ time (echo "COPY trips FROM '/home/mark/trips/trips_x*.csv.gz'
GZIP DELIMITER ',' DIRECT;" | \
/opt/vertica/bin/vsql \
-U dbadmin \
-w $VERTICA_PASS)
The above took 3 hours 56 minutes and 43 seconds to complete.
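For context, the load time works out to the following rough throughput, taking the dataset's often-quoted figure of about 1.1 billion records as the row count:

```shell
# Convert the 3h 56m 43s load time into seconds.
LOAD_SECONDS=$((3 * 3600 + 56 * 60 + 43))
echo "$LOAD_SECONDS seconds"                    # 14203 seconds

# Rough rows-per-second figure, assuming ~1.1 billion rows.
echo "$((1100000000 / LOAD_SECONDS)) rows/sec"  # roughly 77,000 rows/sec
```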
The dataset uses 153 GB of disk capacity when stored using Vertica's internal storage format.
$ du -hs /home/dbadmin/trips/v_trips_node0001_data/
153G /home/dbadmin/trips/v_trips_node0001_data/
Benchmarking Vertica
I'll execute each query using the vsql command line tool.
$ /opt/vertica/bin/vsql \
-U dbadmin \
-w $VERTICA_PASS
To time the queries I'll switch on Vertica's timing mechanism using the \timing command.
\timing
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use lowest query time as a way of indicating "top speed".
The following completed in 14.389 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 32.148 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 33.448 seconds.
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips
GROUP BY passenger_count,
year(pickup_datetime);
The following completed in 67.312 seconds.
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;
While I was running these queries I could see all 4 of my CPU cores being utilised, but none spiked much beyond 60% usage. The disk was reading at 34 MB/s even though it's capable of over 500 MB/s.
I think these times are good considering the software is free of charge, behaves a lot like PostgreSQL and requires very little work to get up and running. That being said, it would be interesting to see what sort of performance a cluster of machines could deliver, each with many more cores, and perhaps with some further tuning of the Linux environment.