Many believe that for near-instant analytics on billions of records you'd need dedicated Linux clusters, several GPUs or proprietary Cloud offerings. Some of my fastest benchmarks were run on such environments. But in 2020, an off-the-shelf MacBook Pro using OmniSciDB (formerly MapD) can happily do the job.
In large enterprises, getting sign off for new clusters and bringing in expertise to keep them operational is rarely a quick process. If the value of a new dataset hasn't yet been proven then the bureaucratic hurdles can end up putting businesses off from using some of the fastest analytical offerings on the market.
To add to this, many industries need to either keep their data isolated from the Cloud, have policies to not spend budget with certain Cloud vendors, see the transferring of data too lengthy for their time-sensitive needs or impractical to shift their datasets reliably with their existing infrastructure and expertise. Only software that can run on mainstream hardware on-premise or within an enterprise's existing data centre is likely to see adoption.
It's rare to find anyone in an office environment that doesn't at least have a laptop but most don't come with top-notch Nvidia GPUs, let alone run CUDA-friendly Linux environments. This is why I was excited to hear that OmniSciDB would not only target Intel CPUs as a first-class platform but their offering would install and run in a user-friendly manner on macOS, just as any other mac software would.
OmniSciDB's GPU offering still outperforms their CPU offering by some 3.2x according to benchmarks published on their homepage but good should never be the enemy of perfect. I've always aimed to offer my clients the most practical solutions and sometimes that is a laptop proving there is value in their data.
In this post, I'm going to see how fast OmniSciDB 5.3.1 can run the 1.1 billion taxi rides benchmark using a 16" MacBook Pro running macOS. This dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.
The 16" MacBook Pro
The laptop I'll be using is a 2019, Space Grey, 16" MacBook Pro running macOS 10.15.5. The CPU is an 8-core Intel Core i9 running at 2.4 GHz. There is 64 GB of 2666 MHz DDR4 RAM and an Apple AP2048N SSD with 2 TB of capacity connected via PCIe. AmorphousDiskMark 2.5.4 was able to read from this disk at 3,454.22 MB/s and write at 3,283.94 MB/s when working with 128 KB blocks sequentially with a queue depth of 32.
The GPU won't be used by OmniSciDB in this benchmark but for the record it's an AMD Radeon Pro 5500M with 8GB of GDDR6 GPU RAM. This discrete GPU was a $200 upgrade over the stock GPU Apple ships with this notebook. Nonetheless, it won't have a material impact on this benchmark. While OmniSci can leverage Nvidia GPUs to accelerate queries, it is running entirely on the CPU in this benchmark.
This machine currently retails for about $4,000 in the US before any sales taxes.
Importing 1.1 Billion Trips Into OmniSciDB
The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over six years. The raw dataset lives as 56 GZIP-compressed CSV files that are 104 GB when compressed and need 500 GB of space when decompressed. The Billion Taxi Rides in Redshift blog post goes into detail regarding how I put this dataset together and describes the columns it contains in further detail.
OmniSciDB has a Java dependency so I've opted to install OpenJDK 8 LTS using the HotSpot JVM from AdoptOpenJDK.
The default number of file descriptors a process can have open on macOS is 256. OmniSciDB will likely use more depending on the size of data, number of tables and number of concurrent connections. This limit can be increased on a per-session basis but instead I'll make the upgrade permanent with the following commands.
$ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxproc.plist
$ curl -O https://raw.githubusercontent.com/wilsonmar/mac-setup/master/configs/limit.maxfiles.plist
$ sudo cp limit.max{proc,files}.plist /Library/LaunchDaemons/
$ sudo chmod 644 /Library/LaunchDaemons/limit.max{proc,files}.plist
$ sudo launchctl load -w /Library/LaunchDaemons/limit.maxproc.plist
$ sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
The following proves that the file descriptors limit was increased.
$ ulimit -n
524288
I was provided with a build of OmniSciDB 5.3.1 for macOS that I'll install via the Terminal in my home directory. I've decompressed the distributable and created a symlink so that I can address the parent folder as ~/omnisci.
$ cd ~
$ tar xvf omnisci-ee-master-Darwin-x86_64-cpu.tar.gz
$ ln -sfn $(ls -dt omnisci-ee* | head -n1) ~/omnisci
I'll then create a data folder for OmniSciDB.
$ mkdir -p ~/omnisci-data
I'll then initialise the data folder and launch OmniSciDB's Server.
$ cd ~/omnisci
$ bin/initdb ~/omnisci-data
$ bin/omnisci_server --data ~/omnisci-data
The above server process was kept running in a Terminal. You could also run it as a daemon, in a screen session or via tmux if you wish.
I'll create an alias to the OmniSciDB client and include the default credentials. This will let me launch it by typing omnisql in the Terminal.
$ alias omnisql="~/omnisci/bin/omnisql -p HyperInteractive"
The following will create the table for the taxi trips dataset.
$ omnisql
DROP TABLE IF EXISTS trips;
CREATE TABLE trips (
trip_id INTEGER,
vendor_id TEXT ENCODING DICT(8),
pickup_datetime TIMESTAMP ENCODING FIXED(32),
dropoff_datetime TIMESTAMP ENCODING FIXED(32),
store_and_fwd_flag TEXT ENCODING DICT(8),
rate_code_id SMALLINT,
pickup_longitude DECIMAL(14,2),
pickup_latitude DECIMAL(14,2),
dropoff_longitude DECIMAL(14,2),
dropoff_latitude DECIMAL(14,2),
passenger_count SMALLINT,
trip_distance DECIMAL(14,2),
fare_amount DECIMAL(14,2),
extra DECIMAL(14,2),
mta_tax DECIMAL(14,2),
tip_amount DECIMAL(14,2),
tolls_amount DECIMAL(14,2),
ehail_fee DECIMAL(14,2),
improvement_surcharge DECIMAL(14,2),
total_amount DECIMAL(14,2),
payment_type TEXT ENCODING DICT(8),
trip_type SMALLINT,
pickup TEXT ENCODING DICT(16),
dropoff TEXT ENCODING DICT(16),
cab_type TEXT ENCODING DICT(8),
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel TEXT ENCODING DICT(16),
pickup_borocode SMALLINT,
pickup_boroname TEXT ENCODING DICT(8),
pickup_ct2010 TEXT ENCODING DICT(16),
pickup_boroct2010 TEXT ENCODING DICT(16),
pickup_cdeligibil TEXT ENCODING DICT(8),
pickup_ntacode TEXT ENCODING DICT(8),
pickup_ntaname TEXT ENCODING DICT(8),
pickup_puma TEXT ENCODING DICT(8),
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel TEXT ENCODING DICT(16),
dropoff_borocode SMALLINT,
dropoff_boroname TEXT ENCODING DICT(8),
dropoff_ct2010 TEXT ENCODING DICT(16),
dropoff_boroct2010 TEXT ENCODING DICT(16),
dropoff_cdeligibil TEXT ENCODING DICT(8),
dropoff_ntacode TEXT ENCODING DICT(8),
dropoff_ntaname TEXT ENCODING DICT(8),
dropoff_puma TEXT ENCODING DICT(8)
) WITH (FRAGMENT_SIZE=75000000);
OmniSciDB can import GZIP-compressed CSV files without needing them decompressed ahead of time. It also allows wild stars / globs so the 56 CSV files don't have to be named individually. This is both a big time and disk space saver.
Below I'll run a SQL command that will import the entire dataset into OmniSciDB.
COPY trips
FROM '/Users/mark/taxi_csv/*.gz'
WITH (HEADER='false');
The above managed to complete in 31 minutes and 40 seconds. The resulting import produced 294 GB of data in OmniSciDB's internal format.
Benchmarking OmniSciDB
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
$ omnisql
\timing
The following completed in 0.134 seconds.
SELECT cab_type,
count(*)
FROM trips
GROUP BY cab_type;
The following completed in 0.349 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips
GROUP BY passenger_count;
The following completed in 0.542 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
count(*)
FROM trips
GROUP BY passenger_count,
pickup_year;
The following completed in 3.312 seconds.
SELECT passenger_count,
extract(year from pickup_datetime) AS pickup_year,
cast(trip_distance as int) AS distance,
count(*) AS the_count
FROM trips
GROUP BY passenger_count,
pickup_year,
distance
ORDER BY pickup_year,
the_count desc;
Final Thoughts
The Q1 time is the fastest for any workstation benchmark I've done. To get this level of performance on a regular piece of office equipment is a big game changer. The laptop might seem expensive but it's a one-off purchase that can be depreciated over a few years.
There is something magical when a client points their Tableau installation at OmniSciDB and everything they throw at it appears to run instantaneously. The less friction between questions and answers means more time discovering the value of data. The more questions you ask of your data the greater the chance of discovering the unexpected.
And for the record, for those without a Tableau license to hand, OmniSciDB has a visualisation package called Immerse which also does an amazing job at near-instant visualisation on billions of rows.
To have OmniSciDB running on a regular MacBook Pro and optimised for the Intel CPUs Apple ships with is going to do amazing things for the world of analytics.