A few weeks ago I published a blog post on importing the metadata of a billion+ taxi rides in New York City into Redshift. Since then I've been looking at getting the same dataset into other data stores.
This blog post will cover the steps I took to pull all billion+ records into a Hadoop cluster running in Docker containers on my local machine. I used Docker containers so that I could simulate the steps and shell commands needed on an actual Hadoop cluster without having to invest in multiple machines.
I'll contrast the Presto and Hive execution engines and the underlying data formats of ORC and Parquet.
Why ORC over Parquet?
Hive and Presto support working with data stored in several formats. Among others, Optimized Row Columnar (ORC)- and Parquet-formatted data can be read from, and in some cases written to, by Hive and Presto. Each format has its own advantages and trade-offs, as well as inconsistent behaviours depending on whether it's being used by Hive or by Presto.
For example, as of this writing, VARCHAR(n) and STRING fields in Parquet format aren't yet supported in Presto. INT and SMALLINT data types will be cast to BIGINT when used with Presto, but DECIMAL fields need to be changed to the DOUBLE data type for Presto to be able to see them.
At the 2014 Hadoop Summit, Dain Sundstrom, one of the members of the Presto team at Facebook, stated that ORC-formatted data was performing the best among the file formats he was using in his microbenchmarks with Presto. He went on to state that, at the time, the Parquet reader code in Hive left a lot to be desired in terms of performance.
Presto's ORC reader supports predicate pushdown (the ability to skip past unneeded data), lazy reads and vectorized reads, which, until recently, had yet to be ported to the Parquet reader in Presto. ORC files, much like AWS Redshift's Zone Maps, contain the minimum and maximum value of each column per stripe (about 1 million rows) and for every 10,000 rows. These features can bring significant performance increases for certain workloads.
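As an illustration, a query filtering on a narrow date range only needs to read the stripes and row groups whose min/max statistics overlap the filter. The query below is a hypothetical sketch; it assumes the trips_orc table that's created later in this post and a filter on a column whose values are reasonably well clustered within the files.

SELECT count(*)
FROM trips_orc
WHERE pickup_datetime >= TIMESTAMP '2015-06-01'
  AND pickup_datetime <  TIMESTAMP '2015-07-01';

With predicate pushdown, stripes whose pickup_datetime range falls entirely outside June 2015 can be skipped without being decompressed.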
Why Presto & ORC over PostgreSQL?
Presto can use multiple threads per worker across multiple machines when executing a query. If you have three computers each with a 4-core processor then you'd effectively have 12 cores to execute your query across. PostgreSQL is process-based rather than thread-based so a single query can only use a single CPU core. PostgreSQL doesn't yet support the task of breaking up a single query so that it can run across multiple machines.
ORC format uses run-length encoding (read: compression) on integer columns and dictionary encoding on string columns. This means the data stored as an ORC file will use up less disk space than it would as a PostgreSQL database or as gzip files.
When the trip data was imported into PostgreSQL it was 321 GB. Adding b-tree indices to the various columns could potentially double the size of the whole database.
When the data was denormalised and dumped to CSV format it was around 500 GB uncompressed and 104 GB when gzip'ed.
When the data is imported into HDFS using out-of-the-box settings and stored in ORC format it only needs 42 GB of space on each HDFS data node that the data lives on.
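If you want to check this figure on your own cluster once the data has been loaded later in this post, hadoop fs -du will report how much space the table's files occupy in HDFS. The path below is Hive's default warehouse location; adjust it if your setup differs.

$ hadoop fs -du -s -h /user/hive/warehouse/trips_orc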
Hive Up and Running
Hive is used to create the ORC-formatted files Presto will work with and to run the Metastore that Presto relies on. But beyond the benchmark comparison below, Hive won't be used to run the analytical queries themselves. Facebook have stated that Presto is able to run queries significantly faster than Hive, as my benchmarks below will show. Despite that, as of version 0.138 of Presto, there are some steps in the ETL process that Presto still leans on Hive for.
This process begins with a fresh Ubuntu 15 installation acting as the host for the Docker containers. Apache Bigtop will then run various scripts to set up a Hadoop cluster within those containers. I've allocated 8 GB of memory to the machine and there is 2 TB of mechanical drive space available.
To start, the following will set up the dependencies needed to build and run the Docker containers. Docker will be installed via a 3rd-party repository so we'll need to add that repository's signing key and details first.
$ sudo apt install software-properties-common
$ sudo apt-key adv \
--keyserver hkp://p80.pool.sks-keyservers.net:80 \
--recv-keys 58118E89F3A912897C070ADBF76221572C52609D
$ sudo add-apt-repository \
"deb https://apt.dockerproject.org/repo ubuntu-wily main"
$ sudo apt update
If there happens to be any existing Docker installation on the system make sure it's removed to avoid any version conflicts.
$ sudo apt purge lxc-docker
You may need extra drivers which are left out of the base kernel package that comes with Ubuntu. I found this wasn't needed with Ubuntu 15, but if the following command does install anything then restart your box after the installation completes.
$ sudo apt install \
linux-image-extra-$(uname -r)
If you see the following then no extra drivers were installed and you can continue on without a reboot.
0 to upgrade, 0 to newly install, 0 to remove and 83 not to upgrade.
The following will install Docker, VirtualBox and Vagrant:
$ sudo apt install \
docker-engine \
virtualbox \
vagrant
$ sudo service docker start
With Docker ready I'll check out the Bigtop git repository and launch the Ubuntu 14.04-based containers (as of this writing this is the latest supported version of Ubuntu on Intel-based systems).
$ git clone https://github.com/apache/bigtop.git
$ cd bigtop/bigtop-deploy/vm/vagrant-puppet-docker/
$ sudo docker pull bigtop/deploy:ubuntu-14.04
Below are the settings I've saved into the vagrantconfig.yaml configuration file.
$ vi vagrantconfig.yaml
docker:
  memory_size: "8192"
  image: "bigtop/deploy:ubuntu-14.04"

boot2docker:
  memory_size: "8192"
  number_cpus: "2"

repo: "http://bigtop-repos.s3.amazonaws.com/releases/1.0.0/ubuntu/trusty/x86_64"
distro: debian
components: [hadoop, yarn, hive]
namenode_ui_port: "50070"
yarn_ui_port: "8088"
hbase_ui_port: "60010"
enable_local_repo: false
smoke_test_components: [mapreduce, pig]
jdk: "openjdk-7-jdk"
I'll then launch a 3-node cluster.
$ sudo ./docker-hadoop.sh --create 3
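Once that command finishes you can check that the containers are up and running. This is just a sanity check; the container names are generated by Bigtop and Docker so they may differ on your machine.

$ sudo docker ps
$ sudo vagrant status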
Downloading 104 GB of Taxi Data
In my previous blog post on importing 1.1 billion taxi trips into Redshift, I created 56 ~2 GB gzip files containing denormalised taxi trip data in CSV format. You should generate these files if you haven't already and store them in the ~/bigtop/bigtop-deploy/vm/vagrant-puppet-docker folder so they'll be available to each Docker container in the Hadoop cluster.
These files will be referred to throughout this exercise so I'll create a manifest of their file names. The lines shouldn't contain any paths so we can use them in S3 URLs and HDFS paths without needing to transform them. If you'd rather not type the list out by hand, there's a one-liner to generate it after the listing below.
$ vi trips.manifest
trips_xaa.csv.gz
trips_xab.csv.gz
trips_xac.csv.gz
trips_xad.csv.gz
trips_xae.csv.gz
trips_xaf.csv.gz
trips_xag.csv.gz
trips_xah.csv.gz
trips_xai.csv.gz
trips_xaj.csv.gz
trips_xak.csv.gz
trips_xal.csv.gz
trips_xam.csv.gz
trips_xan.csv.gz
trips_xao.csv.gz
trips_xap.csv.gz
trips_xaq.csv.gz
trips_xar.csv.gz
trips_xas.csv.gz
trips_xat.csv.gz
trips_xau.csv.gz
trips_xav.csv.gz
trips_xaw.csv.gz
trips_xax.csv.gz
trips_xay.csv.gz
trips_xaz.csv.gz
trips_xba.csv.gz
trips_xbb.csv.gz
trips_xbc.csv.gz
trips_xbd.csv.gz
trips_xbe.csv.gz
trips_xbf.csv.gz
trips_xbg.csv.gz
trips_xbh.csv.gz
trips_xbi.csv.gz
trips_xbj.csv.gz
trips_xbk.csv.gz
trips_xbl.csv.gz
trips_xbm.csv.gz
trips_xbn.csv.gz
trips_xbo.csv.gz
trips_xbp.csv.gz
trips_xbq.csv.gz
trips_xbr.csv.gz
trips_xbs.csv.gz
trips_xbt.csv.gz
trips_xbu.csv.gz
trips_xbv.csv.gz
trips_xbw.csv.gz
trips_xbx.csv.gz
trips_xby.csv.gz
trips_xbz.csv.gz
trips_xca.csv.gz
trips_xcb.csv.gz
trips_xcc.csv.gz
trips_xcd.csv.gz
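If the gzip files are already sitting in the vagrant folder you can generate the manifest rather than typing it out. This assumes the files follow the trips_x*.csv.gz naming pattern shown above.

$ ls trips_x*.csv.gz > trips.manifest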
If you've kept these files in an S3 bucket then the following should download them. Be sure to change the bucket name to the one you've used.
$ sudo apt install s3cmd
$ s3cmd --configure
$ while read filename; do
s3cmd get s3://trips_metadata_example/$filename
done < trips.manifest
SSH into the Hadoop Cluster
With all that in place I'll ssh into the first Bigtop instance.
$ sudo vagrant ssh bigtop1
Inside the container I'll configure the Hive Metastore. This contains all the metadata needed for Hive and Presto to work with the data files we'll be storing in HDFS. The Metastore will be configured to use MySQL as its data backend.
$ apt install \
libmysql-java \
mysql-server
$ ln -s /usr/share/java/mysql-connector-java.jar \
/usr/lib/hive/lib/mysql-connector-java.jar
$ /etc/init.d/mysql start
$ mysql -uroot -proot -e'CREATE DATABASE hcatalog;'
When you install MySQL you'll be prompted to set a root login and password. I've set both values to root for this exercise.
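Before moving on it's worth confirming that MySQL is running and that the hcatalog database was created. This is an optional check.

$ mysql -uroot -proot -e'SHOW DATABASES;'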
Below is the Hive site configuration file.
$ vi /etc/hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>bigtop1.docker</value>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>mr</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost/hcatalog?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>hive.hwi.war.file</name>
        <value>/usr/lib/hive/lib/hive-hwi.war</value>
    </property>
</configuration>
I then made sure the Metastore's database exists.
$ echo 'CREATE DATABASE IF NOT EXISTS metastore_db;' | hive
After that I launched the Metastore service in the background. You'll see it bound to TCP port 9083 if it's running.
$ hive --service metastore &
$ netstat -an | grep 9083
Setup Tables in Hive
We will need a holding table to bring the CSV-formatted data into before we can load that data into an ORC-formatted table. The CSV-formatted table will be loaded up one file at a time. That data will then be inserted into the ORC-formatted table and the CSV-formatted table will be truncated.
Presto doesn't support as many data types as Hive in either ORC- or Parquet-formatted files. In Presto, any column using a data type that is supported in Hive but isn't supported in Presto simply won't appear alongside the fields that do use supported data types. I've had cases where 50-odd fields showed for a table in Hive but only 29 fields appeared when I loaded the same table up in Presto.
If you use the INT and SMALLINT data types they'll be cast to BIGINT when used with Presto. DECIMAL fields need to be changed to use the DOUBLE data type before Presto will be able to see them.
Below I'll setup two Hive tables, one to import the CSV data into and an ORC-formatted table that will soon contain all billion+ records from the taxi dataset.
$ hive
CREATE TABLE trips_csv (
trip_id INT,
vendor_id VARCHAR(3),
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag VARCHAR(1),
rate_code_id SMALLINT,
pickup_longitude DECIMAL(18,14),
pickup_latitude DECIMAL(18,14),
dropoff_longitude DECIMAL(18,14),
dropoff_latitude DECIMAL(18,14),
passenger_count SMALLINT,
trip_distance DECIMAL(6,3),
fare_amount DECIMAL(6,2),
extra DECIMAL(6,2),
mta_tax DECIMAL(6,2),
tip_amount DECIMAL(6,2),
tolls_amount DECIMAL(6,2),
ehail_fee DECIMAL(6,2),
improvement_surcharge DECIMAL(6,2),
total_amount DECIMAL(6,2),
payment_type VARCHAR(3),
trip_type SMALLINT,
pickup VARCHAR(50),
dropoff VARCHAR(50),
cab_type VARCHAR(6),
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel VARCHAR(10),
pickup_borocode SMALLINT,
pickup_boroname VARCHAR(13),
pickup_ct2010 VARCHAR(6),
pickup_boroct2010 VARCHAR(7),
pickup_cdeligibil VARCHAR(1),
pickup_ntacode VARCHAR(4),
pickup_ntaname VARCHAR(56),
pickup_puma VARCHAR(4),
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel VARCHAR(10),
dropoff_borocode SMALLINT,
dropoff_boroname VARCHAR(13),
dropoff_ct2010 VARCHAR(6),
dropoff_boroct2010 VARCHAR(7),
dropoff_cdeligibil VARCHAR(1),
dropoff_ntacode VARCHAR(4),
dropoff_ntaname VARCHAR(56),
dropoff_puma VARCHAR(4)
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc;
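Before loading anything, it's worth checking that both tables exist and that the column definitions came out as intended. This is an optional check run from the Hive CLI.

SHOW TABLES;
DESCRIBE trips_orc;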
Importing 1.1 Billion Trips into an ORC-Formatted Table
The following commands will read the file names in the manifest one at a time. Each file will be loaded from the vagrant mount into HDFS. Once in HDFS, each file will be imported into the CSV-formatted table; LOAD DATA INPATH moves the file into the table's directory, so the copy uploaded to /tmp will be removed automatically. The data in the CSV-formatted table will then be loaded into the ORC-formatted table. Once that's done the CSV-formatted table will be truncated.
$ while read filename; do
      echo $filename
      hadoop fs -put /vagrant/$filename /tmp/
      echo "LOAD DATA INPATH '/tmp/$filename'
            INTO TABLE trips_csv;
            INSERT INTO TABLE trips_orc
            SELECT * FROM trips_csv;
            TRUNCATE TABLE trips_csv;" | hive
  done < /vagrant/trips.manifest
I ran this process on a mechanical drive and it took around 14 hours to complete.
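Before benchmarking anything it's worth sanity-checking that every record made it across. This optional count runs as a MapReduce job via Hive so expect it to take a few minutes.

SELECT count(*) FROM trips_orc;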
Querying with Hive
The SQL features available differ between Hive and Presto so the Presto-equivalent SQL queries shown later on in this blog post will differ slightly from the Hive SQL queries shown below.
The following completed in 11 minutes and 36 seconds:
SELECT cab_type,
count(*)
FROM trips_orc
GROUP BY cab_type;
The following completed in 13 minutes and 43 seconds:
SELECT passenger_count,
avg(total_amount)
FROM trips_orc
GROUP BY passenger_count;
The following completed in 15 minutes and 14 seconds:
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime);
The following completed in 23 minutes and 48 seconds:
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;
Presto Up and Running
I now want to query the same ORC-formatted data using Presto. Presto requires Java 8 so I'll install that first.
$ add-apt-repository ppa:webupd8team/java
$ apt update
$ apt install oracle-java8-installer
Facebook aim to release a new version of Presto every two weeks so check to see which version is the most recent when installing from their gzip'ed binary distribution.
$ cd ~
$ wget -c https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.138/presto-server-0.138.tar.gz
$ tar xzf presto-server-0.138.tar.gz
Presto requires a data folder in which it stores locks, logs and a few other items, and it also requires a number of configuration files before it can begin to work properly.
$ mkdir -p /root/datap
$ mkdir -p ~/presto-server-0.138/etc/catalog
$ cd ~/presto-server-0.138/etc
We need to create six configuration files. Below is an outline of where they live within the ~/presto-server-0.138/etc folder:
$ tree ~/presto-server-0.138/etc
etc
|-- catalog
| |-- hive.properties
| `-- jmx.properties
|-- config.properties
|-- jvm.config
|-- log.properties
`-- node.properties
Here are the commands to set the contents of each of the configuration files.
Below is the general server configuration. It will set up Presto to work in standalone mode. In a real-world scenario you'd want Presto running on several machines in order to maximise performance.
$ vi ~/presto-server-0.138/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=800MB
query.max-memory-per-node=200MB
discovery-server.enabled=true
discovery.uri=http://127.0.0.1:8080
JVM configuration:
$ vi ~/presto-server-0.138/etc/jvm.config
-server
-Xmx800M
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
Logging settings:
$ vi ~/presto-server-0.138/etc/log.properties
com.facebook.presto=INFO
Node settings:
$ vi ~/presto-server-0.138/etc/node.properties
node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/root/datap
Catalog settings for Java Management Extensions (JMX):
$ vi ~/presto-server-0.138/etc/catalog/jmx.properties
connector.name=jmx
Catalog settings for Hive:
$ vi ~/presto-server-0.138/etc/catalog/hive.properties
hive.metastore.uri=thrift://bigtop1.docker:9083
connector.name=hive-hadoop2
With those in place you can launch Presto's server.
$ ~/presto-server-0.138/bin/launcher start
It should expose a web frontend on port 8080 if it's running properly.
$ curl localhost:8080
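The launcher script can also report whether the server process is alive; this, together with the web UI on port 8080, is a quick way to confirm everything started cleanly.

$ ~/presto-server-0.138/bin/launcher status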
The following will set up the CLI tool for Presto:
$ cd ~
$ wget -c https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.138/presto-cli-0.138-executable.jar
$ mv presto-cli-0.138-executable.jar presto
$ chmod +x presto
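As a quick smoke test you can run a non-interactive query with the CLI's --execute flag before dropping into the interactive session used below. The catalog and schema names match the hive.properties file created earlier.

$ ~/presto \
    --server localhost:8080 \
    --catalog hive \
    --schema default \
    --execute 'SHOW TABLES;'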
Querying with Presto
It's no surprise that Presto will outperform Hive for many types of queries. Presto compiles queries down to byte code, which the JVM then turns into machine code; it uses flat data structures that help eliminate almost all garbage collection issues; and it uses pipelined execution to deliver results interactively.
Bar some syntax changes, I'll run the same queries on Presto that I ran on Hive and see how much quicker they perform.
$ ~/presto \
--server localhost:8080 \
--catalog hive \
--schema default
The following completed in 1 minute and 57 seconds (~6x faster than Hive):
SELECT cab_type,
count(*)
FROM trips_orc
GROUP BY cab_type;
The following completed in 2 minutes and 57 seconds (~4.5x faster than Hive):
SELECT passenger_count,
avg(total_amount)
FROM trips_orc
GROUP BY passenger_count;
The following completed in 4 minutes and 13 seconds (~3.5x faster than Hive):
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime);
The following completed in 5 minutes and 6 seconds (~4.5x faster than Hive):
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;