The Raspberry Pi is a £29, UK-built, single-board computer. To date more than 12.5 million units have been sold. In this benchmark I'll use three Raspberry Pis, a few Micro SD cards and an old 7200 RPM hard drive and see what sort of query performance Spark 2.2 can achieve on a cluster of these devices.
The dataset I'll be using is a data dump I've produced of 1.1 billion taxi trips conducted in New York City over a six year period. The Billion Taxi Rides in Redshift blog post goes into detail on how I put the dataset together. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, BrytlytDB, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.
The Pi's Performance
The model of the device I'm using in this blog, 3 Model B Rev 1.2, comes with a 4-core 64-bit 1.2 GHz ARMv7 CPU, 1 GB of memory, a built-in SD Card reader that the device can boot off of and a wireless adaptor. The device can be run in headless mode where instructions on setting up Wi-Fi and SSH access can be given by files placed on the Micro SD card the device is booted from.
The CPU and memory don't tell the full picture of how fast the system can interact with storage and networking. To start, the USB ports, which are handy for plugging in additional storage, share a bus, and its bandwidth, with the Wi-Fi adaptor.
$ lsusb -t
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc_otg/1p, 480M
|__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/5p, 480M
|__ Port 1: Dev 3, If 0, Class=Vendor Specific Class, Driver=smsc95xx, 480M
The USB ports support interface version 2.0 which has a max throughput of ~53 MB/s. I plugged a 2.5" Western Digital Black 500GB 7200 RPM disk drive in via a SATA 3 to USB converter and found I could reach a little shy of 47 MB/s.
$ sudo dd if=/dev/zero of=/mnt/usb/test bs=500K count=1024
524288000 bytes (524 MB, 500 MiB) copied, 11.2016 s, 46.8 MB/s
The various SD Micro cards I used for boot and storage drives have been benchmarked to each speeds of up to 78 MB/s but when I benchmarked them plugged into the built-in card reader on the Raspberry Pi I couldn't write much past the 10 MB/s mark and reads were under 25 MB/s for the most part.
$ dd if=/dev/zero of=~/test bs=500K count=1024
524288000 bytes (524 MB, 500 MiB) copied, 62.7941 s, 8.3 MB/s
$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ dd if=~/test of=/dev/null bs=500K count=1024
524288000 bytes (524 MB, 500 MiB) copied, 23.6529 s, 22.2 MB/s
The good thing about the Hadoop ecosystem is that I can horizontally scale workloads amongst many machines and jobs will process at the aggregated rate of each the machines in the cluster. Adding more Raspberry Pis and storage devices to a cluster should speed up query times.
A Raspberry Pi Cluster
I will be using three Raspberry Pi 3 Model Bs in this benchmark. They're named r1, r2 and r3.
The r1 device has a 128 GB Sandisk MicroSD card that it'll use for the OS, applications and HDFS storage. I found I couldn't reach more than 3.2 MB/s when transferring data via SSH to the Raspberry Pi so I've attached a 2.5" Western Digital Black 500GB 7200 RPM disk drive in via a SATA 3 to USB converter that holds a copy of the taxi trips dataset.
The r2 device has a 32 GB Micro SD card plugged into the on-board slot and another 64 GB Sandisk Micro SD card plugged in via a USB adaptor. The dataset is 86 GB in ORC format and each Raspberry Pi will have their own copy of the dataset to improve data locality when querying data. The data is stored on an HDFS cluster spread across the three Raspberry Pis and HDFS supports using more than one storage device on each machine. Using both the 32 GB and the 64 GB Micro SD cards means 96 GB of pre-formatted capacity will be available for the OS, Hadoop and HDFS storage which should be just enough for everything to fit.
The r3 device also has a 32 GB Micro SD card plugged into the on-board slot and another 64 GB Sandisk Micro SD card plugged in via a USB adaptor.
I used a USB-powered fan to cool the machines. I've posted a photo of the cluster on Twitter.
Bootstrapping the Pis
On each Micro SD card I use to boot each of the Raspberry Pis I've used Etcher to write an image of the August 16th release of Raspbian Stretch Lite, a minimalist Linux distribution for the Raspberry Pi which is based on Debian Stretch. The image is 348 MB compressed and decompresses into a 1.8 GB image. Etcher should work well on Windows, Linux and MacOS.
Once the image is on each of the MicroSD cards I enable SSH access by creating an empty ssh file on the boot partition.
$ touch /Volumes/boot/ssh
I also include Wi-Fi connection details which will be picked up and used during boot-up.
$ vi /Volumes/boot/wpa_supplicant.conf
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
network={
ssid="<network name>"
psk="<password>"
}
About 6 seconds after I plugged the power cable into each Raspberry Pi I could see devices beginning with the Ethernet mac address prefix of B8:27:EB appear in my Router's ARP table. This prefix is owned by the Raspberry Pi Foundation which makes it easy to distinguish from other devices on the network. The ARP listing is how I found the IP addresses of the Raspberry Pis.
By default the username to SSH in is pi and the password is raspberry. When you SSH in you'll be given the following message:
SSH is enabled and the default password for the 'pi' user has not been changed.
This is a security risk - please login as the 'pi' user and type 'passwd' to set a new password.
This will cause issues with rsync where you'll get errors like the following:
protocol version mismatch -- is your shell clean?
(see the rsync man page for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(178) [sender=3.1.2]
So if you don't change the default password you'll need to remove the warning script to stop that message disturbing rsync.
sudo rm /etc/profile.d/sshpwd.sh
Installing HDFS, Hive & Spark
This is the list of prerequisite packages I installed. iotop and nethogs are for telemetry and are optional. mysql-server is only used on r1 and you'll save yourself some memory and CPU cycles by not installing it on r2 and r3. The Micro SD cards and the 500 GB disk drive plugged in via the USB ports are formatted with the exfat file system which isn't supported out of the box with Raspbian so exfat-fuse and exfat-utils are needed in order to interact with them.
sudo apt update
sudo apt install \
exfat-fuse \
exfat-utils \
iotop \
mysql-server \
nethogs \
oracle-java8-jdk
On r1 I'll create a user for Hive in MySQL / MariaDB.
$ sudo su
$ mysql -uroot
CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
GRANT ALL PRIVILEGES ON *.* TO 'hive'@'localhost';
FLUSH PRIVILEGES;
$ exit
For various reasons Hadoop needs to refer to other nodes in the cluster by hostname so I'll add them to the hosts file on all three devices.
$ sudo vi /etc/hosts
192.168.0.16 r1
192.168.0.22 r2
192.168.0.25 r3
To ease memory pressure I'll expand the 100 MB SWAP file to 2,000 MB by changing the CONF_SWAPSIZE setting in /etc/dphys-swapfile on all three devices as well.
$ sudo vi /etc/dphys-swapfile
CONF_SWAPSIZE=2000
I'll then restart each of the devices so they'll pick up that SWAP file change.
$ sudo reboot
By default Hadoop uses the root account to SSH onto each of the nodes in the cluster. I'll create SSH keys to make sure this is a password-less process. On r1 I'll generate a new key pair and add it to the authorized keys of r1's root account.
$ sudo su
$ ssh-keygen
$ cp /root/.ssh/id_rsa.pub \
/root/.ssh/authorized_keys
I don't have an SSH password for the root account on r2 or r3 so I'll copy it to the authorized_keys file for the pi user on those devices.
$ ssh-copy-id pi@r2
$ ssh-copy-id pi@r3
$ exit
Then on r2 and r3 I'll bootstrap the .ssh folder for the root accounts on those machines and copy the authorized_keys file from the pi user's .ssh folder so the root user can accept it as well.
$ sudo su
$ ssh-keygen
$ cp /home/pi/.ssh/authorized_keys \
/root/.ssh/authorized_keys
$ exit
There are settings that will be used by HDFS, Hive and Spark and by both the root and the pi user accounts. To centralise these settings I've stored them in /etc/profile and created a symbolic link from /root/.bashrc to this file as well. That way all users will have the same settings and they can be centrally managed on each device.
$ sudo vi /etc/profile
if [ "${PS1-}" ]; then
if [ "${BASH-}" ] && [ "$BASH" != "/bin/sh" ]; then
# The file bash.bashrc already sets the default PS1.
# PS1='\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
else
if [ "`id -u`" -eq 0 ]; then
PS1='# '
else
PS1='$ '
fi
fi
fi
if [ -d /etc/profile.d ]; then
for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
. $i
fi
done
unset i
fi
export HADOOP_HOME=/opt/hadoop
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:/opt/hive/bin:/opt/spark/bin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export SPARK_HOME=/opt/spark
export SPARK_CONF_DIR=/opt/spark/conf
export SPARK_MASTER_HOST=r1
export JAVA_HOME=/usr/lib/jvm/jdk-8-oracle-arm32-vfp-hflt/jre
$ sudo ln -sf /etc/profile \
/root/.bashrc
$ source /etc/profile
On r1 I'll create the folders used by the various Hadoop tools used in this benchmark.
$ sudo mkdir -p /opt/{hadoop,hdfs/{datanode,namenode},hive,spark}
The 500 GB hard drive is connected and represented by /dev/sda1. I'll mount it to /mnt/usb.
$ sudo mkdir -p /mnt/usb
$ sudo mount /dev/sda1 /mnt/usb
On r2 and r3 I'll mount the Micro SD cards that are plugged in via USB adaptors (not the built-in adaptor, that card is already mounted).
$ sudo mkdir -p /mnt/usb
$ sudo mount /dev/sda1 /mnt/usb
I'll then create the application folders and the two data node folders HDFS will use for heterogeneous storage.
$ sudo mkdir -p /opt/{hadoop,hdfs/datanode,spark},/mnt/usb/hdfs/datanode
Next I'll download the three Hadoop software packages onto r1.
$ DIST=http://www-eu.apache.org/dist
$ wget -c -O hadoop.tar.gz $DIST/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
$ wget -c -O hive.tar.gz $DIST/hive/hive-2.3.0/apache-hive-2.3.0-bin.tar.gz
$ wget -c -O spark.tgz $DIST/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Hadoop is 405 MB in size when compressed, Hive is 221 MB and Spark is 194 MB. Hadoop expands to 2.1 GB but 1.9 GB of that is documentation so I'll exclude the docs from the extraction.
$ sudo tar xvf hadoop.tar.gz \
--directory=/opt/hadoop \
--exclude=hadoop-2.8.1/share/doc \
--strip 1
Hive is 172 MB decompressed but 102 MB of that is unit tests so I'll exclude those from extraction.
$ sudo tar xvf hive.tar.gz \
--directory=/opt/hive \
--exclude=apache-hive-2.3.0-bin/ql/src/test \
--strip 1
The following will extract Spark to its installation folder.
$ sudo tar xzvf spark.tgz \
--directory=/opt/spark \
--strip 1
I'll specify the master and slaves for the HDFS cluster. r1 will serve as both a master and a slave so that all the Raspberry Pis will be busy when processing workloads.
$ sudo vi /opt/hadoop/etc/hadoop/master
r1
$ sudo vi /opt/hadoop/etc/hadoop/slaves
r1
r2
r3
I'll then create two files with configuration overrides needed for this HDFS cluster. I'll be setting a default replication factor of 3 for all the files stored on HDFS so that they're copied onto each machine in full. There are multiple storage folders used on r2 and r3 and to avoid filling the Micro SD card used by both HDFS and the OS I've set a limit of 3 GB that must be available before HDFS writes any blocks to a partition.
$ sudo vi /opt/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://r1:9000/</value>
</property>
<property>
<name>fs.default.FS</name>
<value>hdfs://r1:9000/</value>
</property>
</configuration>
$ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hdfs/datanode</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hdfs/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>r1:50070</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>3221225472</value>
</property>
</configuration>
I'll then sync Hadoop's binaries and configuration onto the other two Raspberry Pis.
$ for SERVER in r2 r3
do
sudo rsync --archive \
--one-file-system \
--partial \
--progress \
--compress \
/opt/hadoop/ $SERVER:/opt/hadoop/
done
On r2 and r3 I'll adjust the HDFS configuration to include both storage folders.
$ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.datanode.data.dir</name>
<value>/mnt/usb/hdfs/datanode,/opt/hdfs/datanode</value>
<final>true</final>
</property>
At this point I'll need to load an interactive root shell in order to run three commands.
$ sudo su
The first command will format the HDFS name node.
$ hdfs namenode -format
The next will launch HDFS across the whole cluster. This command will SSH as the root user into each device.
$ start-dfs.sh
The third command sets permissive access for the pi user on HDFS.
$ hdfs dfs -chown pi /
Once that's all done I can check the capacity available across the cluster. The first line of output is the aggregate of each of the devices. The remaining three lines are the amount of capacity on each respective device.
$ hdfs dfsadmin -report | grep 'Configured Capacity'
Configured Capacity: 314337058816 (292.75 GB)
Configured Capacity: 125850886144 (117.21 GB)
Configured Capacity: 94243086336 (87.77 GB)
Configured Capacity: 94243086336 (87.77 GB)
The dataset is 84.5 GB when stored as ORC files so there is just enough space for this data set on each node in the cluster.
$ exit
Hive will represent the ORC files on HDFS as a table that Spark can query using SQL. The following will configure Hive to use MySQL / MariaDB to store its metadata. This only needs to happen on r1.
$ sudo vi /opt/hive/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>True</value>
</property>
</configuration>
I'll then download the MySQL / MariaDB connector for Hive to use.
$ sudo wget -c http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.28/mysql-connector-java-5.1.28.jar \
-P /opt/hive/lib/
I'll then initialise the schema and launch the Hive Metastore.
$ schematool -dbType mysql -initSchema
$ hive --service metastore &
Spark will need to know of Hive's configuration settings so I'll link the configuration file into Spark's configuration folder.
$ sudo ln -s /opt/hive/conf/hive-site.xml \
/opt/spark/conf/hive-site.xml
Spark too will also need to use the same MySQL / MariaDB connector.
$ sudo ln -s /opt/hive/lib/mysql-connector-java-5.1.28.jar \
/opt/spark/jars/mysql-connector-java-5.1.28.jar
When you launch pyspark, spark-submit or spark-sql the Spark libraries from the master node are copied onto HDFS and shared amongst the worker nodes. Reading 200 MB off of the Micro SD card every time one of these applications launches adds a lot of delay so I'll package up these libraries, upload them to HDFS and in the Spark configuration I'll make sure the cached jar of libraries is used instead.
$ jar cv0f ~/spark-libs.jar -C /opt/spark/jars/ .
$ hdfs dfs -mkdir /spark-libs
$ hdfs dfs -put ~/spark-libs.jar /spark-libs/
$ sudo vi /opt/spark/conf/spark-defaults.conf
spark.master spark://r1:7077
spark.yarn.preserve.staging.files true
spark.yarn.archive hdfs:///spark-libs/spark-libs.jar
I found a 650 MB memory limit on the various Spark components allowed everything to work without complaining.
$ sudo vi /opt/spark/conf/spark-env.sh
SPARK_EXECUTOR_MEMORY=650m
SPARK_DRIVER_MEMORY=650m
SPARK_WORKER_MEMORY=650m
SPARK_DAEMON_MEMORY=650m
Spark jobs will run on all three Raspberry Pis.
$ sudo vi /opt/spark/conf/slaves
r1
r2
r3
With that done I'll distribute Spark and its configuration to the other nodes.
$ for SERVER in r2 r3
do
sudo rsync --archive \
--one-file-system \
--partial \
--progress \
--compress \
--exclude /opt/spark/logs \
/opt/spark/ $SERVER:/opt/spark/
done
To save memory I didn't launch Spark until after I have populated all the data onto HDFS but it makes sense to mention the launch commands here. They are as follows:
$ sudo /opt/spark/sbin/start-master.sh
$ sudo /opt/spark/sbin/start-slaves.sh
Loading 1.1 Billion Trips onto HDFS
The original dataset of 1.1 billion records is ~500 GB in uncompressed CSV format. I compressed those files into 56 gzip files which take up around 104 GB of space. Spark queries ORC-formatted data very well and ORC uses zlib to compress data by column bringing the dataset down to 84.5 GB. For the task of converting the dataset into ORC I followed the instructions from my 1.1 Billion Taxi Rides on AWS EMR 5.3.0 & Spark 2.1.0 benchmark. AWS EMR saved the ORC files onto S3 and I then downloaded them onto the 500 GB hard drive.
The Wi-Fi speeds on the Raspberry Pis were never as quick as connecting storage to the devices. HDFS still needs to replicate data via Wi-Fi but I could at least get the data onto r1 quickly and then let the fault-tolerant process of replication take its time.
The following mounted the 500 GB drive onto r1 and copied the data stored in the orcs folder onto HDFS.
$ sudo mkdir -p /mnt/usb
$ sudo mount /dev/sda1 /mnt/usb
$ hdfs dfs -copyFromLocal /mnt/usb/orcs/* /trips/
The data took around 4 hours to copy onto r1. During that time r1 was replicating the data onto r2 and r3. The replication between machines never broke past 1.1 MB/s. There were a few occasions where one of the Raspberry Pis would lose connectivity and I'd have to restart the device and then the HDFS cluster again to continue the replication process. I spent 1.5 days watching the replication process.
At one point I could see the primary drives on two of the Raspberry Pis had been completely filled. The 3 GB minimum reserve of free space limit I had set looks to have been ignored. The following is the disk space reports from all three Raspberry Pis.
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 118G 91G 22G 81% /
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 29G 28G 0 100% /
/dev/sda1 60G 57G 3.1G 95% /mnt/usb
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/root 29G 28G 8.4M 100% /
/dev/sda1 60G 57G 3.1G 95% /mnt/usb
I checked the /trips/ folder on HDFS and it said all but 53 blocks of the total 704 had been fully replicated. This means when I'm querying the full dataset devices may need to ask one another for parts of the dataset. I wasn't sure if I could get the remaining 53 blocks to replicate properly so I accepted this and pushed on with the benchmark.
$ hdfs fsck /trips/
...
......Status: HEALTHY
Total size: 90808380700 B
Total dirs: 1
Total files: 56
Total symlinks: 0
Total blocks (validated): 704 (avg. block size 128989177 B)
Minimally replicated blocks: 704 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 53 (7.528409 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 2.8551137
Corrupt blocks: 0
Missing replicas: 102 (4.8295455 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Sun Sep 17 12:24:03 UTC 2017 in 45 milliseconds
The filesystem under path '/trips' is HEALTHY
With the data on HDFS, albeit not completely replicated, I created the database table that would represent the ORC files to Spark.
$ hive
CREATE EXTERNAL TABLE trips_orc (
trip_id INT,
vendor_id STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
store_and_fwd_flag STRING,
rate_code_id SMALLINT,
pickup_longitude DOUBLE,
pickup_latitude DOUBLE,
dropoff_longitude DOUBLE,
dropoff_latitude DOUBLE,
passenger_count SMALLINT,
trip_distance DOUBLE,
fare_amount DOUBLE,
extra DOUBLE,
mta_tax DOUBLE,
tip_amount DOUBLE,
tolls_amount DOUBLE,
ehail_fee DOUBLE,
improvement_surcharge DOUBLE,
total_amount DOUBLE,
payment_type STRING,
trip_type SMALLINT,
pickup STRING,
dropoff STRING,
cab_type STRING,
precipitation SMALLINT,
snow_depth SMALLINT,
snowfall SMALLINT,
max_temperature SMALLINT,
min_temperature SMALLINT,
average_wind_speed SMALLINT,
pickup_nyct2010_gid SMALLINT,
pickup_ctlabel STRING,
pickup_borocode SMALLINT,
pickup_boroname STRING,
pickup_ct2010 STRING,
pickup_boroct2010 STRING,
pickup_cdeligibil STRING,
pickup_ntacode STRING,
pickup_ntaname STRING,
pickup_puma STRING,
dropoff_nyct2010_gid SMALLINT,
dropoff_ctlabel STRING,
dropoff_borocode SMALLINT,
dropoff_boroname STRING,
dropoff_ct2010 STRING,
dropoff_boroct2010 STRING,
dropoff_cdeligibil STRING,
dropoff_ntacode STRING,
dropoff_ntaname STRING,
dropoff_puma STRING
) STORED AS orc
LOCATION '/trips/';
Benchmarking Spark
The times quoted below are the lowest query times seen during a series of runs. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed".
$ spark-sql \
--master spark://r1:7077 \
--num-executors 3
The following completed in 18 minutes and 23.73 seconds.
SELECT cab_type,
count(*)
FROM trips_orc
GROUP BY cab_type;
The following completed in 19 minutes and 58.59 seconds.
SELECT passenger_count,
avg(total_amount)
FROM trips_orc
GROUP BY passenger_count;
The following completed in 37 minutes and 58.121 seconds.
SELECT passenger_count,
year(pickup_datetime),
count(*)
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime);
The following completed in 1 hour, 47 minutes and 25.98 seconds.
SELECT passenger_count,
year(pickup_datetime) trip_year,
round(trip_distance),
count(*) trips
FROM trips_orc
GROUP BY passenger_count,
year(pickup_datetime),
round(trip_distance)
ORDER BY trip_year,
trips desc;
Telemetry from the Benchmark
I captured a few pieces of telemetry while the queries were running that I found interesting.
The load averages across the cluster show the CPUs were very busy and most likely a bottleneck through much of the workloads.
3.78, 1.89, 0.97
3.91, 1.50, 0.59
4.02, 1.53, 0.61
For the most part there was near-radio silence between the devices. Interconnectivity wasn't the bottleneck I feared the lack of replication would make it.
PID USER PROGRAM DEV SENT RECEIVED
5574 pi ..sr/lib/jvm/jdk-8-or wlan0 4.886 6.446 KB/sec
25771 root ..sr/lib/jvm/jdk-8-or wlan0 4.927 10.085 KB/sec
25066 root ..sr/lib/jvm/jdk-8-or wlan0 2.890 9.026 KB/sec
Memory pressures were there but it wasn't as bad on r2 and r3 as it was on r1.
KiB Mem : 949572 total, 31156 free, 719220 used, 199196 buff/cache
KiB Swap: 2047996 total, 1370452 free, 677544 used. 178268 avail Mem
KiB Mem : 949572 total, 29236 free, 455132 used, 465204 buff/cache
KiB Swap: 2047996 total, 1983956 free, 64040 used. 426000 avail Mem
KiB Mem : 949572 total, 28772 free, 465396 used, 455404 buff/cache
KiB Swap: 2047996 total, 1998696 free, 49300 used. 415204 avail Mem
Between reading and decompressing ORC files lies the main bottlenecks. Still, I'm impressed these cheap machines could do the job.
Total DISK READ : 5.87 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 12.22 M/s | Actual DISK WRITE: 1551.84 K/s
TID PRIO USER DISK READ> DISK WRITE SWAPIN IO COMMAND
5590 be/4 pi 433.00 M 100.00 K 12.19 % 0.00 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5757 be/4 pi 32.13 M 4.00 K 1.07 % 0.00 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5594 be/4 pi 26.69 M 12.00 K 1.59 % 0.01 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5761 be/4 pi 21.22 M 0.00 B 1.20 % 0.00 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5755 be/4 pi 19.81 M 0.00 B 1.03 % 0.00 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5679 be/4 pi 18.87 M 0.00 B 0.69 % 0.01 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5708 be/4 pi 16.49 M 0.00 B 0.44 % 0.01 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
5677 be/4 pi 16.01 M 0.00 B 0.67 % 0.04 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~r.SparkSQLCLIDriver --num-executors 3 spark-internal
Total DISK READ : 18.41 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 11.65 M/s | Actual DISK WRITE: 713.97 K/s
TID PRIO USER DISK READ> DISK WRITE SWAPIN IO COMMAND
606 be/4 root 5.05 G 524.00 K 0.00 % 26.04 % mount.exfat /dev/sda1 /mnt/usb -o rw,nonempty
25774 be/4 root 148.02 M 96.00 K 3.17 % 0.77 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
1231 be/4 root 6.48 M 0.00 B 0.00 % 24.02 % java -Dproc_datanode -Xmx1000m -Djava.library.path=~RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
25833 be/4 root 5.93 M 0.00 B 0.04 % 0.01 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
25614 be/4 root 5.63 M 24.00 K 0.21 % 0.00 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25779 be/4 root 5.25 M 24.00 K 4.06 % 0.61 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.22:43827
25618 be/4 root 4.75 M 0.00 B 1.71 % 0.02 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
Total DISK READ : 31.17 M/s | Total DISK WRITE : 27.10 K/s
Actual DISK READ: 25.37 M/s | Actual DISK WRITE: 304.85 K/s
TID PRIO USER DISK READ> DISK WRITE SWAPIN IO COMMAND
563 be/4 root 5.80 G 564.00 K 0.00 % 17.56 % mount.exfat /dev/sda1 /mnt/usb -o rw,nonempty
25069 be/4 root 200.09 M 132.00 K 4.73 % 2.72 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
25238 be/4 root 7.37 M 396.00 K 0.21 % 0.88 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
980 be/4 root 6.93 M 0.00 B 0.00 % 31.54 % java -Dproc_datanode -Xmx1000m -Djava.library.path=~RFAS org.apache.hadoop.hdfs.server.datanode.DataNode
24914 be/4 root 6.26 M 4.00 K 1.95 % 0.15 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25078 be/4 root 5.35 M 24.00 K 2.69 % 1.48 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
24908 be/4 root 5.00 M 24.00 K 1.67 % 0.07 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~ploy.worker.Worker --webui-port 8081 spark://r1:7077
25127 be/4 root 4.52 M 0.00 B 0.11 % 0.01 % java -cp /opt/spark/conf/:/opt/spark/jars/*:/opt/ha~-0000 --worker-url spark://Worker@192.168.0.25:44851
Closing Thoughts
I spend 1-2 weeks a month living out of a suitcase so carrying a few motherboards and power supplies around isn't practical. When I started this project I had three Raspberry Pis shipped overnight to an Amazon drop off point next to one of my client's offices. I was able to flash the Micro SD cards off a MacBook Pro I take with me when working abroad. I got the devices to connect to a Wi-Fi hotspot I ran off of my Samsung Galaxy S8 phone in my hotel room within an hour of unpacking everything. This all felt like a very convenient and portable way to explore both these small devices and revisit Hadoop on a minimalist hardware setup.
I'll admit that the slow network connectivity and slow I/O had me making many diversions along this journey. I'm not sure if most people interested in learning about Hadoop will have the patience to deal with these limitations. If you want a more practical learning experience I'd suggest trying out Amazon EMR as so much is already setup when you launch a cluster. If you do want to use your own hardware at home I'd suggest anything from the "Modest" tier upward on Logical Increments as a shopping list of parts. You should use multiple computers as distributing workloads and horizontal scalability are Hadoop's main selling points. Make sure to find a motherboard with built-in HDMI so you can save money on graphics cards.