The Hadoop Distributed File System (HDFS) allows you to both federate storage across many computers and distribute files in a redundant manner across a cluster. HDFS is a key component of many storage clusters that possess more than a petabyte of capacity.
Each computer acting as a storage node in a cluster can contain one or more storage devices. This allows several mechanical storage drives to store data more reliably than SSDs, keep the cost per gigabyte down and go some way towards exhausting the SATA bus capacity of a given system.
Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many that interact with HDFS directly it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I'll walk through a few of these offerings.
If you'd like to set up an HDFS environment locally please see my Hadoop 3 Single-Node Install Guide (skip the steps for Presto and Spark). I also have posts that cover working with HDFS on AWS EMR and Google Dataproc.
The Apache Hadoop HDFS Client
The Apache Hadoop HDFS client is the most well-rounded HDFS CLI implementation. Virtually any API endpoint that has been built into HDFS can be interacted with using this tool.
For the release of Hadoop 3, considerable effort was put into reorganising the arguments of this tool. This is what they look like as of this writing.
$ hdfs
Admin Commands:
cacheadmin configure the HDFS cache
crypto configure HDFS encryption zones
debug run a Debug Admin to execute HDFS debug commands
dfsadmin run a DFS admin client
dfsrouteradmin manage Router-based federation
ec run a HDFS ErasureCoding CLI
fsck run a DFS filesystem checking utility
haadmin run a DFS HA admin client
jmxget get JMX exported values from NameNode or DataNode.
oev apply the offline edits viewer to an edits file
oiv apply the offline fsimage viewer to an fsimage
oiv_legacy apply the offline fsimage viewer to a legacy fsimage
storagepolicies list/get/set block storage policies
Client Commands:
classpath prints the class path needed to get the hadoop jar and the required libraries
dfs run a filesystem command on the file system
envvars display computed Hadoop environment variables
fetchdt fetch a delegation token from the NameNode
getconf get config values from configuration
groups get the groups which users belong to
lsSnapshottableDir list all snapshottable dirs owned by the current user
snapshotDiff diff two snapshots of a directory or diff the current directory contents with a snapshot
version print the version
Daemon Commands:
balancer run a cluster balancing utility
datanode run a DFS datanode
dfsrouter run the DFS router
diskbalancer Distributes data evenly among disks on a given node
httpfs run HttpFS server, the HDFS HTTP Gateway
journalnode run the DFS journalnode
mover run a utility to move block replicas across storage types
namenode run the DFS namenode
nfs3 run an NFS version 3 gateway
portmap run a portmap service
secondarynamenode run the DFS secondary namenode
zkfc run the ZK Failover Controller daemon
The bulk of the disk access verbs most people familiar with Linux will recognise are kept under the dfs argument.
$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Notice how the top usage line doesn't mention hdfs dfs but instead hadoop fs. You'll find that either prefix will provide the same functionality if you're working with HDFS as an endpoint.
$ hdfs dfs -ls /
$ hadoop fs -ls /
A Golang-based HDFS Client
In 2014, Colin Marc started work on a Golang-based HDFS client. This tool has two major features that stand out to me. The first is that there is no JVM overhead, so execution begins much more quickly than with the JVM-based client. The second is that its arguments align more closely with GNU Core Utilities commands like ls and cat. This isn't a drop-in replacement for the JVM-based client but it should be a lot more intuitive for those already familiar with GNU Core Utilities file system commands.
The following will install the client on a fresh Ubuntu 16.04.2 LTS system.
$ wget -c -O gohdfs.tar.gz \
https://github.com/colinmarc/hdfs/releases/download/v2.0.0/gohdfs-v2.0.0-linux-amd64.tar.gz
$ tar xvf gohdfs.tar.gz
$ gohdfs-v2.0.0-linux-amd64/hdfs
The release also includes a bash completion script. This is handy for being able to hit tab and get a list of commands or to complete a partially typed-out list of arguments.
I'll include the extracted folder name below to help differentiate this tool from the Apache HDFS CLI.
$ gohdfs-v2.0.0-linux-amd64/hdfs
Valid commands:
ls [-lah] [FILE]...
rm [-rf] FILE...
mv [-nT] SOURCE... DEST
mkdir [-p] FILE...
touch [-amc] FILE...
chmod [-R] OCTAL-MODE FILE...
chown [-R] OWNER[:GROUP] FILE...
cat SOURCE...
head [-n LINES | -c BYTES] SOURCE...
tail [-n LINES | -c BYTES] SOURCE...
du [-sh] FILE...
checksum FILE...
get SOURCE [DEST]
getmerge SOURCE DEST
put SOURCE DEST
df [-h]
As you can see, prefixing many GNU Core Utilities file system commands with the HDFS client will produce the expected functionality on HDFS.
$ gohdfs-v2.0.0-linux-amd64/hdfs df -h
Filesystem Size Used Available Use%
11.7G 24.0K 7.3G 0%
The GitHub homepage for this project shows how listing files can be two orders of magnitude quicker using this tool versus the JVM-based CLI.
This start up speed improvement can be handy if HDFS commands are being invoked a lot. The ideal file size of an ORC or Parquet file for most purposes is somewhere between 256 MB and 2 GB and it's not uncommon to see these being micro-batched into HDFS as they're being generated.
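To illustrate the start-up gap, here's a rough timing sketch of my own (not from either project's documentation) that shells out to both CLIs and measures wall-clock time. It assumes the binary paths used elsewhere in this post.
$ python
import subprocess
import time

# Time a single directory listing with each client. Most of the JVM client's
# time goes on start-up rather than the listing itself.
for cmd in (['hadoop', 'fs', '-ls', '/'],
            ['gohdfs-v2.0.0-linux-amd64/hdfs', 'ls', '/']):
    start = time.time()
    subprocess.check_output(cmd)
    print cmd[0], round(time.time() - start, 2), 'seconds'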
Below I'll generate a file containing a gigabyte of random data.
$ cat /dev/urandom \
| head -c 1073741824 \
> one_gig
Uploading this file via the JVM-based CLI took 18.6 seconds on my test rig.
$ hadoop fs -put one_gig /one_gig
Uploading via the Golang-based CLI took 13.2 seconds.
$ gohdfs-v2.0.0-linux-amd64/hdfs put one_gig /one_gig_2
Spotify's Python-based HDFS Client
In 2014 work began at Spotify on a Python-based HDFS CLI and library called Snakebite. The bulk of the commits on this project were put together by Wouter de Bie and Rafal Wojdyla. If you don't require Kerberos support then the only requirements for this client are the Protocol Buffers Python library from Google and Python 2.7. As of this writing Python 3 isn't supported.
The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.
$ sudo apt install \
python \
python-pip \
virtualenv
$ virtualenv .snakebite
$ source .snakebite/bin/activate
$ pip install snakebite
This client is not a drop-in replacement for the JVM-based CLI but shouldn't have a steep learning curve if you're already familiar with GNU Core Utilities file system commands.
$ snakebite
snakebite [general options] cmd [arguments]
general options:
-D --debug Show debug information
-V --version Hadoop protocol version (default:9)
-h --help show help
-j --json JSON output
-n --namenode namenode host
-p --port namenode RPC port (default: 8020)
-v --ver Display snakebite version
commands:
cat [paths] copy source paths to stdout
chgrp <grp> [paths] change group
chmod <mode> [paths] change file mode (octal)
chown <owner:grp> [paths] change owner
copyToLocal [paths] dst copy paths to local file system destination
count [paths] display stats for paths
df display fs stats
du [paths] display disk usage statistics
get file dst copy files to local file system destination
getmerge dir dst concatenates files in source dir into destination local file
ls [paths] list a path
mkdir [paths] create directories
mkdirp [paths] create directories and their parents
mv [paths] dst move paths to destination
rm [paths] remove paths
rmdir [dirs] delete a directory
serverdefaults show server information
setrep <rep> [paths] set replication factor
stat [paths] stat information
tail path display last kilobyte of the file to stdout
test path test a path
text path [paths] output file in text format
touchz [paths] creates a file of zero length
usage <cmd> show cmd usage
The client is missing certain verbs that can be found in both the JVM-based client and the Golang-based client described above, one of which is the ability to copy files and streams to HDFS.
That being said, I do appreciate how easy it is to pull statistics for a given file.
$ snakebite stat /one_gig
access_time 1539530885694
block_replication 1
blocksize 134217728
file_type f
group supergroup
length 1073741824
modification_time 1539530962824
owner mark
path /one_gig
permission 0644
To collect the same information with the JVM client would involve several commands. Their output would also be harder to parse than the key-value pairs above.
As well as being a CLI tool, Snakebite is also a Python library.
$ python
from snakebite.client import Client
client = Client("localhost", 9000, use_trash=False)
[x for x in client.ls(['/'])][:2]
[{'access_time': 1539530885694L,
'block_replication': 1,
'blocksize': 134217728L,
'file_type': 'f',
'group': u'supergroup',
'length': 1073741824L,
'modification_time': 1539530962824L,
'owner': u'mark',
'path': '/one_gig',
'permission': 420},
{'access_time': 1539531288719L,
'block_replication': 1,
'blocksize': 134217728L,
'file_type': 'f',
'group': u'supergroup',
'length': 1073741824L,
'modification_time': 1539531307264L,
'owner': u'mark',
'path': '/one_gig_2',
'permission': 420}]
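The library covers more than ls. Below is a short sketch of my own, continuing the same session, of what I believe the df and du methods return; as with ls, the results come back as plain dictionaries (du yields them lazily as a generator).
# Cluster-wide capacity figures, comparable to the CLI's df verb.
client.df()

# Per-path usage; du() returns a generator of dictionaries.
[entry for entry in client.du(['/'])][:2]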
Note I've asked to connect to localhost on TCP port 9000. Out of the box Hadoop uses TCP port 8020 for the NameNode RPC endpoint. I've often changed this to TCP port 9000 in many of my Hadoop guides.
You can find the hostname and port number configured for this endpoint on the master HDFS node. Also note that, for various reasons, HDFS, and Hadoop in general, needs to use hostnames rather than IP addresses.
$ sudo vi /opt/hadoop/etc/hadoop/core-site.xml
...
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
...
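If you'd rather not hard-code the host and port, Snakebite also ships an AutoConfigClient which, as I understand it, reads the NameNode address from the Hadoop configuration files pointed to by HADOOP_HOME or HADOOP_CONF_DIR. A minimal sketch:
$ python
from snakebite.client import AutoConfigClient

# Picks up the fs.default.name / fs.defaultFS setting from the local Hadoop config.
client = AutoConfigClient()
[x['path'] for x in client.ls(['/'])][:2]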
Python-based HdfsCLI
In 2014, Matthieu Monsch also began work on a Python-based HDFS client called HdfsCLI. Two features this client has over Spotify's Python client are support for uploading to HDFS and support for Python 3 (in addition to 2.7).
Matthieu has previously worked at LinkedIn and now works for Google. The coding style of this project will feel very familiar to anyone that's looked at a Python project that has originated from Google.
This library includes support for a progress tracker, a fast AVRO library, Kerberos and Pandas DataFrames.
The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.
$ sudo apt install \
python \
python-pip \
virtualenv
$ virtualenv .hdfscli
$ source .hdfscli/bin/activate
$ pip install 'hdfs[dataframe,avro]'
In order for this library to communicate with HDFS, WebHDFS needs to be enabled on the master HDFS node.
$ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hdfs/datanode</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hdfs/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:50070</value>
</property>
</configuration>
With the configuration in place the DFS service needs to be restarted.
$ sudo su
$ /opt/hadoop/sbin/stop-dfs.sh
$ /opt/hadoop/sbin/start-dfs.sh
$ exit
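Before moving on, it's worth checking that WebHDFS is actually responding. The snippet below is a quick sanity check of my own against the standard WebHDFS REST endpoint; the host and port match the dfs.namenode.http-address value above.
$ python
import json
from urllib2 import urlopen  # Python 2, matching the rest of this post

# LISTSTATUS on the root path should return a JSON FileStatuses document.
resp = urlopen('http://localhost:50070/webhdfs/v1/?op=LISTSTATUS')
print json.loads(resp.read()).keys()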
A configuration file is needed to store connection settings for this client.
$ vi ~/.hdfscli.cfg
[global]
default.alias = dev
[dev.alias]
url = http://localhost:50070
user = mark
A big downside of the client is that only a very limited subset of HDFS functionality is supported. That being said, the verb arguments it does take are pretty self-explanatory.
$ hdfscli --help
HdfsCLI: a command line interface for HDFS.
Usage:
hdfscli [interactive] [-a ALIAS] [-v...]
hdfscli download [-fsa ALIAS] [-v...] [-t THREADS] HDFS_PATH LOCAL_PATH
hdfscli upload [-sa ALIAS] [-v...] [-A | -f] [-t THREADS] LOCAL_PATH HDFS_PATH
hdfscli -L | -V | -h
Commands:
download Download a file or folder from HDFS. If a
single file is downloaded, - can be
specified as LOCAL_PATH to stream it to
standard out.
interactive Start the client and expose it via the python
interpreter (using iPython if available).
upload Upload a file or folder to HDFS. - can be
specified as LOCAL_PATH to read from standard
in.
Arguments:
HDFS_PATH Remote HDFS path.
LOCAL_PATH Path to local file or directory.
Options:
-A --append Append data to an existing file. Only supported
if uploading a single file or from standard in.
-L --log Show path to current log file and exit.
-V --version Show version and exit.
-a ALIAS --alias=ALIAS Alias of namenode to connect to.
-f --force Allow overwriting any existing files.
-s --silent Don't display progress status.
-t THREADS --threads=THREADS Number of threads to use for parallelization.
0 allocates a thread per file. [default: 0]
-v --verbose Enable log output. Can be specified up to three
times (increasing verbosity each time).
Examples:
hdfscli -a prod /user/foo
hdfscli download features.avro dat/
hdfscli download logs/1987-03-23 - >>logs
hdfscli upload -f - data/weights.tsv <weights.tsv
HdfsCLI exits with return status 1 if an error occurred and 0 otherwise.
The following finished in 68 seconds. This is an order of magnitude slower than some other clients I'll explore in this post.
$ hdfscli upload \
-t 4 \
-f - \
one_gig_3 < one_gig
That said, the client is very easy to work with in Python.
$ python
from hashlib import sha256
from hdfs import Config
client = Config().get_client('dev')
[client.status(uri)
for uri in client.list('')][:2]
[{u'accessTime': 1539532953515,
u'blockSize': 134217728,
u'childrenNum': 0,
u'fileId': 16392,
u'group': u'supergroup',
u'length': 1073741824,
u'modificationTime': 1539533029897,
u'owner': u'mark',
u'pathSuffix': u'',
u'permission': u'644',
u'replication': 1,
u'storagePolicy': 0,
u'type': u'FILE'},
{u'accessTime': 1539533046246,
u'blockSize': 134217728,
u'childrenNum': 0,
u'fileId': 16393,
u'group': u'supergroup',
u'length': 1073741824,
u'modificationTime': 1539533114772,
u'owner': u'mark',
u'pathSuffix': u'',
u'permission': u'644',
u'replication': 1,
u'storagePolicy': 0,
u'type': u'FILE'}]
Below I'll generate a SHA-256 hash of a file located on HDFS.
with client.read('/user/mark/one_gig') as reader:
print sha256(reader.read()).hexdigest()[:6]
ab5f97
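Uploads work from the library as well, and this is where the progress tracking mentioned earlier comes in. The sketch below is my own and reuses the client object from the session above; as I understand the API, upload accepts a progress callback that's handed the file path and the number of bytes transferred so far (or -1 once that file completes). The destination path is just an example.
def report(path, nbytes):
    # nbytes is -1 once the transfer of that file has finished.
    print path, nbytes

client.upload('/user/mark/one_gig_from_python',
              '/home/mark/one_gig',
              overwrite=True,
              progress=report)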
Apache Arrow's HDFS Client
Apache Arrow is a cross-language platform for in-memory data headed by Wes McKinney. Its Python bindings, "PyArrow", allow Python applications to interface with a C++-based HDFS client.
Wes stands out in the data world. He has worked for Cloudera in the past, created the Pandas Python package and has been a contributor to the Apache Parquet project.
The following will install PyArrow on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.
$ sudo apt install \
python \
python-pip \
virtualenv
$ virtualenv .pyarrow
$ source .pyarrow/bin/activate
$ pip install pyarrow
The Python API behaves in a clean and intuitive manner.
$ python
from hashlib import sha256
import pyarrow as pa
hdfs = pa.hdfs.connect(host='localhost', port=9000)
with hdfs.open('/user/mark/one_gig', 'rb') as f:
print sha256(f.read()).hexdigest()[:6]
ab5f97
The interface is very performant as well. The following completed in 6 seconds. This is the fastest any client transferred this file on my test rig.
with hdfs.open('/user/mark/one_gig_4', 'wb') as f:
f.write(open('/home/mark/one_gig').read())
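Reading the entire gigabyte into memory before writing it out isn't strictly necessary. The variation below, reusing the hdfs connection from above, copies the file in fixed-size chunks; the 4 MB buffer size and the destination filename are arbitrary choices of mine.
with hdfs.open('/user/mark/one_gig_chunked', 'wb') as dst:
    with open('/home/mark/one_gig', 'rb') as src:
        while True:
            chunk = src.read(4 * 1024 * 1024)  # 4 MB at a time
            if not chunk:
                break
            dst.write(chunk)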
HDFS Fuse and GNU Core Utilities
For a newcomer, one of the most intuitive ways to interact with HDFS is probably via the GNU Core Utilities file system commands. These can be run against an HDFS mount exposed via FUSE.
The following will install an HDFS fuse client on a fresh Ubuntu 16.04.2 LTS system using a Debian package from Cloudera's repository.
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O - \
| sudo apt-key add -
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list -O - \
| sudo tee /etc/apt/sources.list.d/cloudera.list
$ sudo apt update
$ sudo apt install hadoop-hdfs-fuse
$ sudo mkdir -p hdfs_mount
$ sudo hadoop-fuse-dfs \
dfs://127.0.0.1:9000 \
hdfs_mount
The following completed in 8.2 seconds.
$ cp one_gig hdfs_mount/one_gig_5
Regular file system commands run as expected. The following shows how much space has been used and how much is available.
$ df -h hdfs_mount/
Filesystem Size Used Avail Use% Mounted on
fuse_dfs 12G 4.9G 6.8G 42% /home/mark/hdfs_mount
The following shows the total disk space used beneath the mount point.
$ du -hs hdfs_mount/
4.8G hdfs_mount/
This will give a file listing sorted by file size.
$ ls -lhS hdfs_mount/user/mark/
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:03 one_gig
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:05 one_gig_2
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:10 one_gig_3
...
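As a final sanity check, hashing a file through the mount should give the same digest seen via the Python clients earlier (the ab5f97 prefix). The path below assumes the mount point and file names used above.
$ python
from hashlib import sha256

# Read the file through the FUSE mount and hash it, just as was done over
# HdfsCLI and PyArrow earlier in this post.
with open('hdfs_mount/user/mark/one_gig', 'rb') as f:
    print sha256(f.read()).hexdigest()[:6]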