
Posted on Tue 16 October 2018


Working with the Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) allows you both to federate storage across many computers and to distribute files redundantly across a cluster. HDFS is a key component of many storage clusters that possess more than a petabyte of capacity.

Each computer acting as a storage node in a cluster can contain one or more storage devices. This allows several mechanical storage drives to store data more reliably than SSDs, keep the cost per gigabyte down and go some way towards saturating the SATA bus capacity of a given system.
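As a back-of-the-envelope sketch of what redundancy costs in capacity (the factor of three below is HDFS' default replication setting; your cluster may differ):

```python
# Usable HDFS capacity is raw capacity divided by the replication factor.
# dfs.replication defaults to 3; adjust to match your cluster.
def usable_capacity_tb(raw_tb, replication=3):
    return raw_tb / replication

# Twelve nodes with eight 10 TB drives each hold 960 TB raw,
# which triple replication reduces to 320 TB of usable storage.
print(usable_capacity_tb(12 * 8 * 10))  # 320.0
```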

Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many who interact with HDFS directly, it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I'll walk through a few of these offerings.

If you'd like to set up an HDFS environment locally, please see my Hadoop 3 Single-Node Install Guide (skip the steps for Presto and Spark). I also have posts that cover working with HDFS on AWS EMR and Google Dataproc.

The Apache Hadoop HDFS Client

The Apache Hadoop HDFS client is the most well-rounded HDFS CLI implementation. Virtually any API endpoint that has been built into HDFS can be interacted with using this tool.

For the release of Hadoop 3, considerable effort was put into reorganising the arguments of this tool. This is what they look like as of this writing.

$ hdfs
    Admin Commands:

cacheadmin           configure the HDFS cache
crypto               configure HDFS encryption zones
debug                run a Debug Admin to execute HDFS debug commands
dfsadmin             run a DFS admin client
dfsrouteradmin       manage Router-based federation
ec                   run a HDFS ErasureCoding CLI
fsck                 run a DFS filesystem checking utility
haadmin              run a DFS HA admin client
jmxget               get JMX exported values from NameNode or DataNode.
oev                  apply the offline edits viewer to an edits file
oiv                  apply the offline fsimage viewer to an fsimage
oiv_legacy           apply the offline fsimage viewer to a legacy fsimage
storagepolicies      list/get/set block storage policies

    Client Commands:

classpath            prints the class path needed to get the hadoop jar and the required libraries
dfs                  run a filesystem command on the file system
envvars              display computed Hadoop environment variables
fetchdt              fetch a delegation token from the NameNode
getconf              get config values from configuration
groups               get the groups which users belong to
lsSnapshottableDir   list all snapshottable dirs owned by the current user
snapshotDiff         diff two snapshots of a directory or diff the current directory contents with a snapshot
version              print the version

    Daemon Commands:

balancer             run a cluster balancing utility
datanode             run a DFS datanode
dfsrouter            run the DFS router
diskbalancer         Distributes data evenly among disks on a given node
httpfs               run HttpFS server, the HDFS HTTP Gateway
journalnode          run the DFS journalnode
mover                run a utility to move block replicas across storage types
namenode             run the DFS namenode
nfs3                 run an NFS version 3 gateway
portmap              run a portmap service
secondarynamenode    run the DFS secondary namenode
zkfc                 run the ZK Failover Controller daemon

The bulk of the disk access verbs most people familiar with Linux will recognise are kept under the dfs argument.

$ hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
        [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-v] [-x] <path> ...]
        [-expunge]
        [-find <path> ... <expression> ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
        [-head <file>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-truncate [-w] <length> <path> ...]
        [-usage [cmd ...]]

Notice how the top usage line doesn't mention hdfs dfs but instead hadoop fs. You'll find that either prefix will provide the same functionality if you're working with HDFS as an endpoint.

$ hdfs dfs -ls /
$ hadoop fs -ls /

A Golang-based HDFS Client

In 2014, Colin Marc started work on a Golang-based HDFS client. This tool has two major features that stand out to me. The first is that there is no JVM start-up overhead, so execution begins much more quickly than with the JVM-based client. The second is that its arguments align more closely with GNU Core Utilities commands like ls and cat. This isn't a drop-in replacement for the JVM-based client but it should be a lot more intuitive for those already familiar with GNU Core Utilities file system commands.

The following will install the client on a fresh Ubuntu 16.04.2 LTS system.

$ wget -c -O gohdfs.tar.gz \
    https://github.com/colinmarc/hdfs/releases/download/v2.0.0/gohdfs-v2.0.0-linux-amd64.tar.gz
$ tar xvf gohdfs.tar.gz
$ gohdfs-v2.0.0-linux-amd64/hdfs

The release also includes a bash completion script. This is handy for being able to hit tab and get a list of commands or to complete a partially typed-out list of arguments.

I'll include the extracted folder name below to help differentiate this tool from the Apache HDFS CLI.

$ gohdfs-v2.0.0-linux-amd64/hdfs
Valid commands:
  ls [-lah] [FILE]...
  rm [-rf] FILE...
  mv [-nT] SOURCE... DEST
  mkdir [-p] FILE...
  touch [-amc] FILE...
  chmod [-R] OCTAL-MODE FILE...
  chown [-R] OWNER[:GROUP] FILE...
  cat SOURCE...
  head [-n LINES | -c BYTES] SOURCE...
  tail [-n LINES | -c BYTES] SOURCE...
  du [-sh] FILE...
  checksum FILE...
  get SOURCE [DEST]
  getmerge SOURCE DEST
  put SOURCE DEST
  df [-h]

As you can see, prefixing many GNU Core Utilities file system commands with the HDFS client will produce the expected functionality on HDFS.

$ gohdfs-v2.0.0-linux-amd64/hdfs df -h
Filesystem  Size  Used Available  Use%
           11.7G 24.0K      7.3G 0%

The GitHub homepage for this project shows how listing files can be two orders of magnitude quicker using this tool versus the JVM-based CLI.

This start-up speed improvement can be handy if HDFS commands are being invoked often. The ideal size of an ORC or Parquet file for most purposes is somewhere between 256 MB and 2 GB and it's not uncommon to see these being micro-batched into HDFS as they're generated.
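For context, HDFS stores each file as a series of fixed-size blocks (128 MiB by default via dfs.blocksize), so a 1 GiB file occupies exactly eight blocks. A sketch of that arithmetic:

```python
import math

# HDFS splits files into fixed-size blocks; dfs.blocksize defaults
# to 134217728 bytes (128 MiB).
def block_count(file_bytes, block_bytes=134217728):
    return math.ceil(file_bytes / block_bytes)

# A 1 GiB file (1073741824 bytes) spans exactly 8 default-sized blocks.
print(block_count(1073741824))  # 8
```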

Below I'll generate a file containing a gigabyte of random data.

$ cat /dev/urandom \
    | head -c 1073741824 \
    > one_gig

Uploading this file via the JVM-based CLI took 18.6 seconds on my test rig.

$ hadoop fs -put one_gig /one_gig

Uploading via the Golang-based CLI took 13.2 seconds.

$ gohdfs-v2.0.0-linux-amd64/hdfs put one_gig /one_gig_2

Spotify's Python-based HDFS Client

In 2014, work began at Spotify on a Python-based HDFS CLI and library called Snakebite. The bulk of the commits on this project were put together by Wouter de Bie and Rafal Wojdyla. If you don't require Kerberos support then the only requirements for this client are Google's Protocol Buffers Python library and Python 2.7. As of this writing Python 3 isn't supported.

The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.

$ sudo apt install \
    python \
    python-pip \
    virtualenv

$ virtualenv .snakebite
$ source .snakebite/bin/activate
$ pip install snakebite

This client is not a drop-in replacement for the JVM-based CLI but shouldn't have a steep learning curve if you're already familiar with GNU Core Utilities file system commands.

$ snakebite
snakebite [general options] cmd [arguments]
general options:
  -D --debug                     Show debug information
  -V --version                   Hadoop protocol version (default:9)
  -h --help                      show help
  -j --json                      JSON output
  -n --namenode                  namenode host
  -p --port                      namenode RPC port (default: 8020)
  -v --ver                       Display snakebite version

commands:
  cat [paths]                    copy source paths to stdout
  chgrp <grp> [paths]            change group
  chmod <mode> [paths]           change file mode (octal)
  chown <owner:grp> [paths]      change owner
  copyToLocal [paths] dst        copy paths to local file system destination
  count [paths]                  display stats for paths
  df                             display fs stats
  du [paths]                     display disk usage statistics
  get file dst                   copy files to local file system destination
  getmerge dir dst               concatenates files in source dir into destination local file
  ls [paths]                     list a path
  mkdir [paths]                  create directories
  mkdirp [paths]                 create directories and their parents
  mv [paths] dst                 move paths to destination
  rm [paths]                     remove paths
  rmdir [dirs]                   delete a directory
  serverdefaults                 show server information
  setrep <rep> [paths]           set replication factor
  stat [paths]                   stat information
  tail path                      display last kilobyte of the file to stdout
  test path                      test a path
  text path [paths]              output file in text format
  touchz [paths]                 creates a file of zero length
  usage <cmd>                    show cmd usage

The client is missing certain verbs found in both the JVM-based client and the Golang-based client described above, one of which is the ability to copy files and streams to HDFS.

That being said I do appreciate how easy it is to pull statistics for a given file.

$ snakebite stat /one_gig
access_time             1539530885694
block_replication       1
blocksize               134217728
file_type               f
group                   supergroup
length                  1073741824
modification_time       1539530962824
owner                   mark
path                    /one_gig
permission              0644

To collect the same information with the JVM client would involve several commands. Their output would also be harder to parse than the key-value pairs above.
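As a sketch, those key-value pairs are straightforward to consume from a script. The helper below is hypothetical, not part of Snakebite:

```python
def parse_stat(output):
    """Parse whitespace-delimited key-value output like snakebite's stat."""
    stats = {}
    for line in output.strip().splitlines():
        # Split on the first run of whitespace; the rest is the value.
        key, _, value = line.partition(' ')
        stats[key] = value.strip()
    return stats

sample = """access_time             1539530885694
block_replication       1
length                  1073741824
owner                   mark"""

print(parse_stat(sample)['length'])  # 1073741824
```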

As well as being a CLI tool, Snakebite is also a Python library.

$ python
from snakebite.client import Client


client = Client("localhost", 9000, use_trash=False)
[x for x in client.ls(['/'])][:2]
[{'access_time': 1539530885694L,
  'block_replication': 1,
  'blocksize': 134217728L,
  'file_type': 'f',
  'group': u'supergroup',
  'length': 1073741824L,
  'modification_time': 1539530962824L,
  'owner': u'mark',
  'path': '/one_gig',
  'permission': 420},
 {'access_time': 1539531288719L,
  'block_replication': 1,
  'blocksize': 134217728L,
  'file_type': 'f',
  'group': u'supergroup',
  'length': 1073741824L,
  'modification_time': 1539531307264L,
  'owner': u'mark',
  'path': '/one_gig_2',
  'permission': 420}]

Note I've asked to connect to localhost on TCP port 9000. Out of the box Hadoop uses TCP port 8020 for the NameNode RPC endpoint. I've often changed this to TCP port 9000 in many of my Hadoop guides.

You can find the hostname and port number configured for this endpoint on the master HDFS node. Also note that, for various reasons, HDFS, and Hadoop in general, needs to use hostnames rather than IP addresses.

$ sudo vi /opt/hadoop/etc/hadoop/core-site.xml
...
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
...
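If you'd rather discover the endpoint programmatically than eyeball the XML, the property can be pulled out of core-site.xml with the standard library. This is an illustrative sketch, using the same property name as the snippet above:

```python
import xml.etree.ElementTree as ET

def namenode_endpoint(xml_text):
    """Extract the fs.default.name value from a core-site.xml document."""
    root = ET.fromstring(xml_text)
    for prop in root.iter('property'):
        if prop.findtext('name') == 'fs.default.name':
            return prop.findtext('value')
    return None

core_site = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>"""

print(namenode_endpoint(core_site))  # hdfs://localhost:9000
```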

Python-based HdfsCLI

In 2014, Matthieu Monsch also began work on a Python-based HDFS client called HdfsCLI. Two features this client has over Spotify's Python client are support for uploading to HDFS and support for Python 3 (in addition to 2.7).

Matthieu has previously worked at LinkedIn and now works for Google. The coding style of this project will feel very familiar to anyone that's looked at a Python project that has originated from Google.

This library includes support for a progress tracker, a fast AVRO library, Kerberos and Pandas DataFrames.

The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.

$ sudo apt install \
    python \
    python-pip \
    virtualenv

$ virtualenv .hdfscli
$ source .hdfscli/bin/activate
$ pip install 'hdfs[dataframe,avro]'

In order for this library to communicate with HDFS, WebHDFS needs to be enabled on the master HDFS node.

$ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hdfs/datanode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hdfs/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>localhost:50070</value>
    </property>
</configuration>
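The reason WebHDFS matters here is that this client speaks to HDFS over its REST API, which exposes operations at /webhdfs/v1/&lt;path&gt;?op=... on the NameNode HTTP port configured above. A sketch of how such a request URL is formed (the helper is illustrative, not part of the library):

```python
def webhdfs_url(host, port, path, op, user=None):
    """Build a WebHDFS v1 REST URL for the given operation."""
    url = 'http://%s:%d/webhdfs/v1%s?op=%s' % (host, port, path, op)
    if user:
        # WebHDFS passes the caller's identity as a query parameter
        # when Kerberos isn't in use.
        url += '&user.name=%s' % user
    return url

print(webhdfs_url('localhost', 50070, '/user/mark', 'LISTSTATUS'))
# http://localhost:50070/webhdfs/v1/user/mark?op=LISTSTATUS
```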

With the configuration in place the DFS service needs to be restarted.

$ sudo su
$ /opt/hadoop/sbin/stop-dfs.sh
$ /opt/hadoop/sbin/start-dfs.sh
$ exit

A configuration file is needed to store connection settings for this client.

$ vi ~/.hdfscli.cfg
[global]
default.alias = dev

[dev.alias]
url = http://localhost:50070
user = mark

A big downside of this client is that only a very limited subset of HDFS functionality is supported. That being said, the verb arguments are pretty self-explanatory.

$ hdfscli --help
HdfsCLI: a command line interface for HDFS.

Usage:
  hdfscli [interactive] [-a ALIAS] [-v...]
  hdfscli download [-fsa ALIAS] [-v...] [-t THREADS] HDFS_PATH LOCAL_PATH
  hdfscli upload [-sa ALIAS] [-v...] [-A | -f] [-t THREADS] LOCAL_PATH HDFS_PATH
  hdfscli -L | -V | -h

Commands:
  download                      Download a file or folder from HDFS. If a
                                single file is downloaded, - can be
                                specified as LOCAL_PATH to stream it to
                                standard out.
  interactive                   Start the client and expose it via the python
                                interpreter (using iPython if available).
  upload                        Upload a file or folder to HDFS. - can be
                                specified as LOCAL_PATH to read from standard
                                in.

Arguments:
  HDFS_PATH                     Remote HDFS path.
  LOCAL_PATH                    Path to local file or directory.

Options:
  -A --append                   Append data to an existing file. Only supported
                                if uploading a single file or from standard in.
  -L --log                      Show path to current log file and exit.
  -V --version                  Show version and exit.
  -a ALIAS --alias=ALIAS        Alias of namenode to connect to.
  -f --force                    Allow overwriting any existing files.
  -s --silent                   Don't display progress status.
  -t THREADS --threads=THREADS  Number of threads to use for parallelization.
                                0 allocates a thread per file. [default: 0]
  -v --verbose                  Enable log output. Can be specified up to three
                                times (increasing verbosity each time).

Examples:
  hdfscli -a prod /user/foo
  hdfscli download features.avro dat/
  hdfscli download logs/1987-03-23 - >>logs
  hdfscli upload -f - data/weights.tsv <weights.tsv

HdfsCLI exits with return status 1 if an error occurred and 0 otherwise.

The following finished in 68 seconds. This is an order of magnitude slower than some other clients I'll explore in this post.

$ hdfscli upload \
          -t 4 \
          -f - \
          one_gig_3 < one_gig

That said, the client is very easy to work with in Python.

$ python
from hashlib import sha256

from hdfs import Config


client = Config().get_client('dev')
[client.status(uri)
 for uri  in client.list('')][:2]
[{u'accessTime': 1539532953515,
  u'blockSize': 134217728,
  u'childrenNum': 0,
  u'fileId': 16392,
  u'group': u'supergroup',
  u'length': 1073741824,
  u'modificationTime': 1539533029897,
  u'owner': u'mark',
  u'pathSuffix': u'',
  u'permission': u'644',
  u'replication': 1,
  u'storagePolicy': 0,
  u'type': u'FILE'},
 {u'accessTime': 1539533046246,
  u'blockSize': 134217728,
  u'childrenNum': 0,
  u'fileId': 16393,
  u'group': u'supergroup',
  u'length': 1073741824,
  u'modificationTime': 1539533114772,
  u'owner': u'mark',
  u'pathSuffix': u'',
  u'permission': u'644',
  u'replication': 1,
  u'storagePolicy': 0,
  u'type': u'FILE'}]

Below I'll generate a SHA-256 hash of a file located on HDFS.

with client.read('/user/mark/one_gig') as reader:
    print sha256(reader.read()).hexdigest()[:6]
ab5f97

Apache Arrow's HDFS Client

Apache Arrow is a cross-language platform for in-memory data headed by Wes McKinney. Its Python bindings, "PyArrow", allow Python applications to interface with a C++-based HDFS client.

Wes stands out in the data world. He has worked for Cloudera in the past, created the Pandas Python package and has been a contributor to the Apache Parquet project.

The following will install PyArrow on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment.

$ sudo apt install \
    python \
    python-pip \
    virtualenv

$ virtualenv .pyarrow
$ source .pyarrow/bin/activate
$ pip install pyarrow

The Python API behaves in a clean and intuitive manner.

$ python
from   hashlib import sha256

import pyarrow as pa


hdfs = pa.hdfs.connect(host='localhost', port=9000)

with hdfs.open('/user/mark/one_gig', 'rb') as f:
    print sha256(f.read()).hexdigest()[:6]
ab5f97

The interface is very performant as well. The following completed in 6 seconds, the fastest any client transferred this file on my test rig.

with hdfs.open('/user/mark/one_gig_4', 'wb') as f:
    f.write(open('/home/mark/one_gig').read())
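One caveat: the example above reads the entire gigabyte into memory before writing. Copying in fixed-size chunks keeps memory usage flat. The loop below works with any pair of binary file objects, including pyarrow's HDFS handles; it's shown with in-memory buffers so it is self-contained:

```python
import io

def copy_chunked(src, dst, chunk_size=4 * 1024 * 1024):
    """Copy src to dst in fixed-size chunks; returns total bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total

# Demonstrate with 10 MiB of in-memory data; with pyarrow you would
# pass a local file handle and an hdfs.open(..., 'wb') handle instead.
src = io.BytesIO(b'x' * (10 * 1024 * 1024))
dst = io.BytesIO()
print(copy_chunked(src, dst))  # 10485760
```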

HDFS Fuse and GNU Core Utilities

For a newcomer, one of the most intuitive ways to interact with HDFS is likely via the GNU Core Utilities file system commands. These can be run against an HDFS mount exposed via a FUSE file system.

The following will install an HDFS fuse client on a fresh Ubuntu 16.04.2 LTS system using a Debian package from Cloudera's repository.

$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/archive.key -O - \
    | sudo apt-key add -
$ wget https://archive.cloudera.com/cdh5/ubuntu/xenial/amd64/cdh/cloudera.list -O - \
    | sudo tee /etc/apt/sources.list.d/cloudera.list
$ sudo apt update
$ sudo apt install hadoop-hdfs-fuse
$ sudo mkdir -p hdfs_mount
$ sudo hadoop-fuse-dfs \
    dfs://127.0.0.1:9000 \
    hdfs_mount

The following completed in 8.2 seconds.

$ cp one_gig hdfs_mount/one_gig_5

Regular file system commands run as expected. The following shows how much space has been used and how much is available.

$ df -h hdfs_mount/
Filesystem      Size  Used Avail Use% Mounted on
fuse_dfs         12G  4.9G  6.8G  42% /home/mark/hdfs_mount

The following shows how much disk space has been used per parent folder.

$ du -hs hdfs_mount/
4.8G    hdfs_mount/

This will give a file listing by file size.

$ ls -lhS hdfs_mount/user/mark/
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:03 one_gig
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:05 one_gig_2
-rw-r--r-- 1 mark 99 1.0G Oct 14 09:10 one_gig_3
...

Copyright © 2014 - 2017 Mark Litwintschik.