Posted on Fri 01 January 2016

Hadoop Up and Running

In this blog post I will walk you through three ways of setting up Hadoop. Each method has its pros and cons and, in some cases, a somewhat narrow use case.

Installing Hadoop via Debian Packages

If you are working on building Map Reduce jobs packaged as JAR files and want to run them locally then installing a single-node setup on your local system can be a quick way of getting everything going.

Ubuntu does not ship with a Hadoop package that can be installed via apt-get, but there is a PPA with various Hadoop tools and boilerplate configurations bundled together. As of this writing only Ubuntu 11, 12 and 14 are supported. The installation below was run on Ubuntu 14.04 LTS, which will be supported until April 2019. By contrast, Ubuntu 15.10's support will end in July 2016.

To start, add the repository for hadoop-ubuntu:

$ sudo add-apt-repository ppa:hadoop-ubuntu/stable
 Hadoop Stable packages

These packages are based on Apache Bigtop with appropriate patches to enable native integration on Ubuntu Oneiric onwards and for ARM based archictectures.

Please report bugs here - https://bugs.launchpad.net/hadoop-ubuntu-packages/+filebug
 More info: https://launchpad.net/~hadoop-ubuntu/+archive/ubuntu/stable
Press [ENTER] to continue or ctrl-c to cancel adding it

gpg: keyring `/tmp/tmpe8tsd3jf/secring.gpg' created
gpg: keyring `/tmp/tmpe8tsd3jf/pubring.gpg' created
gpg: requesting key 84FBAFF0 from hkp server keyserver.ubuntu.com
gpg: /tmp/tmpe8tsd3jf/trustdb.gpg: trustdb created
gpg: key 84FBAFF0: public key "Launchpad PPA for Hadoop Ubuntu Packagers" imported
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)
OK

Retrieve the package lists associated with hadoop-ubuntu and then install the OpenJDK 7 development kit and the Hadoop package:

$ sudo apt-get update
$ sudo apt-get install \
    openjdk-7-jdk \
    hadoop

Adjusting the Boilerplate Configuration

There are four configuration files that need to be adjusted before you can format your name node. Three of these are XML files and the fourth is a shell script containing Hadoop's environment variables. These will be used when you run any Hadoop jobs.

Core Site Configuration:

$ sudo vi /etc/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
</configuration>
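As the description in the file notes, the fs.default.name value is an ordinary URI: its scheme selects the FileSystem implementation (via fs.SCHEME.impl) and its authority supplies the host and port. A quick sketch with Python's standard-library URI parser shows the pieces Hadoop extracts:

```python
from urllib.parse import urlparse

# The default file system URI from core-site.xml above.
uri = urlparse("hdfs://localhost:54310")

print(uri.scheme)    # "hdfs" -> Hadoop looks up fs.hdfs.impl
print(uri.hostname)  # "localhost"
print(uri.port)      # 54310
```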

Map Reduce Site Configuration:

$ sudo vi /etc/hadoop/conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>
</configuration>

HDFS Site Configuration:

$ sudo vi /etc/hadoop/conf/hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>
</configuration>

Hadoop Environment Configuration:

$ sudo vi /etc/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/usr
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"

Format the Hadoop File System

The following will create the storage system Hadoop will use to read and write files. If this is done on more than one machine the storage system can be referred to as distributed.

$ sudo hadoop namenode -format
... namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
************************************************************/
... util.GSet: VM type       = 64-bit
... util.GSet: 2% max memory = 19.33375 MB
... util.GSet: capacity      = 2^21 = 2097152 entries
... util.GSet: recommended=2097152, actual=2097152
... namenode.FSNamesystem: fsOwner=root
... namenode.FSNamesystem: supergroup=supergroup
... namenode.FSNamesystem: isPermissionEnabled=true
... namenode.FSNamesystem: dfs.block.invalidate.limit=100
... namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
... namenode.NameNode: Caching file names occuring more than 10 times
... common.Storage: Image file of size 110 saved in 0 seconds.
... common.Storage: Storage directory /tmp/dfs/name has been successfully formatted.
... namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

The key piece of information you're looking for is:

Storage directory /tmp/dfs/name has been successfully formatted.

If you don't see this then something has gone wrong.

Setting up Hadoop's SSH access

Even though this is a single-node instance, Hadoop expects to SSH into each machine, whether logical or physical, to conduct its operations. It'll need full permissions when doing so and expects to connect as the root user.

The following confirms that sshd allows root to shell into the machine without a password.

$ grep PermitRootLogin /etc/ssh/sshd_config
PermitRootLogin without-password

Then, assuming root doesn't yet have a private and public key pair, one needs to be generated. The public key can then be added to the authorized keys list so that root can SSH into the server using its own account.

$ sudo su
root@ubuntu:~# ssh-keygen
root@ubuntu:~# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Launching Hadoop's Processes

With the above all in place you can run a shell script that will launch the various services needed for a functioning Hadoop instance.

$ sudo /usr/lib/hadoop/bin/start-all.sh
starting namenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/lib/hadoop/libexec/../logs/hadoop-root-tasktracker-ubuntu.out

If you run the JVM process status tool you should see the various nodes and trackers up and running:

$ sudo jps
19892 TaskTracker
19724 JobTracker
19295 NameNode
19456 DataNode
19627 SecondaryNameNode
20035 Jps

Running a Hadoop Job

Hadoop comes with a number of interesting example jobs which showcase the functionality it offers. The following example is a job which estimates the value of Pi in a distributed manner using a quasi-Monte Carlo / "dart-throwing" method. I came across this YouTube video which gives a short explanation of how it works.

There are two key parameters being passed to this job: the number of maps (2 in this case) and the number of samples per map (1000). You can raise or lower the sample count to change the number of darts being thrown (and thus the precision of the final estimate) and you can raise or lower the number of maps to affect the performance and/or overhead of running the job.
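The dart-throwing idea itself is easy to sketch outside of Hadoop. Below is a rough, single-process Python version: darts land uniformly in the unit square and the fraction falling inside the quarter circle approaches Pi/4. (Hadoop's actual implementation distributes the sampling across map tasks and uses a quasi-random Halton sequence rather than a pseudo-random generator, so treat this only as an illustration of the method.)

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=1):
    # Throw darts at the unit square; the fraction landing inside
    # the quarter circle of radius 1 approaches pi / 4.
    random.seed(seed)
    total = num_maps * samples_per_map
    inside = 0
    for _ in range(total):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

estimate = estimate_pi(2, 1000)
print(estimate)  # close to 3.14; precision improves with more samples
```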

$ sudo /usr/lib/hadoop/bin/hadoop jar \
       /usr/lib/hadoop/hadoop-examples-1.0.3.jar \
       pi 2 1000

On top of getting the status of the job as it's executing you'll receive telemetry regarding the job in the form of counters. These can be used to help determine where bottlenecks exist and help you tune for better performance.

Number of Maps  = 2
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Starting Job
... FileInputFormat: Total input paths to process : 2
... JobClient: Running job: job_201512291551_0004
... JobClient:  map 0% reduce 0%
... JobClient:  map 100% reduce 0%
... JobClient:  map 100% reduce 100%
... JobClient: Job complete: job_201512291551_0004
... JobClient: Counters: 30
... JobClient:   Job Counters
... JobClient:     Launched reduce tasks=1
... JobClient:     SLOTS_MILLIS_MAPS=19533
... JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
... JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
... JobClient:     Launched map tasks=2
... JobClient:     Data-local map tasks=2
... JobClient:     SLOTS_MILLIS_REDUCES=10268
... JobClient:   File Input Format Counters
... JobClient:     Bytes Read=236
... JobClient:   File Output Format Counters
... JobClient:     Bytes Written=97
... JobClient:   FileSystemCounters
... JobClient:     FILE_BYTES_READ=50
... JobClient:     HDFS_BYTES_READ=480
... JobClient:     FILE_BYTES_WRITTEN=64683
... JobClient:     HDFS_BYTES_WRITTEN=215
... JobClient:   Map-Reduce Framework
... JobClient:     Map output materialized bytes=56
... JobClient:     Map input records=2
... JobClient:     Reduce shuffle bytes=56
... JobClient:     Spilled Records=8
... JobClient:     Map output bytes=36
... JobClient:     Total committed heap usage (bytes)=335749120
... JobClient:     CPU time spent (ms)=900
... JobClient:     Map input bytes=48
... JobClient:     SPLIT_RAW_BYTES=244
... JobClient:     Combine input records=0
... JobClient:     Reduce input records=4
... JobClient:     Reduce input groups=4
... JobClient:     Combine output records=0
... JobClient:     Physical memory (bytes) snapshot=415301632
... JobClient:     Reduce output records=0
... JobClient:     Virtual memory (bytes) snapshot=3254882304
... JobClient:     Map output records=4

Once the job is finished the estimation is written to stdout:

Job Finished in 31.394 seconds
Estimated value of Pi is 3.14400000000000000000
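The counter lines in the output above follow a simple Name=value convention, so they can be scraped into a dict for comparison across runs. A rough sketch (the "... JobClient:" log prefix is assumed from the output shown above):

```python
import re

def parse_counters(log_text):
    # Collect "Name=value" counter lines emitted by JobClient.
    counters = {}
    for line in log_text.splitlines():
        m = re.search(r'JobClient:\s+(.+?)=(\d+)\s*$', line)
        if m:
            counters[m.group(1)] = int(m.group(2))
    return counters

sample = """\
... JobClient:     Launched map tasks=2
... JobClient:     FILE_BYTES_READ=50
... JobClient:  map 100% reduce 100%
"""
print(parse_counters(sample))
```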

Bigtop: Run a Dockerized-Hadoop Cluster

The Hadoop ecosystem contains a large number of individual tools. Much like UNIX, each tool does one specific thing well. Though most Hadoop tools aren't scoped quite as narrowly as the programs in GNU core utilities, for example, the philosophy persists.

When the Linux community began there was a need to gather a large number of tools and integrate them into a functioning system; thus Linux distros were born. A similar concept exists in the Hadoop ecosystem in the form of pre-packaged virtual machines, but these often have opaque histories as to how exactly they were pieced together. A system that not only helps you accomplish your work but also lets you examine exactly how the tools and configurations were assembled, and run tests against them, is extremely educational.

Apache Bigtop looks to address these needs. It aims to incorporate a wide variety of Hadoop tooling rather than just one or a few packages. Out of the box you can set up a Hadoop cluster via Puppet scripts in Docker containers and interact with it via Vagrant. I've tested the setup successfully on numerous occasions using an Ubuntu 15 host.

Please note: there is a lot to Bigtop and I will only be scratching the surface of its functionality in this blog post.

Installing Bigtop's dependencies

To start, the following will setup the dependencies needed to build and run the Docker containers which will act as a Hadoop cluster.

Docker is installed via a third-party repository so we'll need to add its signing key and repository details first.

$ sudo apt-get install software-properties-common
$ sudo apt-key adv \
    --keyserver hkp://p80.pool.sks-keyservers.net:80 \
    --recv-keys 58118E89F3A912897C070ADBF76221572C52609D
$ sudo add-apt-repository "deb https://apt.dockerproject.org/repo ubuntu-wily main"
$ sudo apt-get update

If there happens to be any existing Docker installation on the system make sure it is removed to avoid any version conflicts:

$ sudo apt-get purge lxc-docker

You may need extra drivers which are left out of the base kernel package that comes with Ubuntu. I found this wasn't needed on Ubuntu 15, but if the following command does install any then restart your machine after the installation completes.

$ sudo apt-get install linux-image-extra-$(uname -r)

If you see the following then no extra drivers were installed and you can continue on without a reboot:

0 to upgrade, 0 to newly install, 0 to remove and 83 not to upgrade.

The following will install Docker, Virtualbox and Vagrant:

$ sudo apt-get install \
    docker-engine \
    virtualbox \
    vagrant
$ sudo service docker start

Pick your packages

Now that the requirements are installed we can check out the master branch of the Bigtop repository and adjust the list of components we wish to install:

$ git clone https://github.com/apache/bigtop.git
$ cd bigtop/bigtop-deploy/vm/vagrant-puppet-docker/
$ vi vagrantconfig.yaml

By default the Hadoop and YARN components are installed but Bigtop supports a larger number of packages.

This is the default component list:

components: [hadoop, yarn]

If you change the component list to this:

components: [all]

Then, among others, Crunch, Datafu, Flume, Giraph, Hama, HBase, Hive, Hue, Ignite, Kafka, Kite, Mahout, Oozie, Phoenix, Pig, Solr, Spark, Sqoop 1 and 2, Tachyon, Tez, Zeppelin and Zookeeper will be installed.

Build your containers

The following will pull down a CentOS 6 base for each of the Docker containers to use and create a 3-node Hadoop cluster. There is a lot of packaging that will be pulled down and installed so this could take a while. When I last ran this it took 90 minutes to complete.

$ sudo docker pull bigtop/deploy:centos-6
$ sudo ./docker-hadoop.sh --create 3

Once that is complete you should see three machines running:

$ sudo vagrant status
Current machine states:

bigtop1                   running (docker)
bigtop2                   running (docker)
bigtop3                   running (docker)

Running a job on your new cluster

With your cluster running you can SSH into the master node.

$ sudo vagrant ssh bigtop1

Before anything else confirm that you can see a functioning Hadoop file system:

$ hadoop fs -ls /

If you see a list of directories and no ugly stack traces then you're good to run a Hadoop job:

$ hadoop jar \
    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    pi 10 1000000

Here is the output I saw when I ran this:

...
Estimated value of Pi is 3.14158440000000000000

If you want to destroy the cluster, exit out of the master node and run the following:

$ sudo ./docker-hadoop.sh --destroy
==> bigtop3: Stopping container...
==> bigtop3: Deleting the container...
==> bigtop2: Stopping container...
==> bigtop2: Deleting the container...
==> bigtop1: Stopping container...
==> bigtop1: Deleting the container...
removed './hosts'
removed './config.rb'

Amazon's Elastic Map Reduce (EMR)

Amazon's Elastic Map Reduce service allows you to setup pre-configured Hadoop clusters using cheap machines that you can pay for by the hour. Here I will walk through setting up a cluster that consists of 3 nodes.

To follow along with this example you'll need an Amazon Web Services account, the access key and secret key you'll receive once you've signed up, and a key pair file which you can generate by going to the EC2 dashboard -> Network & Security -> Key Pairs. Once you've generated a key pair a .pem file will be downloaded onto your machine. Make sure its permissions are set to 600 at most in order for the various tools to accept working with it.

Set AWS Credentials and Install Dependencies

The following was all run on an Ubuntu 15 machine.

To start I'll set the AWS credentials environment variables. I use the read command so that the credentials won't sit in my bash history:

$ read AWS_ACCESS_KEY_ID
$ read AWS_SECRET_ACCESS_KEY
$ export AWS_ACCESS_KEY_ID
$ export AWS_SECRET_ACCESS_KEY

After that I'll install curl and some tools used to run Python-based programs:

$ sudo apt-get install \
    python-pip \
    python-virtualenv \
    curl

I want to run the Pi estimation job on EMR so I'll download Hadoop and pluck out the Map Reduce examples JAR file:

$ curl -O http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz
$ tar zxf hadoop-2.6.3.tar.gz
$ cp hadoop-2.6.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.3.jar ./

I'll be using a tool called mrjob that allows me to define Map Reduce jobs using a small amount of Python code and a YAML configuration file. To start I'll create a virtual environment that mrjob, the AWS cli tool and s3cmd will be installed into:

$ virtualenv test
$ source test/bin/activate
$ pip install \
    mrjob \
    awscli \
    s3cmd

Then I'll create a small Python script that will define the step(s) my Map Reduce job will run:

$ vi calc_pi_job.py
from mrjob.job import MRJob
from mrjob.step import JarStep


class CalcPiJob(MRJob):

    def steps(self):
        return [JarStep(jar='hadoop-mapreduce-examples-2.6.3.jar',
                        args=['pi', '10', '100000'])]


if __name__ == '__main__':
    CalcPiJob.run()

MRJob expects a manifest of source file locations to be included but in this case we're only running a computational exercise so I'll create an empty manifest file.

$ touch none

Bargain Hunting for EC2 Instances

I need to tell mrjob what kind of instances I want to run in my cluster and the spot prices I'm willing to pay Amazon for those instances.

$ vi mrjob.conf
runners:
  emr:
    aws_region: us-east-1
    ec2_master_instance_type: c3.xlarge
    ec2_master_instance_bid_price: '0.05'
    ec2_instance_type: c3.xlarge
    ec2_core_instance_bid_price: '0.05'
    num_ec2_instances: 2
    ec2_key_pair_file: emr.pem

Spot prices on Amazon Web Services can be ~80% cheaper than on-demand prices but your bids aren't guaranteed to be accepted. If there is a lot of demand for the same instances in the availability zone you wish to operate in then you'll need to increase your bid. The instances can be switched off at any time, with a 60-second notice, if your maximum bid is no longer competitive with the current demand levels.

Hadoop is built to handle non-master nodes disappearing so it's not the end of the world if one or two task runners in a cluster disappear.

In the above example I'm bidding a maximum of $0.05 / hour for 3 c3.xlarge instances. These instances each come with 4 virtual CPUs, 7.5 GB of memory and 2 x 40 GB SSD drives. The on-demand price is $0.239 an hour. If my bid is accepted, $0.05 / hour per instance will be the maximum I pay; the final price could be even lower than that. In the past I've had 200 much more powerful instances running all day (and night in some cases) for under $10 / hour in total.

There is also a small overhead cost for running via EMR, as it saves you a lot of setup time. I've found this cost to be around 30% on top of the spot instances' combined cost.
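Putting the figures above together gives a back-of-the-envelope hourly cost for the 3-node cluster. The numbers ($0.05 / hour spot bid, $0.239 / hour on-demand for c3.xlarge, ~30% EMR overhead) are taken from this post and will drift over time:

```python
# Prices as quoted in the text; treat them as a snapshot, not current rates.
spot_bid = 0.05       # maximum spot bid per instance per hour
on_demand = 0.239     # c3.xlarge on-demand price per hour
instances = 3
emr_overhead = 0.30   # observed EMR service cost on top of spot cost

spot_hourly = spot_bid * instances
with_emr = spot_hourly * (1 + emr_overhead)
discount = 1 - spot_bid / on_demand

print(f"Spot cluster: ${spot_hourly:.2f}/hr")
print(f"With EMR overhead: ${with_emr:.3f}/hr")
print(f"Discount vs on-demand: {discount:.0%}")
```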

If you are using this service as a personal user in Europe you can expect to pay VAT on top of your net cost (20% in the UK, 23% in Ireland, etc...). If you're a business with a VAT number you can have Amazon charge you at a 0% VAT rate if it's permissible to do so in your tax region.

Finally, there is a per-account limit on the number of instances you can launch. My personal account is limited to 50 instances. I've had clients who have requested and been granted permission to provision hundreds of instances. I know of a firm that was denied permission to launch 1,000 GPU-based instances.

Amazon has kept secret the underlying factors that determine the minimum price you need to bid to keep your spot instances running. Some people are trying to uncover those factors.

Executing a Hadoop Job on EMR

The following will trigger the Map Reduce Job:

$ python calc_pi_job.py \
    -r emr \
    --conf-path mrjob.conf \
    --output-dir s3://<throw away s3 bucket>/test1/ \
    none

A number of things will happen when this script runs:

  • An S3 bucket will be created where log files from most services within the cluster will be stored in a compressed format. If you're running 40-50 instances these compressed files can easily be hundreds of MB in size.
  • A cluster ID will be assigned (this is called a Job flow ID).
  • If the bid price is accepted the EC2 instances will be provisioned. Whether you're setting up 3 instances or 200, this always seems to take 5 - 15 minutes when running in Amazon's Virginia-based us-east-1 region.
  • Any software that's needed for running the job will be installed in a bootstrapping process.
  • Your job itself will be run.
  • Whether your job writes output files to S3 or prints a single value to stdout on the master node depends on how you've written it. In this case it'll output the estimated value of Pi to stdout on the master node. This information should be stored in an stdout.gz file and shipped to S3.
  • The cluster will then be terminated so that you only pay for the full hours of each instance used. The job above finished in about 12 minutes so we paid for the first hour of each instance (3 instance hours), roughly $0.16 in total once EMR's service cost is included.

The output from the above job will look something like the following:

creating new scratch bucket mrjob-3568101f09d5f75c
using s3://mrjob-3568101f09d5f75c/tmp/ as our scratch dir on S3
creating tmp directory /tmp/calc_pi_job.mark.20151229.160731.659742
writing master bootstrap script to /tmp/calc_pi_job.mark.20151229.160731.659742/b.py
creating S3 bucket 'mrjob-3568101f09d5f75c' to use as scratch space
Copying non-input files into s3://mrjob-3568101f09d5f75c/tmp/calc_pi_job.mark.20151229.160731.659742/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Auto-created instance profile mrjob-5cc7d6cec347fa81
Auto-created service role mrjob-84fd09862fa415d0
Job flow created with ID: j-3B6TE1AW5RVS5
Created new job flow j-3B6TE1AW5RVS5
Job launched 31.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 62.5s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 93.7s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 125.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 156.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 187.5s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 218.7s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 250.0s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 281.2s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 312.4s ago, status STARTING: Provisioning Amazon EC2 capacity
Job launched 343.7s ago, status STARTING: Configuring cluster software
Job launched 374.9s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 406.2s ago, status BOOTSTRAPPING: Running bootstrap actions
Job launched 437.5s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job launched 468.8s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job launched 500.0s ago, status RUNNING: Running step (calc_pi_job.mark.20151229.160731.659742: Step 1 of 1)
Job completed.
Running time was 74.0s (not counting time spent waiting for the EC2 instances)
ec2_key_pair_file not specified, going to S3
Fetching counters from S3...
Waiting 5.0s for S3 eventual consistency
Counters from step 1:
  (no counters found)
Streaming final output from s3://<throw away s3 bucket>/test1/
removing tmp directory /tmp/calc_pi_job.mark.20151229.160731.659742
Removing all files in s3://mrjob-3568101f09d5f75c/tmp/calc_pi_job.mark.20151229.160731.659742/
Removing all files in s3://mrjob-3568101f09d5f75c/tmp/logs/j-3B6TE1AW5RVS5/
Terminating job flow: j-3B6TE1AW5RVS5

Normally you can then download the log files and grep out the estimated value of Pi:

$ s3cmd get --recursive s3://mrjob-3568101f09d5f75c/tmp/logs/
$ cd j-3B6TE1AW5RVS5
$ find . -type f -name '*.gz' -exec gunzip "{}" \;
$ grep -r 'Estimated value of Pi is' * | wc -l
0

While the estimated value of Pi was printed to the stdout log file on the master node, and that file should have been shipped to S3, it wasn't for some reason. If the cluster is not set to auto-terminate you can SSH in and see the value sitting in the stdout log file on the master node itself:

$ aws emr ssh \
    --key-pair-file emr.pem \
    --region us-east-1 \
    --cluster-id j-3B6TE1AW5RVS5
$ cat /mnt/var/log/hadoop/steps/s-3QSFO02HYEA8/stdout
Number of Maps  = 10
Samples per Map = 100000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 186.566 seconds
Estimated value of Pi is 3.14155200000000000000

I've discussed ways of trying to fix or work around this issue in a ticket I raised with the mrjob community on GitHub. Our efforts are ongoing.

Thank you for taking the time to read this post. I offer consulting, architecture and hands-on development services to clients in Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.

Copyright © 2014 - 2017 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.