
Posted on Tue 10 August 2021 under File and Data Management

MinIO: A Bare Metal Drop-In for AWS S3

In 2006, AWS launched S3, a distributed object storage service on their newly-launched Cloud platform. Up until this point, developers often interacted with storage on physical devices using POSIX-compatible file system operations. These "local" storage offerings had file systems, stored data in blocks and often had a fixed amount of metadata describing any one file. UNIX was built on the idea that "everything is a file".

S3 isn't a regular file system; it's a distributed object storage service with an API that was distinctive for its time, and it quickly grew in popularity among developers.

S3 didn't require lengthy storage capacity planning. The cost of storage and access often beat the cost of buying storage hardware and paying staff to manage it. Data durability of 99.999999999% was unheard of at the time. Objects could also contain more arbitrary metadata properties than the standard metadata you'd find with exFAT or the EXT4 filesystem.
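
As an aside on that last point, the short sketch below shows how arbitrary metadata can be attached to an S3 object using Boto3, AWS' Python SDK, which I'll come back to later in this post. The bucket name, key and metadata values are hypothetical placeholders and credentials are assumed to already be configured.

# A minimal sketch of S3's arbitrary per-object metadata using Boto3.
# The bucket, key and metadata below are hypothetical placeholders and
# credentials are assumed to already be configured.
import boto3

s3 = boto3.client('s3')

s3.put_object(Bucket='example-bucket',
              Key='telemetry/2021-08-10.csv',
              Body=b'col1,col2\n1,2\n',
              Metadata={'source': 'aircraft-42',
                        'schema-version': '3'})

# The user-defined metadata is returned alongside the object's standard
# metadata.
print(s3.head_object(Bucket='example-bucket',
                     Key='telemetry/2021-08-10.csv')['Metadata'])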

In the 15 years since, a large amount of software has been written that interacts with the AWS S3 API. Other Cloud vendors have sprung up with their own offerings, namely Azure Blob Storage and Google Cloud Storage, both of which launched in 2010, four years after S3.

Other firms have built their own object storage technologies. Google's GFS pre-dates AWS S3 and its successor, Colossus, is still used today as the storage backend to YouTube's Procella database. Facebook built Haystack, then later f4 and recently unveiled the Tectonic Filesystem. Twitter built an in-house photo storage system, LinkedIn built their own geographically distributed object storage system called Ambry and Dropbox built Magic Pocket. Unfortunately, none of the above are available to the public, let alone something outsiders can run on their own hardware.

I've worked for financial services firms in the past where client data wasn't allowed to live on hardware that wasn't owned and operated by the bank. During these engagements, there were a lot of aspects of Cloud-based object storage I missed.

Recently I started looking into open source object storage projects and I came across MinIO. Among its long list of features, it offers an S3 gateway service that can allow you to expose Hadoop's distributed file system (HDFS) with an AWS S3-compatible interface.

MinIO is made up of 160K lines of Go code, written primarily by Harshavardhana. He worked on GlusterFS for years prior to starting MinIO, both before and after the Red Hat acquisition.

Is this only useful for a handful of banks?

Most businesses are data businesses these days and most have at least one foot in the Cloud. It's becoming rarer for firms to operate their own hardware. That being said, there are enough businesses that are at least partially no-Cloud. Below are seven situations where I could see a business deciding against using a Cloud storage provider.

  1. Data transfer times are too much of an overhead. An airline with a fleet of 100 aircraft producing 200 TB of telemetry each week and poor network connectivity at its hub is one example. Another reason could be that their preventive maintenance software was architected around bare metal.
  2. Data needs to move between different entities frequently. AWS S3 egress costs $80 / TB; if you need to share data with a non-AWS storage provider, this overhead might become a burden (see the worked example after this list).
  3. Data usage patterns that are not optimised for S3 or are outside of AWS' terms of use.
  4. The Economist wrote a piece a few years ago stating energy firms in Alberta, Canada were fearful of rent-seeking from Cloud providers and were holding back on certain Cloud investments as a result.
  5. Firms with a security architecture that doesn't allow data to be exposed to any internet-facing machines.
  6. Firms with existing hardware and/or platform provider agreements that are expensive and/or difficult to break.
  7. Firms that see object storage management as a core competency and have usage patterns that are less expensive to manage in-house.
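
To put the egress figure in item two into concrete terms, below is a rough, illustrative calculation using the airline example from item one. The figures are only there to show the order of magnitude involved.

# Illustrative only: rough S3 egress cost for moving the airline example's
# telemetry out of AWS, using the ~$80 / TB figure quoted above.
TB_PER_WEEK = 200
EGRESS_USD_PER_TB = 80

weekly_cost = TB_PER_WEEK * EGRESS_USD_PER_TB
print(f'${weekly_cost:,} / week, ~${weekly_cost * 52:,} / year')
# $16,000 / week, ~$832,000 / year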

HDFS Up & Running

I'll be installing HDFS and MinIO on a single machine running Ubuntu 20.04 LTS with 16 GB of RAM and 1 TB of SSD storage.

Hadoop 3.3.1 can only be built using JVM 8 but can run on either version 8 or 11. Given the push towards 11 within the Hadoop ecosystem, I'll be using 11 for this walk-through.

$ sudo apt install \
    openjdk-11-jre \
    openjdk-11-jdk-headless

I'll centralise the environment variables Hadoop relies on below.

$ sudo vi /etc/profile
if [ "$PS1" ]; then
  if [ "$BASH" ] && [ "$BASH" != "/bin/sh" ]; then
    # The file bash.bashrc already sets the default PS1.
    # PS1='\h:\w\$ '
    if [ -f /etc/bash.bashrc ]; then
      . /etc/bash.bashrc
    fi
  else
    if [ "`id -u`" -eq 0 ]; then
      PS1='# '
    else
      PS1='$ '
    fi
  fi
fi

if [ -d /etc/profile.d ]; then
  for i in /etc/profile.d/*.sh; do
    if [ -r $i ]; then
      . $i
    fi
  done
  unset i
fi

export HADOOP_HOME=/opt/hadoop
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export JAVA_HOME=/usr

I'll make sure the root user is also using the same settings.

$ sudo ln -sf /etc/profile \
              /root/.bashrc
$ source /etc/profile

I'll set up folders for Hadoop's code and HDFS' file storage.

$ sudo mkdir -p /opt/{hadoop,hdfs/{datanode,namenode}}

I'll then download and extract Hadoop 3.3.1.

$ wget -c -O hadoop.tar.gz  https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
$ sudo tar xvf hadoop.tar.gz \
      --directory=/opt/hadoop \
      --exclude=hadoop-3.3.1/share/doc \
      --strip 1

This is a single-node instance of Hadoop so I'll set the hostname lists to "localhost".

$ echo "localhost" | sudo tee /opt/hadoop/etc/hadoop/master
$ echo "localhost" | sudo tee /opt/hadoop/etc/hadoop/slaves

The following will set HDFS' Namenode TCP port to 9000.

$ sudo vi /opt/hadoop/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000/</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000/</value>
    </property>
</configuration>

The following will point HDFS' storage to local folders on this system. The replication value will be set to 1 since there is only a single machine running HDFS.

$ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hdfs/datanode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hdfs/namenode</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>localhost:50070</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

I'll switch to the root user and make sure it can SSH into its own account. HDFS uses SSH for running management commands.

$ sudo su
$ ssh-keygen
$ cp /root/.ssh/id_rsa.pub \
     /root/.ssh/authorized_keys
$ ssh localhost uptime

I'll then run the format command, launch HDFS and grant my UNIX account permissions to everything on HDFS.

$ hdfs namenode -format
$ start-dfs.sh
$ hdfs dfs -chown mark /

I'll then check the capacity is as expected before exiting the root user's shell.

$ hdfs dfsadmin -report \
    | grep 'Configured Capacity' \
    | tail -n1

$ exit

MinIO Up & Running

The following will download and install MinIO's server and client.

$ wget -c https://dl.min.io/client/mc/release/linux-amd64/mcli_20210727064619.0.0_amd64.deb
$ wget -c https://dl.min.io/server/minio/release/linux-amd64/minio_20210805220119.0.0_amd64.deb
$ sudo dpkg -i mcli_20210727064619.0.0_amd64.deb
$ sudo dpkg -i minio_20210805220119.0.0_amd64.deb

I'll then launch a screen session and start the HDFS gateway service with "admin" as the root username and "password" as its password. MinIO's default TCP ports are 9000 and 9001. TCP port 9000 is already being used by HDFS so I'll move MinIO's to 9900 and 9901 respectively.

$ screen
$ MINIO_ROOT_USER=admin \
  MINIO_ROOT_PASSWORD=password \
  minio gateway hdfs hdfs://localhost:9000 \
  --console-address ":9901" \
  --address ":9900"

Type Ctrl-A and then Ctrl-D to detach the screen.
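
Before pointing any clients at the gateway, it can be worth checking that it is up. The sketch below uses Python's standard library against MinIO's documented liveness probe; I'm assuming the /minio/health/live endpoint behaves the same in gateway mode as it does for a regular MinIO server.

# Quick liveness check against the gateway on TCP port 9900.
from urllib.request import urlopen

with urlopen('http://127.0.0.1:9900/minio/health/live') as resp:
    print(resp.status)  # 200 when the gateway is serving requests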

I'll set up the CLI auto-completion and a "myminio" alias for the instance.

$ mcli --autocompletion
$ mcli alias set myminio http://127.0.0.1:9900 admin password

After the shell is restarted, pressing tab after "mcli<space>" will suggest completions, allowing commands to be typed out by hand with fewer keystrokes.

The following will create a new bucket and copy the hosts file into it.

$ mcli mb myminio/mynewbucket
$ mcli cp /etc/hosts myminio/mynewbucket/

I can then see the hosts file and read its contents.

$ mcli ls myminio/mynewbucket/
[2021-08-10 10:51:24 UTC]   224B hosts
$ mcli head -n 2 myminio/mynewbucket/hosts
127.0.0.1 localhost
127.0.1.1 localhost

The following gives me the amount of disk space being used by the bucket's contents.

$ mcli du myminio/mynewbucket
224B    mynewbucket

The contents of the bucket are also visible and readable via HDFS' client.

$ hdfs dfs -ls /mynewbucket/hosts
-rw-r--r--   1 mark supergroup        224 2021-08-10 10:51 /mynewbucket/hosts
$ hdfs dfs -cat /mynewbucket/hosts 2>/dev/null | head -n2
127.0.0.1 localhost
127.0.1.1 localhost

AWS CLI & Boto3 using MinIO

Both AWS CLI and Boto3 are written in Python. Below I'll install Python 3, set up a virtual environment and then install both packages.

$ sudo apt update
$ sudo apt install \
    python3-pip \
    python3-virtualenv
$ virtualenv ~/.venv
$ source ~/.venv/bin/activate
$ python3 -m pip install \
    awscli \
    boto3

I've opened http://127.0.0.1:9901/users, created a new user called "testing1" and generated access and secret keys. I've then opened http://127.0.0.1:9901/users/testing1, clicked "Policies" and assigned the "consoleAdmin" policy.

I'll then configure AWS CLI with these details.

$ aws configure

With the exception of the access and secret keys, I chose None for all other prompts.

Below I'll change the S3 signature version to one that is compatible with MinIO's S3 gateway.

$ aws configure set default.s3.signature_version s3v4
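
If the signature version needs to be set from Python rather than via the shared AWS config file, Boto3 accepts it per-client through botocore's Config object. Below is a minimal sketch of that; the access and secret keys are still sourced from ~/.aws/credentials.

# Equivalent of the CLI's s3v4 setting, but applied to a Boto3 client.
import boto3
from botocore.client import Config

s3_client = boto3.client(
    's3',
    endpoint_url='http://127.0.0.1:9900',
    config=Config(signature_version='s3v4'))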

I'm now able to interact with the contents of the MinIO bucket on HDFS using the AWS CLI.

$ aws --endpoint-url http://127.0.0.1:9900 s3 ls mynewbucket/hosts
2021-08-10 10:51:24        224 hosts

The following is an example of fetching the bucket names available on MinIO via Boto3. By default, ~/.aws/credentials will be sourced for authentication details.

$ python3
import boto3

session = boto3.session.Session()

s3_client = session.client(
    service_name='s3',
    endpoint_url='http://127.0.0.1:9900',
)
print([x['Name'] for x in s3_client.list_buckets()['Buckets']])
['mynewbucket']

The following fetches the hosts file and prints the first two lines of its content.

print('\n'.join(s3_client.get_object(
    Bucket='mynewbucket',
    Key='hosts')['Body'].read()\
                        .decode('utf-8')\
                        .splitlines()[:2]))
127.0.0.1 localhost
127.0.1.1 localhost
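
Writing back through the gateway works the same way. The following continues the same Python session and is a sketch of uploading a small object and listing the bucket's keys; the key name is arbitrary and I'm assuming the gateway accepts these core S3 calls just as it accepted the copy made via mcli earlier.

# Upload a small, arbitrary object through the gateway and list what the
# bucket now contains.
s3_client.put_object(Bucket='mynewbucket',
                     Key='readme.txt',
                     Body=b'stored on HDFS via the MinIO gateway\n')

print([x['Key'] for x in
       s3_client.list_objects_v2(Bucket='mynewbucket')['Contents']])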