Posted on Thu 04 August 2016

TensorFlow on a GTX 1080

In November 2015, Google open sourced TensorFlow, a Deep Learning library that succeeded DistBelief, the internal Deepnet software developed for the Google Brain project. Deep Learning could very well drive a lot of product development over the next few decades. When software is able to detect and classify objects and perform semantic and syntactic analogies, things like translating signs with your phone's camera become possible.

TensorFlow supports implementing a wide variety of Deep Learning algorithms including:

Though you can train Deep Learning models using CPUs with TensorFlow, the large matrices being manipulated are what GPUs have been optimised for. Games use matrices intensively for all sorts of 3D functionality and GPUs are specifically designed to handle matrix manipulation orders of magnitude faster than regular CPUs.

TensorFlow has a Python-based API. The Python you write builds data and control structures which are then handed off to either the CPU or GPU for execution. The great thing about this is you get an easy-to-learn, easy-to-read language to work in without the slow execution times you'd expect from Python on such computationally intensive tasks. And unless you target a specific device on your machine, the code you write is device-agnostic; there aren't any CUDA-specific calls or Intel-only endpoints to worry about. TensorFlow automatically prioritises the GPU for operations where it makes sense performance-wise.
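To make the "build first, run later" idea concrete, here is a toy sketch in plain Python. It is not TensorFlow's API: the Node class and the constant, add, multiply and run helpers are hypothetical names invented for this illustration. Building the expression only records what should be computed; nothing is evaluated until run is called, which is the point where a real library could pick a CPU or GPU implementation.

```python
# Toy illustration of deferred execution. This is NOT TensorFlow's API;
# Node, constant, add, multiply and run are made-up names.

class Node(object):
    """One operation in the graph, plus the nodes that feed it."""
    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value


def constant(value):
    return Node('const', value=value)


def add(a, b):
    return Node('add', (a, b))


def multiply(a, b):
    return Node('mul', (a, b))


def run(node):
    """Walk the graph and only now do the arithmetic."""
    if node.op == 'const':
        return node.value
    left, right = [run(n) for n in node.inputs]
    if node.op == 'add':
        return left + right
    if node.op == 'mul':
        return left * right
    raise ValueError('unknown op: %s' % node.op)


# Building the graph performs no arithmetic...
graph = multiply(add(constant(2), constant(3)), constant(4))

# ...the work happens when the graph is handed to the executor.
result = run(graph)
print(result)  # 20
```

The script later in this post follows the same shape: the model, cost and optimiser are all declared up front and only evaluated inside the session.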

In this blog post I will build "Deep Fizz buzz" by training a Deepnet to try and give the correct answers for the first 100 values of Fizz buzz.

TensorFlow does support training models across clusters of machines but for this exercise I'll be using a single PC. I will train the Deepnet using an Nvidia GTX 1080. The GTX 1080 replaced my Radeon HD 7870 after I found TensorFlow has yet to support OpenCL and has a dependency on Nvidia's CUDA platform for any GPU-based training. The GTX 1080 draws up to 180 Watts of power compared to the 175 Watts the HD 7870 draws and they're both the same physical size so replacement was easy.

TensorFlow, Up & Running

A special thanks to Sai Soundararaj for the excellent installation notes he's put together and to everyone commenting in the GitHub issues for TensorFlow who has been kind enough to debug and share a set of version numbers for all the software that works well together.

The following was run on a fresh Ubuntu 16.04.1 LTS installation. Normally I use Ubuntu 14.04.3 LTS, as I usually run everything in a virtual machine, but I needed my code to be able to speak to the GPU directly. Neither VMware Workstation nor VirtualBox showed any promise at getting GPU passthrough to work properly, so I ended up having to install Ubuntu directly on my machine. Ubuntu 14 didn't like the looks of my USB devices and didn't want to play ball but Ubuntu 16 installed nicely.

I'll first install a few dependencies to support TensorFlow's Python-based environment and a whole stack of Nvidia software.

$ sudo apt-get update
$ sudo apt-get install \
    freeglut3-dev \
    g++-4.9 \
    gcc-4.9 \
    libglu1-mesa-dev \
    libx11-dev \
    libxi-dev \
    libxmu-dev \
    nvidia-modprobe \
    python-dev \
    python-pip \
    python-virtualenv

I'll install a version of Nvidia's drivers which is known to play well with TensorFlow and its dependencies.

$ sudo apt-get purge nvidia-*
$ sudo add-apt-repository ppa:graphics-drivers/ppa
$ sudo apt-get update
$ sudo apt-get install nvidia-367

With the driver and its dependencies installed I'll reboot the system.

$ sudo reboot

Once the machine is back up you can see version 367.35 is now installed:

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.35  Mon Jul 11 23:14:21 PDT 2016
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.1)
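If you'd rather check the driver version from a script than by eye, something along these lines should do. It's a minimal sketch: nvidia_driver_version is a hypothetical helper name, and the regular expression assumes the NVRM line keeps the layout shown above. The 352.00 minimum comes from the warning the CUDA 7.5 installer prints later in this post.

```python
import re

# Sample text copied from the output above. On a live system you could
# read /proc/driver/nvidia/version instead.
SAMPLE = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.35  "
    "Mon Jul 11 23:14:21 PDT 2016\n"
    "GCC version:  gcc version 5.4.0 20160609 "
    "(Ubuntu 5.4.0-6ubuntu1~16.04.1)\n"
)


def nvidia_driver_version(text):
    """Return the driver version as a (major, minor) tuple, or None."""
    match = re.search(r'Kernel Module\s+(\d+)\.(\d+)', text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


version = nvidia_driver_version(SAMPLE)

# The CUDA 7.5 installer asks for driver 352.00 or newer.
meets_minimum = version is not None and version >= (352, 0)

print(version)        # (367, 35)
print(meets_minimum)  # True
```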

The following is the output from Nvidia's system management interface showing various diagnostics from my GTX 1080.

$ sudo nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 25%   50C    P2    38W / 200W |     55MiB /  8112MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3084    G   /usr/lib/xorg/Xorg                              53MiB |
+-----------------------------------------------------------------------------+

I'll set GCC 4.9 to be the default version being used on this system. The CUDA toolkit complains about this version while installing but I found it works nonetheless.

$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 10
$ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 20

$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 10
$ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.9 20

Next I'll download the 64-bit version of the CUDA 7.5 platform distribution for Ubuntu 14 (even though this is running on Ubuntu 16). Version 8 is the latest version but isn't supported by TensorFlow yet. If you need a different version please see Nvidia's CUDA downloads page.

There will be a complaint about using GCC 4.9 so I've added an --override flag to get around this.

$ wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
$ sudo sh cuda_7.5.18_linux.run --override
Do you accept the previously read EULA? (accept/decline/quit): accept
You are attempting to install on an unsupported configuration. Do you wish to continue? ((y)es/(n)o) [ default is no ]: yes
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39? ((y)es/(n)o/(q)uit): no
Install the CUDA 7.5 Toolkit? ((y)es/(n)o/(q)uit): yes
Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]:
Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): yes
Install the CUDA 7.5 Samples? ((y)es/(n)o/(q)uit): no
Installing the CUDA Toolkit in /usr/local/cuda-7.5 ...

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-7.5
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-7.5/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-7.5/lib64, or, add /usr/local/cuda-7.5/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-7.5/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-7.5/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 352.00 is required for CUDA 7.5 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_14557.log

I'll then add the environment variables for the CUDA platform to my .bashrc file.

$ echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
$ source ~/.bashrc
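A quick way to confirm those two exports took effect is to look for the CUDA directories in the environment. The sketch below assumes the /usr/local/cuda symlink created by the installer; missing_cuda_paths is a hypothetical helper name, not part of any Nvidia tooling.

```python
import os


def missing_cuda_paths(environ, cuda_home='/usr/local/cuda'):
    """Return the names of variables that lack the expected CUDA entry."""
    expected = {
        'PATH': cuda_home + '/bin',
        'LD_LIBRARY_PATH': cuda_home + '/lib64',
    }
    missing = []
    for name in sorted(expected):
        # Both variables are colon-separated lists of directories.
        entries = environ.get(name, '').split(':')
        if expected[name] not in entries:
            missing.append(name)
    return missing


# An empty list means both variables contain the CUDA directories.
print(missing_cuda_paths(os.environ))
```

Run it from a new terminal as well, since source ~/.bashrc only affects the shell it was typed into.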

I can now see the CUDA compiler is installed properly.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

I'll then download the CUDA Deep Neural Network library. This library contains a set of primitives used by Deepnets. This library is available to members of the Accelerated Computing Developer Program so once you've joined you'll be able to download version 4.0.7.

$ wget .../cudnn-70-linux-x64-v40
$ tar xvf cudnn-7.0-linux-x64-v4.0-prod.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
$ sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

Finally I'll install the binary for release candidate 0 of TensorFlow 0.10.0. Using the binary release instead of compiling from source saves needing to find and install specific versions of Bazel (0.2.3) and Protobuf (3.0.0b2) and avoids any woes with GCC.

$ virtualenv tf_gpu
$ source tf_gpu/bin/activate
$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl
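The wheel's file name says which interpreter it was built for. Wheel names follow the PEP 427 layout of distribution, version, Python tag, ABI tag and platform tag, and the sketch below splits them apart. wheel_tags is a hypothetical helper name, and it deliberately ignores the optional build tag that some wheels carry.

```python
def wheel_tags(url):
    """Split a wheel file name into its PEP 427 components."""
    filename = url.rsplit('/', 1)[-1]
    if not filename.endswith('.whl'):
        raise ValueError('not a wheel: %s' % filename)
    parts = filename[:-len('.whl')].split('-')
    if len(parts) != 5:
        # Names with an optional build tag have six parts; not handled here.
        raise ValueError('unexpected wheel name: %s' % filename)
    keys = ('distribution', 'version', 'python', 'abi', 'platform')
    return dict(zip(keys, parts))


tags = wheel_tags('https://storage.googleapis.com/tensorflow/linux/gpu/'
                  'tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl')

print(tags['version'])   # 0.10.0rc0
print(tags['python'])    # cp27
print(tags['platform'])  # linux_x86_64
```

The cp27 tag means CPython 2.7, so the virtualenv needs to be a Python 2.7 one. That also matches the print statements used in the script below.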

Deep Fizz Buzz

I'll train a model which Joel Grus has released into the public domain after he used it, unsuccessfully, to try and land a job. I've added in telemetry collection for analysis in TensorBoard after the model has been trained. The telemetry is being stored in the /tmp/train folder.

$ vi deep_fizz_buzz.py

import numpy as np
import tensorflow as tf

NUM_DIGITS = 10

# Represent each input by an array of its binary digits.
def binary_encode(i, num_digits):
    return np.array([i >> d & 1 for d in range(num_digits)])

# One-hot encode the desired outputs: [number, "fizz", "buzz", "fizzbuzz"]
def fizz_buzz_encode(i):
    if   i % 15 == 0: return np.array([0, 0, 0, 1])
    elif i % 5  == 0: return np.array([0, 0, 1, 0])
    elif i % 3  == 0: return np.array([0, 1, 0, 0])
    else:             return np.array([1, 0, 0, 0])

# Our goal is to produce fizzbuzz for the numbers 1 to 100. So it would be
# unfair to include these in our training data. Accordingly, the training data
# corresponds to the numbers 101 to (2 ** NUM_DIGITS - 1).
trX = np.array([binary_encode(i, NUM_DIGITS) for i in range(101, 2 ** NUM_DIGITS)])
trY = np.array([fizz_buzz_encode(i)          for i in range(101, 2 ** NUM_DIGITS)])

# We'll want to randomly initialize weights.
def init_weights(shape):
    return tf.Variable(tf.random_normal(shape, stddev=0.01))

# Our model is a standard 1-hidden-layer multi-layer-perceptron with ReLU
# activation. The softmax (which turns arbitrary real-valued outputs into
# probabilities) gets applied in the cost function.
def model(X, w_h, w_o):
    h = tf.nn.relu(tf.matmul(X, w_h))
    return tf.matmul(h, w_o)

# Our variables. The input has width NUM_DIGITS, and the output has width 4.
X = tf.placeholder("float", [None, NUM_DIGITS])
Y = tf.placeholder("float", [None, 4])

# How many units in the hidden layer.
NUM_HIDDEN = 100

# Initialize the weights.
w_h = init_weights([NUM_DIGITS, NUM_HIDDEN])
w_o = init_weights([NUM_HIDDEN, 4])

# Predict y given x using the model.
py_x = model(X, w_h, w_o)

# We'll train our model by minimizing a cost function.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(py_x, Y))
train_op = tf.train.GradientDescentOptimizer(0.05).minimize(cost)

# And we'll make predictions by choosing the largest output.
predict_op = tf.argmax(py_x, 1)

# Finally, we need a way to turn a prediction (and an original number)
# into a fizz buzz output
def fizz_buzz(i, prediction):
    return [str(i), "fizz", "buzz", "fizzbuzz"][prediction]

BATCH_SIZE = 128

# Launch the graph in a session
with tf.Session() as sess:
    tf.initialize_all_variables().run()
    train_writer = tf.train.SummaryWriter('/tmp/train', sess.graph)
    merged_summary_op = tf.merge_all_summaries()

    for epoch in range(10000):
        # Shuffle the data before each training iteration.
        p = np.random.permutation(range(len(trX)))
        trX, trY = trX[p], trY[p]

        # Train in batches of 128 inputs.
        for iteration, start in enumerate(range(0, len(trX), BATCH_SIZE)):
            end = start + BATCH_SIZE
            feed_dict = {
                X: trX[start:end],
                Y: trY[start:end]
            }
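            # train_op returns nothing useful, so its result isn't kept.
            sess.run(train_op, feed_dict)

            # Note: with 923 training rows and batches of 128 there are only
            # 8 iterations per epoch, so this branch never fires as written,
            # and no summary ops are defined for merge_all_summaries() to
            # collect. Only the graph itself is written to /tmp/train.
            if iteration and not iteration % 32:
                train_writer.add_summary(sess.run(merged_summary_op),
                                         iteration)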

        # And print the current accuracy on the training data.
        if not epoch % 250:
            feed_dict = {X: trX, Y: trY}
            print epoch, np.mean(np.argmax(trY, axis=1) ==
                                 sess.run(predict_op, feed_dict))


    # And now for some fizz buzz
    numbers = np.arange(1, 101)
    teX = np.transpose(binary_encode(numbers, NUM_DIGITS))
    teY = sess.run(predict_op, feed_dict={X: teX})
    output = np.vectorize(fizz_buzz)(numbers, teY)

    print output

$ python deep_fizz_buzz.py

The above finished in a few minutes.
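For anyone who wants to poke at the input encoding without TensorFlow or NumPy installed, here is a plain-Python sketch of the same idea. binary_digits and label_index are hypothetical names that mirror binary_encode and fizz_buzz_encode from the script; they return a list and an index rather than NumPy arrays.

```python
NUM_DIGITS = 10
BATCH_SIZE = 128


def binary_digits(i, num_digits=NUM_DIGITS):
    """Plain-list version of binary_encode: least significant bit first."""
    return [i >> d & 1 for d in range(num_digits)]


def label_index(i):
    """Position of the 1 in the one-hot vector from fizz_buzz_encode."""
    if i % 15 == 0:
        return 3  # fizzbuzz
    if i % 5 == 0:
        return 2  # buzz
    if i % 3 == 0:
        return 1  # fizz
    return 0      # the number itself


# The training set covers 101 up to 2 ** NUM_DIGITS - 1.
training_numbers = range(101, 2 ** NUM_DIGITS)

# Ceiling division: the last batch of an epoch is a partial one.
batches_per_epoch = -(-len(training_numbers) // BATCH_SIZE)

print(binary_digits(5))       # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(label_index(15))        # 3
print(len(training_numbers))  # 923
print(batches_per_epoch)      # 8
```

So each of the 10,000 epochs is only eight small matrix multiplications' worth of training, which is a tiny workload for a card like the GTX 1080.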

While training this model you can see the CPU usage isn't completely maxed out.

$ top
top - 22:52:18 up 6 min,  2 users,  load average: 10,12, 5,85, 2,76
Tasks: 159 total,  10 running, 145 sleeping,   0 stopped,   4 zombie
%Cpu0  : 62,7 us,  5,9 sy,  0,0 ni, 31,4 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  : 56,1 us,  7,1 sy,  0,0 ni, 30,6 id,  0,0 wa,  0,0 hi,  6,1 si,  0,0 st
%Cpu2  : 60,8 us,  6,2 sy,  0,0 ni, 33,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  0,0 us,100,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem : 32831072 total, 30551628 free,   808544 used,  1470900 buff/cache
KiB Swap: 33435644 total, 33435644 free,        0 used. 31611888 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3496 mark      20   0 21,876g 715020 203328 S 204,0  2,2   1:37.98 python deep_fizz_buzz.py
 3084 root      20   0  181416  44096  29960 R  99,0  0,1   5:32.06 /usr/lib/xorg/Xorg vt7 -displayfd 3 -auth /var/lib/gdm3/.cache/gdm+
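To put a number on "isn't completely maxed out", the idle column of each %Cpu line can be turned into a busy percentage. This is a rough sketch over the four lines pasted from the output above; busy_percentages is a hypothetical helper name, and it accepts either a decimal comma (as this locale prints) or a decimal point.

```python
import re

# The four %Cpu lines from the top output above.
TOP_LINES = """\
%Cpu0  : 62,7 us,  5,9 sy,  0,0 ni, 31,4 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  : 56,1 us,  7,1 sy,  0,0 ni, 30,6 id,  0,0 wa,  0,0 hi,  6,1 si,  0,0 st
%Cpu2  : 60,8 us,  6,2 sy,  0,0 ni, 33,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  0,0 us,100,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
"""


def busy_percentages(text):
    """Return 100 minus the idle figure for each %Cpu line."""
    idle = re.findall(r'(\d+)[.,](\d+) id', text)
    return [round(100.0 - float(whole + '.' + frac), 1)
            for whole, frac in idle]


busy = busy_percentages(TOP_LINES)
average = sum(busy) / len(busy)

print(busy)
print('average busy: %.2f%%' % average)
```

Three of the four cores sit at a little under 70% busy while the fourth is pegged in system time, giving an average of roughly 76% across the machine.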

TensorFlow will allocate almost all of the GPU's memory in an attempt to reduce the effects of memory fragmentation. Below you can see Nvidia's system management interface showing almost all the memory is in use.

$ sudo nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
| 23%   50C    P2    44W / 200W |   7762MiB /  8112MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3084    G   /usr/lib/xorg/Xorg                              53MiB |
|    0      3496    C   python                                        7705MiB |
+-----------------------------------------------------------------------------+
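As a worked figure for "almost all", the memory column from the status row above can be parsed and turned into a percentage. This is a small sketch; memory_usage is a hypothetical helper name and the regular expression assumes the "used MiB / total MiB" layout shown in this nvidia-smi output.

```python
import re

# The status row from the nvidia-smi output above.
SMI_LINE = ("| 23%   50C    P2    44W / 200W |   7762MiB /  8112MiB |"
            "      0%      Default |")


def memory_usage(line):
    """Return (used_mib, total_mib) parsed from an nvidia-smi status row."""
    match = re.search(r'(\d+)MiB\s*/\s*(\d+)MiB', line)
    if match is None:
        raise ValueError('no memory figures found')
    return int(match.group(1)), int(match.group(2))


used, total = memory_usage(SMI_LINE)
fraction = used / float(total)

print('%d of %d MiB in use (%.1f%%)' % (used, total, 100 * fraction))
# 7762 of 8112 MiB in use (95.7%)
```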

Here's the final output:

['1' '2' 'fizz' '4' 'buzz' 'fizz' '7' '8' 'fizz' '10' '11' '12' '13' '14'
 'fizzbuzz' '16' '17' 'fizz' '19' 'buzz' 'fizz' '22' '23' 'fizz' 'buzz'
 '26' 'fizz' '28' '29' 'fizzbuzz' '31' '32' 'fizz' 'fizz' 'buzz' '36' '37'
 '38' 'fizz' '40' '41' 'fizz' '43' '44' 'fizzbuzz' '46' '47' 'fizz' '49'
 'buzz' 'fizz' '52' '53' 'fizz' 'buzz' '56' '57' '58' '59' 'fizzbuzz' '61'
 '62' 'fizz' '64' 'buzz' 'fizz' '67' '68' 'fizz' 'buzz' '71' '72' '73' '74'
 'fizzbuzz' '76' '77' 'fizz' '79' 'buzz' 'fizz' '82' '83' 'fizz' 'buzz'
 '86' 'fizz' '88' '89' 'fizzbuzz' '91' '92' '93' '94' 'buzz' 'fizz' '97'
 '98' 'fizz' '100']

I'll populate a variable called generated with that list and then check it against a list produced by a conventional, deterministic implementation.

def fizz_buzz(x):
    if   x % 15 == 0: return 'fizzbuzz'
    elif x % 5  == 0: return 'buzz'
    elif x % 3  == 0: return 'fizz'
    else:             return str(x)

correct = [fizz_buzz(x) for x in range(1, 101)]

In [19]: sum([val == generated[count]
              for count, val in enumerate(correct)])
Out[19]: 91

91 correct out of a possible 100.
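One way to populate generated, and to see which numbers the model got wrong, is to paste the printed array into a string and pull out the quoted values. The sketch below is self-contained and repeats the reference fizz_buzz function from above; the PRINTED and wrong names are just ones picked for this example.

```python
import re

# The array printed by deep_fizz_buzz.py, pasted verbatim.
PRINTED = """
['1' '2' 'fizz' '4' 'buzz' 'fizz' '7' '8' 'fizz' '10' '11' '12' '13' '14'
 'fizzbuzz' '16' '17' 'fizz' '19' 'buzz' 'fizz' '22' '23' 'fizz' 'buzz'
 '26' 'fizz' '28' '29' 'fizzbuzz' '31' '32' 'fizz' 'fizz' 'buzz' '36' '37'
 '38' 'fizz' '40' '41' 'fizz' '43' '44' 'fizzbuzz' '46' '47' 'fizz' '49'
 'buzz' 'fizz' '52' '53' 'fizz' 'buzz' '56' '57' '58' '59' 'fizzbuzz' '61'
 '62' 'fizz' '64' 'buzz' 'fizz' '67' '68' 'fizz' 'buzz' '71' '72' '73' '74'
 'fizzbuzz' '76' '77' 'fizz' '79' 'buzz' 'fizz' '82' '83' 'fizz' 'buzz'
 '86' 'fizz' '88' '89' 'fizzbuzz' '91' '92' '93' '94' 'buzz' 'fizz' '97'
 '98' 'fizz' '100']
"""

# Every value in the printed array is wrapped in single quotes.
generated = re.findall(r"'([^']*)'", PRINTED)


def fizz_buzz(x):
    if x % 15 == 0:
        return 'fizzbuzz'
    elif x % 5 == 0:
        return 'buzz'
    elif x % 3 == 0:
        return 'fizz'
    return str(x)


correct = [fizz_buzz(x) for x in range(1, 101)]

# Numbers (1-based) where the model's answer differs from the reference.
wrong = [n for n, (want, got) in enumerate(zip(correct, generated), start=1)
         if want != got]

print('%d correct out of %d' % (len(correct) - len(wrong), len(correct)))
print(wrong)  # [10, 12, 34, 36, 40, 57, 72, 93, 100]
```

Eight of the nine mistakes are a missed fizz or buzz printed as a plain number; the odd one out is 34, which the model labelled fizz.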

TensorBoard

TensorFlow ships with a web application that will visualise the telemetry produced during the model training. To launch TensorBoard and bring up the graphs page, run the following:

$ tensorboard --logdir=/tmp/train &
$ xdg-open 'http://127.0.0.1:6006/#graphs'
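If you'd rather not depend on a particular desktop's opener command, Python's standard webbrowser module can ask the default browser to load the same page. This is a small sketch: tensorboard_url and open_graphs_page are hypothetical helper names, and the host, port and tab defaults are simply the ones used in the address above.

```python
import webbrowser


def tensorboard_url(host='127.0.0.1', port=6006, tab='graphs'):
    """Build the address of a TensorBoard tab."""
    return 'http://%s:%d/#%s' % (host, port, tab)


def open_graphs_page():
    """Ask the default browser to show the graphs tab."""
    return webbrowser.open(tensorboard_url())


print(tensorboard_url())  # http://127.0.0.1:6006/#graphs
```

Calling open_graphs_page() once TensorBoard is running should bring up the graph view.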

Conclusion

Getting all the dependencies in place took a lot of research but now that they're installed I can focus on learning how to improve the accuracy of this model and to learn more about Deep Learning in general.

Copyright © 2014 - 2017 Mark Litwintschik. This site's template is based off a template by Giulio Fidente.