The Fastest FizzBuzz Implementation

The following is a collection of notes and insights I came across while examining a FizzBuzz implementation written in Assembler for 64-bit Intel and AMD CPUs supporting AVX2.

FizzBuzz is a coding exercise where numbers from 1 to n are printed out. If a number meets certain criteria their output is replaced by either "Fizz" (when divisible by 3), "Buzz" (when divisible by 5) or "FizzBuzz" (divisible by both 3 and 5).

Below is an implementation of FizzBuzz in Python which will run from 1 till 15.

def fizz_buzz(x):
    if   x % 15 == 0: return 'fizzbuzz'
    elif x % 5  == 0: return 'buzz'
    elif x % 3  == 0: return 'fizz'
    else:             return str(x)

print('\n'.join([fizz_buzz(x)
       for x in range(1, 16)]))

Below is the output of the above application.

1
2
fizz
4
buzz
fizz
7
8
fizz
buzz
11
fizz
13
14
fizzbuzz

A year ago, Omer Tuchfeld started a coding contest to see who could write the fastest version of FizzBuzz on Stack Exchange's Code Golf site. Submissions are benchmarked on Omer's computer which has a 16-core, 32-thread AMD 5950x CPU running with a base clock of 3.4 GHz and a boost clock of upwards of 4.9 GHz, 8 MB of L2 cache and 32 GB of 3.6 GHz DDR4 RAM.

As of this writing, the 3rd fastest submission is written in Rust and produces out at a rate of 3 GB/s, 2nd is written in C and produces at a rate of 41 GB/s and the fastest is written in Assembler and produces at a rate of 56 GB/s.

The developer behind the Assembler version is Alex Smith, a doctoral researcher studying for a PhD in the School of Computer Science at the University of Birmingham in the UK. Alex is a reserve on the UK Olympic Maths Team and has a degree in electronic engineering.

In this post, I'll examine some of the optimisations found in the fastest FizzBuzz implementation to date.

Building FizzBuzz

The code Alex submitted is a 1,363-line file containing 581 lines of Assembler and 633 lines of comments. I couldn't find a GitHub repository maintained by Alex but I did uncover an unofficial mirror.

This implementation only supports Linux. The only five system calls made to the Linux API are exit_group, fcntl, madvise, vmsplice and write so even fairly old Kernels should be able to run this code without issue.

Before running the code, make sure your CPU does support AVX2. Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those found in newer Apple computers, Smartphones or Raspberry Pis, won't support AVX2.

$ cat /proc/cpuinfo \
    | grep ^flags \
    | sed 's/ /\n/g' \
    | tail -n+2 \
    | grep avx2

You should see "avx2" appear on its own.

avx2

The following was run on an Intel i5-4670K CPU clocked at 3.4 GHz running Ubuntu 20.04.2 LTS with version 5.4.0-89 of the Linux Kernel and 16 GB of RAM. The following will install dependencies used to fetch, compile and measure the performance of the code.

$ sudo apt update
$ sudo apt install \
    build-essential \
    git \
    pv

The following will clone the source from the unofficial mirror.

$ git clone https://github.com/orent/htfizzbuzz
$ cd htfizzbuzz

Rather than relying on less-portable built-in macro features found in most assemblers, GCC is used as a general-purpose macro processor. The following will turn the Assembler code into a 4,199,392-byte object file. Then, a linker will be used to transform that into a 12,424-byte 64-bit ELF binary.

$ gcc -mavx2 -c fizzbuzz.S
$ ld -o fizzbuzz fizzbuzz.o

The resulting ELF can be disassembled if you wish to compare the final output to the original code.

$ objdump -d fizzbuzz | less -S

Running FizzBuzz

The application's I/O only supports sending the output to a pipe. Sending the output to a file, the terminal or a device will cause the application to crash.

$ ./fizzbuzz

This program can only output to a pipe (try piping into `cat`?)

If you've run the above on a machine that doesn't support AVX2 then you should see the following.

Illegal instruction (core dumped)

Below are three examples of piping the output, all of which will run without issue on AVX2-capable CPUs.

$ ./fizzbuzz | pv > /dev/null
$ ./fizzbuzz | cat
$ ./fizzbuzz | head

This program may produce incorrect output if piped into two commands if the middle command uses the splice system call. Alex believes this is potentially due to a bug in the Linux Kernel.

The performance can be greatly improved if the CPU cores running this application and the application being piped to are siblings. The CPU cores applications use are typically determined arbitrarily by the Kernel but you can pin them to specific cores using taskset.

The following will run fizzbuzz on core 1 and run pv on core 2.

$ taskset 1 ./fizzbuzz | taskset 2 pv > /dev/null

The following will run pv on core 4 and it should run considerably slower than the above.

$ taskset 1 ./fizzbuzz | taskset 4 pv > /dev/null

The above optimisation lead to a 1.25x speed up during Omer's benchmark.

Analysis of Optimisations

The application uses three includes. The Linux API call code for F_SETPIPE_SZ is defined manually as it is a relatively more recent addition to the Kernel.

#include <asm/errno.h>
#include <asm/mman.h>
#include <asm/unistd.h>
#define F_SETPIPE_SZ 1031 // not in asm headers, define it manually

There are 37 CPU registers that are earmarked for defined purposes. Below are a few of them.

#define PIPE_SIZE %r13
#define LINENO_WIDTH %r14
#define LINENO_WIDTHe %r14d
#define GROUPS_OF_15 %r15
#define GROUPS_OF_15e %r15d
#define LINENO_LOW %ymm4
#define LINENO_MID %ymm5
#define LINENO_MIDx %xmm5
#define LINENO_TOP %ymm6
#define LINENO_TOPx %xmm6

System calls to Linux are done manually. The registers to use during a call and inspect its response are defined at the top of the file.

#define ARG1 %rdi
#define ARG1e %edi
#define ARG2 %rsi
#define ARG2e %esi
#define ARG3 %rdx
#define ARG3e %edx
#define ARG4 %r10
#define ARG4e %r10d
#define SYSCALL_RETURN %rax
#define SYSCALL_RETURNe %eax
#define SYSCALL_NUMBER %eax

And then those macros are used later on whenever a system call is needed. Below is a snippet where the Kernel is asked to resize the pipe of the standard output. There are also a few errors that the result is tested for.

mov ARG1e, 1
mov ARG2e, F_SETPIPE_SZ
mov ARG3e, %ecx
mov SYSCALL_NUMBER, __NR_fcntl
syscall
cmp SYSCALL_RETURNe, -EBADF
je pipe_error
cmp SYSCALL_RETURNe, -EPERM
je pipe_perm_error
call exit_on_error
cmp SYSCALL_RETURN, PIPE_SIZE
jne pipe_size_mismatch_error

The CPU's L2 cache is split into two halves, one for calculating the output of FizzBuzz, the other storing the result of the last calculation batch. This means there is no need to use any slower L3 memory operations.

mov %eax, 0x80000000 // asks which CPUID extended commands exist
cpuid                // returns the highest supported command in %eax
cmp %eax, 0x80000006 // does 0x80000006 give defined results?
jb bad_cpuid_error

mov %eax, 0x80000006 // asks about the L2 cache size
cpuid                // returns size in KiB in the top half of %ecx
shr %ecx, 16
jz bad_cpuid_error   // unsupported commands return all-0s

shl %ecx, 10 - 1     // convert KiB to bytes, then halve
mov PIPE_SIZE, %rcx

The application also asks the Kernel to defragment the physical memory it's using via the madvise system call. This is handy later on as it simplifies any calculation used to look up memory addresses. Again, Alex believes this leads to a 30% speed increase.

lea ARG1, [%rip + io_buffers]
mov ARG2e, 3 * (2 << 20)
mov ARG3e, MADV_HUGEPAGE
mov SYSCALL_NUMBER, __NR_madvise
syscall

This application has two global variables used for I/O buffers. They use 2 MB of RAM each even though neither need 2 MB. This keeps the page table orderly and avoids contention. Alex stated this also sped up the code by another 30%.

io_buffers:
.zero 2 * (2 << 20)

iovec_base:          // I/O config buffer for vmsplice(2) system call
.zero 16
error_write_buffer:  // I/O data buffer for write(2) system call
.zero 1
.p2align 9,0
bytecode_storage:    // the rest is a buffer for storing bytecode
.zero (2 << 20) - 512

Most competing FizzBuzz implementations call the write system call which can lead to data needing to be copied between user and Kernel space. So vmsplice is used instead. It tells the Kernel to place a reference to a buffer into a pipe. The pipped application can then read directly out of the memory written to by FizzBuzz.

swap_buffers:
and OUTPUT_PTR, -(2 << 20)  // rewind to the start of the buffer
mov [%rip + iovec_base], OUTPUT_PTR
mov [%rip + iovec_base + 8], %rdx
mov ARG1e, 1
lea ARG2, [%rip + iovec_base]
mov ARG3e, 1
xor ARG4e, ARG4e

1: mov SYSCALL_NUMBER, __NR_vmsplice
syscall

The application produces 1,000,000,000,000,000,000 lines of output. There is a bytecode generator that produces batches of 600 lines at a time using SIMD instructions. Each 32 bytes of bytecode can be interpreted and have their output stored with just 4 CPU instructions. Some calculations have also been hard-coded into the bytecode in a way not dissimilar to how JIT compilers operate.

During each 600-line generation, an approximation of the line number is also produced. The lines number is corrected after every 512 bytes of output has been produced.

The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. Alex states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache.

Alex did attempt to build a multi-threaded version but was unable to find any performance improvement over the single-threaded version. The reason stated was that the cost of communication between CPU cores is sufficiently high enough that any gains from parallel computation are ultimately negated.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.