The following is a collection of notes and insights I came across while examining a FizzBuzz implementation written in Assembler for 64-bit Intel and AMD CPUs supporting AVX2.
FizzBuzz is a coding exercise where numbers from 1 to n are printed out. If a number meets certain criteria their output is replaced by either "Fizz" (when divisible by 3), "Buzz" (when divisible by 5) or "FizzBuzz" (divisible by both 3 and 5).
Below is an implementation of FizzBuzz in Python which will run from 1 till 15.
def fizz_buzz(x): if x % 15 == 0: return 'fizzbuzz' elif x % 5 == 0: return 'buzz' elif x % 3 == 0: return 'fizz' else: return str(x) print('\n'.join([fizz_buzz(x) for x in range(1, 16)]))
Below is the output of the above application.
1 2 fizz 4 buzz fizz 7 8 fizz buzz 11 fizz 13 14 fizzbuzz
A year ago, Omer Tuchfeld started a coding contest to see who could write the fastest version of FizzBuzz on Stack Exchange's Code Golf site. Submissions are benchmarked on Omer's computer which has a 16-core, 32-thread AMD 5950x CPU running with a base clock of 3.4 GHz and a boost clock of upwards of 4.9 GHz, 8 MB of L2 cache and 32 GB of 3.6 GHz DDR4 RAM.
As of this writing, the 3rd fastest submission is written in Rust and produces out at a rate of 3 GB/s, 2nd is written in C and produces at a rate of 41 GB/s and the fastest is written in Assembler and produces at a rate of 56 GB/s.
The developer behind the Assembler version is Alex Smith, a doctoral researcher studying for a PhD in the School of Computer Science at the University of Birmingham in the UK. Alex is a reserve on the UK Olympic Maths Team and has a degree in electronic engineering.
In this post, I'll examine some of the optimisations found in the fastest FizzBuzz implementation to date.
The code Alex submitted is a 1,363-line file containing 581 lines of Assembler and 633 lines of comments. I couldn't find a GitHub repository maintained by Alex but I did uncover an unofficial mirror.
This implementation only supports Linux. The only five system calls made to the Linux API are exit_group, fcntl, madvise, vmsplice and write so even fairly old Kernels should be able to run this code without issue.
Before running the code, make sure your CPU does support AVX2. Most 64-bit Intel and AMD CPUs should. ARM CPUs, like those found in newer Apple computers, Smartphones or Raspberry Pis, won't support AVX2.
$ cat /proc/cpuinfo \ | grep ^flags \ | sed 's/ /\n/g' \ | tail -n+2 \ | grep avx2
You should see "avx2" appear on its own.
The following was run on an Intel i5-4670K CPU clocked at 3.4 GHz running Ubuntu 20.04.2 LTS with version 5.4.0-89 of the Linux Kernel and 16 GB of RAM. The following will install dependencies used to fetch, compile and measure the performance of the code.
$ sudo apt update $ sudo apt install \ build-essential \ git \ pv
The following will clone the source from the unofficial mirror.
$ git clone https://github.com/orent/htfizzbuzz $ cd htfizzbuzz
Rather than relying on less-portable built-in macro features found in most assemblers, GCC is used as a general-purpose macro processor. The following will turn the Assembler code into a 4,199,392-byte object file. Then, a linker will be used to transform that into a 12,424-byte 64-bit ELF binary.
$ gcc -mavx2 -c fizzbuzz.S $ ld -o fizzbuzz fizzbuzz.o
The resulting ELF can be disassembled if you wish to compare the final output to the original code.
$ objdump -d fizzbuzz | less -S
The application's I/O only supports sending the output to a pipe. Sending the output to a file, the terminal or a device will cause the application to crash.
This program can only output to a pipe (try piping into `cat`?)
If you've run the above on a machine that doesn't support AVX2 then you should see the following.
Illegal instruction (core dumped)
Below are three examples of piping the output, all of which will run without issue on AVX2-capable CPUs.
$ ./fizzbuzz | pv > /dev/null $ ./fizzbuzz | cat $ ./fizzbuzz | head
This program may produce incorrect output if piped into two commands if the middle command uses the splice system call. Alex believes this is potentially due to a bug in the Linux Kernel.
The performance can be greatly improved if the CPU cores running this application and the application being piped to are siblings. The CPU cores applications use are typically determined arbitrarily by the Kernel but you can pin them to specific cores using taskset.
The following will run fizzbuzz on core 1 and run pv on core 2.
$ taskset 1 ./fizzbuzz | taskset 2 pv > /dev/null
The following will run pv on core 4 and it should run considerably slower than the above.
$ taskset 1 ./fizzbuzz | taskset 4 pv > /dev/null
The above optimisation lead to a 1.25x speed up during Omer's benchmark.
Analysis of Optimisations
The application uses three includes. The Linux API call code for F_SETPIPE_SZ is defined manually as it is a relatively more recent addition to the Kernel.
#include <asm/errno.h> #include <asm/mman.h> #include <asm/unistd.h> #define F_SETPIPE_SZ 1031 // not in asm headers, define it manually
There are 37 CPU registers that are earmarked for defined purposes. Below are a few of them.
#define PIPE_SIZE %r13 #define LINENO_WIDTH %r14 #define LINENO_WIDTHe %r14d #define GROUPS_OF_15 %r15 #define GROUPS_OF_15e %r15d #define LINENO_LOW %ymm4 #define LINENO_MID %ymm5 #define LINENO_MIDx %xmm5 #define LINENO_TOP %ymm6 #define LINENO_TOPx %xmm6
System calls to Linux are done manually. The registers to use during a call and inspect its response are defined at the top of the file.
#define ARG1 %rdi #define ARG1e %edi #define ARG2 %rsi #define ARG2e %esi #define ARG3 %rdx #define ARG3e %edx #define ARG4 %r10 #define ARG4e %r10d #define SYSCALL_RETURN %rax #define SYSCALL_RETURNe %eax #define SYSCALL_NUMBER %eax
And then those macros are used later on whenever a system call is needed. Below is a snippet where the Kernel is asked to resize the pipe of the standard output. There are also a few errors that the result is tested for.
mov ARG1e, 1 mov ARG2e, F_SETPIPE_SZ mov ARG3e, %ecx mov SYSCALL_NUMBER, __NR_fcntl syscall cmp SYSCALL_RETURNe, -EBADF je pipe_error cmp SYSCALL_RETURNe, -EPERM je pipe_perm_error call exit_on_error cmp SYSCALL_RETURN, PIPE_SIZE jne pipe_size_mismatch_error
The CPU's L2 cache is split into two halves, one for calculating the output of FizzBuzz, the other storing the result of the last calculation batch. This means there is no need to use any slower L3 memory operations.
mov %eax, 0x80000000 // asks which CPUID extended commands exist cpuid // returns the highest supported command in %eax cmp %eax, 0x80000006 // does 0x80000006 give defined results? jb bad_cpuid_error mov %eax, 0x80000006 // asks about the L2 cache size cpuid // returns size in KiB in the top half of %ecx shr %ecx, 16 jz bad_cpuid_error // unsupported commands return all-0s shl %ecx, 10 - 1 // convert KiB to bytes, then halve mov PIPE_SIZE, %rcx
The application also asks the Kernel to defragment the physical memory it's using via the madvise system call. This is handy later on as it simplifies any calculation used to look up memory addresses. Again, Alex believes this leads to a 30% speed increase.
lea ARG1, [%rip + io_buffers] mov ARG2e, 3 * (2 << 20) mov ARG3e, MADV_HUGEPAGE mov SYSCALL_NUMBER, __NR_madvise syscall
This application has two global variables used for I/O buffers. They use 2 MB of RAM each even though neither need 2 MB. This keeps the page table orderly and avoids contention. Alex stated this also sped up the code by another 30%.
io_buffers: .zero 2 * (2 << 20) iovec_base: // I/O config buffer for vmsplice(2) system call .zero 16 error_write_buffer: // I/O data buffer for write(2) system call .zero 1 .p2align 9,0 bytecode_storage: // the rest is a buffer for storing bytecode .zero (2 << 20) - 512
Most competing FizzBuzz implementations call the write system call which can lead to data needing to be copied between user and Kernel space. So vmsplice is used instead. It tells the Kernel to place a reference to a buffer into a pipe. The pipped application can then read directly out of the memory written to by FizzBuzz.
swap_buffers: and OUTPUT_PTR, -(2 << 20) // rewind to the start of the buffer mov [%rip + iovec_base], OUTPUT_PTR mov [%rip + iovec_base + 8], %rdx mov ARG1e, 1 lea ARG2, [%rip + iovec_base] mov ARG3e, 1 xor ARG4e, ARG4e 1: mov SYSCALL_NUMBER, __NR_vmsplice syscall
The application produces 1,000,000,000,000,000,000 lines of output. There is a bytecode generator that produces batches of 600 lines at a time using SIMD instructions. Each 32 bytes of bytecode can be interpreted and have their output stored with just 4 CPU instructions. Some calculations have also been hard-coded into the bytecode in a way not dissimilar to how JIT compilers operate.
During each 600-line generation, an approximation of the line number is also produced. The lines number is corrected after every 512 bytes of output has been produced.
The application produced 64 bytes of FizzBuzz for every 4 CPU clock cycles. Alex states the ultimate bottleneck of performance is based on the throughput of the CPU's L2 cache.
Alex did attempt to build a multi-threaded version but was unable to find any performance improvement over the single-threaded version. The reason stated was that the cost of communication between CPU cores is sufficiently high enough that any gains from parallel computation are ultimately negated.