Faster ZIP Decompression

Having a wide variety of compression formats means you can pick between various trade-offs. Unfortunately, often when presented with a data feed those decisions have already been made on your behalf.

Though the popular ZIP archive file format supports multiple lossless data compression schemes, DEFLATE remains almost ubiquitous. DEFLATE was defined by Phil Katz in 1989 and is a derivative of the LZ77 algorithm originally published twelve years prior to that. In addition to ZIP files, DEFLATE has found use in GZIP and PNG files as well as the HTTP protocol.

An open-source implementation of DEFLATE was included in zlib, which was first released in 1995. The compression functionality of zlib was produced by Jean-loup Gailly and the decompression components were written by Mark Adler. The zlib library is considered to be one of the most widely distributed pieces of code in the world as it's used by a huge variety of software. Any security vulnerability in the library could have a significant global impact. Thankfully the C code making up the latest version is only 18K lines long and there have been ports to C++ which were a little over 3K lines long.

In this post, I'll present my findings for the fastest ZIP file decompression method I've been able to find.

Installing Tooling

I'll first install the Debian-patched version of Info-ZIP's unzip 6.0 and zip 3.0 utilities, the Debian-patched version of 7-Zip 9.20.1, CPython 2.7.12 and a host of other utilities on a fresh installation of Ubuntu 16.04.2 LTS.

$ sudo apt update
$ sudo apt install \
    binutils \
    build-essential \
    git \
    p7zip-full \
    python-dev \
    unzip \
    wget \
    zip

An Example Dataset

I'll produce a ZIP file containing a subset of files that make up the Hadoop 3.0.3 binary distribution provided by the Apache Foundation.

$ wget -c -O hadoop.tar.gz \
    https://archive.apache.org/dist/hadoop/common/hadoop-3.0.3/hadoop-3.0.3.tar.gz

The above archive is 300 MB in size and contains over 20K files.

$ tar -ztvf hadoop.tar.gz | grep -cv '/$'

I'll extract all of the contents sans the documentations into a working folder.

$ mkdir -p working
$ tar xvf hadoop.tar.gz \
      --directory=./working \
      --exclude=hadoop-3.0.3/share/doc \
      --strip 1

I'm left with 672 decompressed files adding up to 380 MB in size. The largest file is ~82 MB decompressed, four files are larger than 10 MB and then there is a long tail down to files in the tens of bytes.

$ find working -type f -print | wc -l

I'll use the strongest DEFLATE compression Info-ZIP supports to create a ZIP archive of the working folder. The resulting ZIP file is 267 MB in size.

$ zip -r -9 hadoop.zip working/

Benchmarking Inflation

In these benchmarks, I'll look at single-threaded, sequentially executed workloads. The machine used has an Intel Core i5 4670K clocked at 3.4 GHz, 8 GB of DDR3 RAM, a SanDisk SDSSDHII960G 960 GB SSD drive which is connected via a SATA interface. For the record, the Intel chip I'm using does support SSE 4.2, AVX and AVX2.

$ lscpu | tail -n1 | sed 's/ /\n/g' | grep 'avx\|sse' | sort

avx
avx2
sse
sse2
sse4_1
sse4_2
ssse3

The first tool I'll benchmark is the Debian-patched Info-ZIP DEFLATE decompression implementation.

$ vi run_1

for i in {1..25}; do
    unzip -qq -o hadoop.zip -d working/
done

$ time bash run_1

The above completed in 54.55 seconds.

For the record, the above binary doesn't appear to be built with zlib. I examined the source code and it is possible to exclude zlib while retaining all the DEFLATE functionality.

$ objdump --dynamic-syms /usr/bin/unzip | grep -c ZLIB

The above doesn't show any mention of zlib. I grep'ed the binary looking for the ZLIB_1 string that appears when zlib has been statically linked and nothing turned up.

$ strings /usr/bin/unzip | grep -ic ZLIB_

Next, I'll try CPython's unzip functionality.

$ vi run_2

for i in {1..25}; do
    python -m zipfile -e hadoop.zip working/
done

$ time bash run_2

The above completed in 56.84 seconds.

Next, I'll use Archiver version 3.2.0 which was built by Matt Holt in Golang. Golang is interesting in that it was co-designed by Ken Thompson, the inventor of UNIX and co-inventor of UTF-8, at Google where he's employed as a Distinguished Engineer. Go can take high-level code and compile it into binaries that can, among other things, take advantage of CPU vectorisation.

$ wget -c -O arc \
    https://github.com/mholt/archiver/releases/download/v3.2.0/arc_linux_amd64
$ chmod +x arc

$ vi run_3

for i in {1..25}; do
    ./arc -overwrite unarchive hadoop.zip
done

$ time bash run_3

The above completed in 100.975 seconds.

Next I'll benchmark version 9.20.1 of the Debian-patched 7-Zip utility.

$ vi run_4

for i in {1..25}; do
    7z x -bd -y -oworking/ hadoop.zip 1>/dev/null
done

$ time bash run_4

The above completed in 72.2 seconds.

Version 9.20.1 of 7-Zip above was released in 2010. I'll fetch a more recent Debian release of 7-Zip. The following will install version 16.02 which was released in mid-2016.

$ wget -c http://http.us.debian.org/debian/pool/main/p/p7zip/p7zip_16.02+dfsg-7_amd64.deb
$ sudo dpkg -i p7zip_16.02+dfsg-7_amd64.deb

I'll then run the 7-Zip benchmark again.

$ time bash run_4

The above completed in 58.05 seconds.

Next I'll compile zlib "next generation". This zlib fork is also being worked on by Mark Adler and has several performance-related patches produced by firms such as Cloudflare and Intel. The past two years has seen over 70 commits from Sebastian Pop who has a long CV with roles at firms such as AWS, Samsung R&D, Qualcomm, AMD, IBM as well as the FSF and the LLVM project.

The following was run using GCC version 5.4.0.

$ git clone https://github.com/zlib-ng/zlib-ng.git ~/zlib-ng
$ cd ~/zlib-ng
$ mkdir -p out
$ CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY" \
    ./configure \
        --prefix=./out \
        --zlib-compat
$ make
$ make install

Python is dynamically linked to zlib so I can use the above as a drop-in replacement for its unzip functionality.

$ objdump --dynamic-syms /usr/bin/python2.7 | grep ZLIB
0000000000000000      DF *UND*  0000000000000000  ZLIB_1.2.0  inflateCopy

$ cd ~
$ vi run_5

for i in {1..25}; do
    LD_PRELOAD=/home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
        /usr/bin/python2.7 \
            -m zipfile \
            -e hadoop.zip \
            working/
done

$ time bash run_5

The above completed in 48.9 seconds.

Finally, I'll compile my own version of Info-ZIP's unzip utility and see if I can find any performance improvements over the Debian-patched release.

$ git clone https://github.com/LuaDist/unzip.git
$ cd unzip
$ mkdir -p out/bin
$ export CFLAGS="-O4 -m64 -DZLIB_COMPAT -DWITH_GZFILEOP -DHAVE_HIDDEN -DHAVE_BUILTIN_CTZL -DMEDIUM_STRATEGY -DX86_64 -DX86_NOCHECK_SSE2 -DUNALIGNED_OK -DUNROLL_LESS -DX86_CPUID -DX86_SSE2_FILL_WINDOW -DX86_SSE4_2_CRC_HASH -DX86_PCLMULQDQ_CRC -DX86_QUICK_STRATEGY"
$ make -f unix/Makefile generic
$ make prefix=./out -f unix/Makefile install

The unzip binary I've produced is about 90% smaller and wasn't built with bzip2's library.

$ ls -alht ~/unzip/out/bin/unzip /usr/bin/unzip

-rwxr-xr-x 2 mark mark 175K Oct 27 05:01 /home/mark/unzip/out/bin/unzip
-rwxr-xr-x 2 root root 159K Nov 20  2015 /usr/bin/unzip

$ objdump --dynamic-syms ~/unzip/out/bin/unzip | grep -v GLIBC

/home/mark/unzip/out/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__

$ objdump --dynamic-syms /usr/bin/unzip | grep -v GLIBC

/usr/bin/unzip:     file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
0000000000000000  w   D  *UND*  0000000000000000              _ITM_deregisterTMCloneTable
0000000000000000      DF *UND*  0000000000000000              BZ2_bzlibVersion
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompressInit
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompressEnd
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
0000000000000000      DF *UND*  0000000000000000              BZ2_bzDecompress
0000000000000000  w   D  *UND*  0000000000000000              _Jv_RegisterClasses
0000000000000000  w   D  *UND*  0000000000000000              _ITM_registerTMCloneTable
0000000000627304 g    D  .data  0000000000000000  Base        _edata
00000000007198b8 g    D  .bss   0000000000000000  Base        _end
0000000000627304 g    D  .bss   0000000000000000  Base        __bss_start
00000000004019c8 g    DF .init  0000000000000000  Base        _init
000000000041acd4 g    DF .fini  0000000000000000  Base        _fini

$ cd ~
$ vi run_6

for i in {1..25}; do
    /home/mark/unzip/out/bin/unzip -qq -o hadoop.zip -d ~/working/
done

$ time bash run_6

The above completed in 52.9 seconds.

Below is a summary of the benchmark times. The fastest time is in seconds while all other times are a multiple of how much slower they were.

Seconds	Method
48.9s	CPython 2.7.12 & zlib-ng 1.2.11
1.08x	Info-ZIP's unzip 6.0 (self-compiled w/o zlib)
1.12x	Info-ZIP's unzip 6.0 (Debian distribution w/o zlib)
1.16x	CPython & zlib 1.2.0
1.19x	7-Zip 16.02
1.48x	7-Zip 9.20.1
2.04x	Archiver 3.2.0

Vectorisation Advantage?

Imagine having a toolbox on one side of a room and a workbench on the other end. Instead of taking tools out of the toolbox one at a time and bringing them over to the workbench you instead brought 4 or 8 tools at a time. That's Vectorisation.

Intel has been including opcodes in their CPUs for vectorisation since the mid-1990s. In 2011, Intel and AMD began shipping CPUs that supported AVX. AVX2 chips followed in 2012 and AVX-512 chips from Intel followed in 2016.

The performance improvements require that vectorised opcodes are used in any code executed. The following is Assembly code written by Denis Bakhvalov in a scalar (non-vectorised) fashion.

mov edx, dword ptr [r9 + 4*rsi]         # loading rhs[i]
add edx, 1                              # rhs[i] + 1
shr edx                                 # (rhs[i] + 1) >> 1
mov dword ptr [rax + 4*rsi], edx        # store result to lhs
mov esi, edi
add edi, 1                              # incrementing i by 1
cmp rcx, rsi
ja <next iteration>

The following performs the same workload as above but in a vectorised fashion. Note how 256 bits / 8 bytes of data are being handled in each iteration.

vmovdqu ymm1, ymmword ptr [r9 + 4*rdi]  # loading 256 bits from rhs
vpsubd ymm1, ymm1, ymm0                 # ymm0 has all bits set, like +1
vpsrld ymm1, ymm1, 1                    # vector shift right.
vmovdqu ymmword ptr [rax + 4*rdi], ymm1 # storing result
add rdi, 8                              # incrementing i by 8
cmp rsi, rdi
jne <next iteration>

I've built the following regular expression pattern to help match any AVX or AVX2 instructions being used. I printed the pattern out across multiple lines for easier desktop skimming.

$ vi avx_patterns.tmp

vaddpd\|vaddps\|vaddsubpd\|vaddsubps\|vandnpd\|vandnps\|vandpd\|
vandps\|vblendpd\|vblendps\|vblendvpd\|vblendvps\|vbroadcastf128\|
vbroadcasti128\|vbroadcastsd\|vbroadcastss\|vcmppd\|vcmpps\|vcmpsd\|
vcmpss\|vcvtdq2pd\|vcvtdq2ps\|vcvtpd2dq\|vcvtpd2ps\|vcvtps2dq\|
vcvtps2pd\|vcvttpd2dq\|vcvttps2dq\|vdivpd\|vdivps\|vdpps\|
vextractf128\|vextracti128\|vgatherdpd\|vgatherdps\|vgatherqpd\|
vgatherqps\|vhaddpd\|vhaddps\|vhsubpd\|vhsubps\|vinsertf128\|
vinserti128\|vlddqu\|vmaskmovpd\|vmaskmovps\|vmaxpd\|vmaxps\|vminpd\|
vminps\|vmovapd\|vmovaps\|vmovddup\|vmovddup\|vmovdqa\|vmovdqu\|
vmovmskpd\|vmovmskps\|vmovntdq\|vmovntdqa\|vmovntpd\|vmovntps\|
vmovshdup\|vmovsldup\|vmovupd\|vmovups\|vmpsadbw\|vmulpd\|vmulps\|
vorpd\|vorps\|vpabsb\|vpabsd\|vpabsw\|vpackssdw\|vpacksswb\|
vpackusdw\|vpackuswb\|vpaddb\|vpaddd\|vpaddq\|vpaddsb\|vpaddsw\|
vpaddusb\|vpaddusw\|vpaddw\|vpalignr\|vpand\|vpandn\|vpavgb\|
vpavgw\|vpblendd\|vpblendvb\|vpblendw\|vpbroadcastb\|vpbroadcastd\|
vpbroadcastq\|vpbroadcastw\|vpcmpeqb\|vpcmpeqd\|vpcmpeqq\|vpcmpeqw\|
vpcmpgtb\|vpcmpgtd\|vpcmpgtq\|vpcmpgtw\|vperm2f128\|vperm2i128\|
vpermd\|vpermilpd\|vpermilps\|vpermpd\|vpermps\|vpermq\|vpgatherdd\|
vpgatherdq\|vpgatherqd\|vpgatherqq\|vphaddd\|vphaddsw\|vphaddw\|
vphsubd\|vphsubsw\|vphsubw\|vpmaddubsw\|vpmaddwd\|vpmaskmovd\|
vpmaskmovq\|vpmaxsb\|vpmaxsd\|vpmaxsw\|vpmaxub\|vpmaxud\|vpmaxuw\|
vpminsb\|vpminsd\|vpminsw\|vpminub\|vpminud\|vpminuw\|vpmovmskb\|
vpmovsxbd\|vpmovsxbq\|vpmovsxbw\|vpmovsxdq\|vpmovsxwd\|vpmovsxwq\|
vpmovzxbd\|vpmovzxbq\|vpmovzxbw\|vpmovzxdq\|vpmovzxwd\|vpmovzxwq\|
vpmuldq\|vpmulhrsw\|vpmulhuw\|vpmulhw\|vpmulld\|vpmullw\|vpmuludq\|
vpor\|vpsadbw\|vpshufb\|vpshufd\|vpshufhw\|vpshuflw\|vpsignb\|vpsignd\|
vpsignw\|vpslld\|vpslldq\|vpsllq\|vpsllvd\|vpsllvq\|vpsllw\|vpsrad\|
vpsravd\|vpsraw\|vpsrld\|vpsrldq\|vpsrlq\|vpsrlvd\|vpsrlvq\|vpsrlw\|
vpsubb\|vpsubd\|vpsubq\|vpsubsb\|vpsubsw\|vpsubusb\|vpsubusw\|vpsubw\|
vptest\|vpunpckhbw\|vpunpckhdq\|vpunpckhqdq\|vpunpckhwd\|vpunpcklbw\|
vpunpckldq\|vpunpcklqdq\|vpunpcklwd\|vpxor\|vpxor\|vrcpps\|vroundpd\|
vroundps\|vrsqrtps\|vshufpd\|vshufps\|vsqrtpd\|vsqrtps\|vsubpd\|
vsubps\|vtestpd\|vtestps\|vunpckhpd\|vunpckhps\|vunpcklpd\|vunpcklps\|
vxorpd\|vxorps\|vzeroall\|vzeroupper

I'll remove the new lines for grep compatibility.

$ cat avx_patterns.tmp | tr -d '\n' > avx_patterns

Only the Golang Archiver utilities appears to be using any AVX opcodes.

$ objdump -d ~/arc                          | grep -cf avx_patterns # 3539
$ objdump -d /usr/bin/unzip                 | grep -cf avx_patterns # 0
$ objdump -d /usr/lib/p7zip/7z              | grep -cf avx_patterns # 0
$ objdump -d /bin/gzip                      | grep -cf avx_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf avx_patterns # 0
$ objdump -d /usr/bin/python2.7             | grep -cf avx_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
    | grep -cf avx_patterns # 0

I'll then look for signs of SSE 4.2 opcodes in use in any of the binaries benchmarked in this post.

$ vi sse42_patterns

crc32[qwlb]\|pcmpestri\|pcmpestrm\|pcmpistri\|pcmpistrm\|pcmpgtq

$ objdump -d ~/arc                          | grep -cf sse42_patterns # 24
$ objdump -d /usr/bin/unzip                 | grep -cf sse42_patterns # 0
$ objdump -d /usr/lib/p7zip/7z              | grep -cf sse42_patterns # 0
$ objdump -d /bin/gzip                      | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/unzip/out/bin/unzip | grep -cf sse42_patterns # 0
$ objdump -d /usr/bin/python2.7             | grep -cf sse42_patterns # 0
$ objdump -d /home/mark/zlib-ng/out/lib/libz.so.1.2.11.zlib-ng \
    | grep -cf sse42_patterns # 5

Both Archiver and zlib-ng have instances of SSE 4.2 opcodes. The above isn't proof that those opcodes were executed in the code paths followed during the ZIP extraction process but an absence of them proves there is no support at all for these optimisations.

It's also worth noting that CRC-32 checksums, which have a dedicated op code in SSE 4.2, are used with GZIP compression in zlib but Alder-32 is used for DEFLATE compression. The instances of CRC-32-related opcodes above prove their existence in the binary but can be presumed to go unused in the code path the unzip process took.

Alder-32 was generally able to outperform CRC-32 at the expense of accuracy but that changed when Intel released a CRC-32-specific op code in the SSE 4.2 instruction set additions a few years ago. Intel published research two years ago where they worked to improve the performance of an Alder-32 implementation using AVX2. Though not measured in the research the firm believes AVX512 could improve on their findings further. There have been similar efforts on ARM chips in recent years as well.

After my initial publication, Klaus Post reached out and noted that the DEFLATE bitstreams seen in the above ZIP file don't contain checksums. There are CRC-32 checksums contained within the ZIP file structure itself but not within the bitstreams of compressed data. This means checksums are unlikely to be a major source of computational overhead.

He also noted that the inflating of these streams is done byte by byte with each value dependant on the previous value. This means Intel's vectorisation instructions aren't going to be of great use for this workload. That being said, the Bit Manipulation Instruction Sets 1 and 2, which were first introduced in Intel CPUs using the Haswell microarchitecture in 2014, could help accelerate DEFLATE decompression. I've yet to prove these are present and contributing to zlib-ng's strong performance.

Conclusion

Despite Archiver's support for AVX and SSE optimisations, it achieved the slowest time among the tools benchmarked. Klaus Post's feedback above goes some way to explaining that the presence of these instructions might not do much for the bulk of any given decompression workload. I'm still surprised by the performance gap and this is still a point of further research.

CPython was able to show healthy improvements when zlib 1.2.0 was swapped out for zlib-ng. Prefixing the Python command is a pretty non-invasive way of taking advantage of these speedups.

I have a habit of accepting whichever software version apt install delivers but in the case of 7-Zip there was a significant performance increase to be had simply by looking for a more recent .deb file.

The fastest time seen in this benchmark represents 5.46 MB/s of ZIP content being decompressed. Given this is two orders-of-magnitude off of bottlenecking mechanical storage there is still a long way to go before compute optimisations are no longer a concern.

Thank you for taking the time to read this post. I offer both consulting and hands-on development services to clients in North America and Europe. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn.