Compression algorithms are designed to make trade-offs in order to optimise for certain applications at the expense of others. The four major points of measurement are (1) compression time, (2) compression ratio, (3) decompression time and (4) RAM consumption.
If you're releasing a large software patch, optimising the compression ratio and decompression time would be most in the users' interest. But if the payload is already encrypted or wrapped in a digital rights management container, compression is unlikely to achieve a strong compression ratio, so decompression time should be the primary goal.
S2 is an extension of Snappy, a compression library Google first released back in 2011. Snappy made the trade-off of faster compression and decompression times at the expense of a lower compression ratio. Snappy has been popular in the data world, with containers and tools like ORC, Parquet, ClickHouse, BigQuery, Redshift, MariaDB, Cassandra, MongoDB, Lucene and bcolz all offering support. S2 aims to further improve throughput with concurrent compression of larger payloads.
S2 is also smart enough to save CPU cycles on content that is unlikely to achieve a strong compression ratio. Encrypted data, random data and data that has already been compressed will often cause compressors to waste CPU cycles with little to show for their efforts.
S2 can act as a drop-in replacement for Snappy, but for top performance it shouldn't compress using the backward-compatibility mode.
S2 Up & Running
I'll first install GoLang. The following will run on Ubuntu 20.04 LTS.
$ sudo add-apt-repository ppa:longsleep/golang-backports
$ sudo apt update
$ sudo apt install golang-go
If you're using macOS, you can install GoLang via Homebrew.
$ brew install go
Regardless of the platform, the following will install pre-compiled binaries for S2.
$ go install github.com/klauspost/compress/s2/cmd/s2c@latest
$ go install github.com/klauspost/compress/s2/cmd/s2d@latest
The above binaries were installed to /Users/mark/go/bin on my MacBook Pro. I made sure GoLang's binary folder was in my PATH environment variable so I could address the binaries without a path.
$ grep PATH ~/.bashrc
$ source ~/.bashrc
Wikipedia produces a dump of their site's contents every few days. I'll pull down one of the dump's 239 MB, bzip2-compressed segments. I'll also install lbzip2, a multi-threaded bzip2 compression utility.
On Ubuntu 20.04 LTS, run the following.
$ sudo apt install lbzip2
On macOS, run the following.
$ brew install lbzip2
The following will fetch the archive from Wikipedia.
$ wget -c https://dumps.wikimedia.org/enwiki/20210801/enwiki-20210801-pages-articles1.xml-p1p41242.bz2
I'll decompress the 879 MB XML file within the bzip2 archive.
$ lbunzip2 --keep enwiki-20210801-pages-articles1.xml-p1p41242.bz2
The following compressed at a rate of 237.9 MB/s. The resulting contents are 43.67% the size of the original decompressed XML.
$ s2c enwiki-20210801-pages-articles1.xml-p1p41242
The throughput is about half of my Thunderbolt-connected SSD's potential, and I could see all 8 CPU cores of my MacBook Pro being utilised during the above operation. The compression ratio trade-off is very prominent: the bzip2 archive is 239 MB while the new S2 archive is 384 MB. If the archive were only to travel across local networks this could be acceptable, but for anything distributed to the wider world, this ratio would be hard to justify.
Compressing the already bzip2-compressed archive reduces it by only 490 bytes, at a rate of 188.8 MB/s. The throughput rate for re-compression was only 1.26x slower than compressing the original material. This is great, as compressors can waste a lot of time looking for patterns in data where there are few to be found.
$ s2c enwiki-20210801-pages-articles1.xml-p1p41242.bz2
Given the lack of any significant throughput penalty, one could consider forgoing entropy tests when picking compressor settings on diverse workloads.
Decompression times are very much in favour of S2. The bzip2 archive took 8.24 seconds to decompress and have its bytes counted. This was at a rate of 29 MB/s on its 239 MB of source data.
$ lbunzip2 --keep --stdout enwiki-20210801-pages-articles1.xml-p1p41242.bz2 | wc -c
The s2d utility managed to do the same in 3.86 seconds, a 2.13x speed-up and a throughput rate of 99.5 MB/s on its 384 MB of source data.
$ s2d enwiki-20210801-pages-articles1.xml-p1p41242.s2 | wc -c
Google's Snappy Bindings
Bindings for Google's Snappy library are available for several languages, including Python. Below I'll install these alongside Python 3 on my MacBook Pro.
$ brew install \
    snappy \
    virtualenv
$ virtualenv ~/.snappy
$ source ~/.snappy/bin/activate
$ python3 -m pip install \
    python-snappy
The above installed, among 481 lines of Python-based helper functions, a 130 KB binary compiled from C code that's used for Snappy compression and decompression operations.
$ ls -alht ~/.snappy/lib/python3.9/site-packages/snappy/*.so
-rwxr-xr-x 1 mark staff 130K Aug 20 14:11 /Users/mark/.snappy/lib/python3.9/site-packages/snappy/_snappy.cpython-39-darwin.so
The following compressed at a rate of 56.38 MB/s, 4.2x slower than S2 did out of the box. The resulting file is 466 MB in size, about 1.2x larger than what S2 produced.
$ python3 -m snappy -c \
    enwiki-20210801-pages-articles1.xml-p1p41242 \
    enwiki-20210801-pages-articles1.xml-p1p41242.snappy
The following decompressed and had its bytes counted at a rate of 58.62 MB/s. This is ~1.7x slower throughput-wise than what was seen with S2.
$ python3 -m snappy -d \
    enwiki-20210801-pages-articles1.xml-p1p41242.snappy | wc -c