commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefan Bodewig <bode...@apache.org>
Subject Re: [compress] Some Compressor Benchmarks
Date Mon, 27 Feb 2017 09:41:31 GMT
On 2017-02-25, Stefan Bodewig wrote:

> While I was at it I thought it would be nice to compare the performance
> of the compression formats we support:

>   https://github.com/bodewig/commons-compress-benchmarks/wiki/Comparison-of-Compression-Formats

> As expected the performance of the zlib based native gzip/deflate
> compressors/decompressors is superior to all others.  The performance of
> the pure Java LZ77 I've implemented for snappy and lz4 doesn't look too
> bad, but I'm surprised to see Snappy is so much faster than LZ4 (they
> share most of the code).

But not of the configuration space. I've started to run the benchmark
for a bigger file (Compress 1.13's uncompressed tar archive) and Snappy
degrades to the performance of bzip2 while lz4 gets even slower than
lzma.

What happens here is that the current code is more or less equivalent to
the "maximum compression" case of zlib (without lazy matches[1] which
would slow down things even more).

It searches backwards for matches until it finds one that is as long as
the maximum match (or data is exceeded). For Snappy maximum match is 64
bytes, for LZ4 it is 2^16-1 (for comparison, for DEFLATE it is 258). In
addition it stops searching backwards at a maximum distance from the
current position wich is 2^15 for Snappy (by default) and 2^16-1 for
lz4 (2^15-3 for DEFLATE).

So the current lz4 configuration tries a harder to find a lot longer
matches than Snappy does (or DEFLATE would do).

zlib has two parameters that tune the performance for the no-lazy match
case we've currently implemented.

* A "nice match length" that says stop searching once you've found a
  match of at least this length.

* a "maximum chain length" that puts a cap on the number of potential
  matches that are tried.

I plan to implement both of these but am a little sure about the
API. They'll go into lz77support.Parameter as optional properties. Of
course we will want people who know what they do to provide arbitrary
values but it would be nices to have something like

new Parameters(WINDOW_SIZE).tunedForBestCompression() or new
Parameters(WINDOW_SIZE).tunedForBestSpeed() but I'm unsure how many
level we need in between (zlib goes from 0 - no compression - to 9 -
maximum compression). And I'm also not sure there is a simple way of
picking the best possible parameters for arbitrary LZ77 variants.

Also, I'm not sure what the default should be for people using
CompressorStreamFactory.createCompressorOutputStream.

Ideas are more than welcome.

Stefan

[1] using lazy matches you find the best back-reference for the data
starting at the current byte and calculate the best back-reference if
you started with the next byte rather than the current byte and pick the
longer one. This is something I also want to add before the 1.14
release.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message