incubator-cassandra-user mailing list archives

From Tatu Saloranta <>
Subject Re: compression
Date Thu, 01 Apr 2010 17:08:29 GMT
On Thu, Apr 1, 2010 at 8:27 AM, Rao Venugopal <> wrote:
> To Cao Jiguang
> I was watching this presentation on bigtable yesterday
> and Jeff mentioned that they compared three different compression libraries
> BMDiff, LZO and gzip.   Apparently, gzip was the most cpu intensive and they
> ended up going with BMDiff.
> I didn't find any Open source / Free implementation of BMDiff but I found
> LZO.

Another good alternative, IMO, is LZF -- it has characteristics similar
to LZO's. Gzip (i.e. deflate) is a two-phase compressor: the usual
Lempel-Ziv pass first, followed by Huffman coding (the oldest
statistical encoding). LZO, LZF and most of the newer, simpler but less
tightly compressing variants do only the Lempel-Ziv phase.
Why LZF? Because there are simple, free and open Java implementations: H2
has a codec, I ported it to Voldemort, and I think there was talk of
generalizing the H2 one into a stand-alone codec for reuse. Others may
have ported it to other libs/frameworks too (there were multiple JIRA
issues for adding some of these to Hadoop). The block format itself is
simple, and adjacent blocks can be decoded independently: you can skip
over encoded blocks without decoding them, which allows some level of
random access (seek to a random block, decode it, then access something
inside that block).
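To illustrate that block-skipping idea, here is a minimal sketch. It uses a made-up container layout (a 4-byte compressed-length prefix plus a 4-byte uncompressed-length prefix per block, not the real LZF stream format) and java.util.zip.Deflater as a stand-in codec, where a real setup would use LZF:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockSkipDemo {
    static final int RAW_BLOCK = 4096; // uncompressed block size (illustrative)

    // Compress input as independent blocks, each prefixed with
    // [4-byte compressed length][4-byte uncompressed length].
    static byte[] compressBlocks(byte[] input) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < input.length; off += RAW_BLOCK) {
            int len = Math.min(RAW_BLOCK, input.length - off);
            Deflater d = new Deflater();
            d.setInput(input, off, len);
            d.finish();
            byte[] buf = new byte[len + 128];
            int clen = 0;
            while (!d.finished()) clen += d.deflate(buf, clen, buf.length - clen);
            d.end();
            out.write(ByteBuffer.allocate(8).putInt(clen).putInt(len).array());
            out.write(buf, 0, clen);
        }
        return out.toByteArray();
    }

    // Random access: skip earlier blocks via their length prefix, decode only block n.
    static byte[] readBlock(byte[] packed, int n) throws Exception {
        ByteBuffer bb = ByteBuffer.wrap(packed);
        for (int i = 0; i < n; i++) {
            int clen = bb.getInt();
            bb.getInt();                       // uncompressed length, unused here
            bb.position(bb.position() + clen); // skip the block without decoding it
        }
        int clen = bb.getInt(), ulen = bb.getInt();
        byte[] comp = new byte[clen];
        bb.get(comp);
        Inflater inf = new Inflater();
        inf.setInput(comp);
        byte[] raw = new byte[ulen];
        inf.inflate(raw);
        inf.end();
        return raw;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[3 * RAW_BLOCK];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 251);
        byte[] packed = compressBlocks(data);
        byte[] middle = readBlock(packed, 1); // decode only the middle block
        System.out.println(middle.length);
    }
}
```

Only the length prefixes are read while skipping; the cost of reaching block n is a few header reads rather than decompressing everything before it.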

Performance-wise, the simpler codecs are fast enough to add less
decompression overhead than even the fastest parsing of textual formats
(JSON, XML), but more importantly, they are MUCH faster at writing (once
again, not much more overhead than the format encoding itself). It is
compression speed that really kills gzip, especially since it is often
the server that has to do it, given small requests and large responses.
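A quick way to see the write-side cost is to time deflate against inflate on a response-sized payload. This is just a rough sketch using java.util.zip (the payload and sizes are made up; exact numbers will vary by machine and JIT warmup), but it shows compression is where the time goes:

```java
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class GzipCostDemo {
    static byte[] deflate(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        byte[] buf = new byte[in.length + in.length / 1000 + 128];
        int n = 0;
        while (!d.finished()) n += d.deflate(buf, n, buf.length - n);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    static byte[] inflate(byte[] comp, int rawLen) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(comp);
        byte[] out = new byte[rawLen];
        inf.inflate(out);
        inf.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Repetitive JSON-ish payload standing in for a large response body.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++)
            sb.append("{\"id\":").append(i).append(",\"name\":\"user\"}");
        byte[] raw = sb.toString().getBytes("UTF-8");

        long t0 = System.nanoTime();
        byte[] comp = deflate(raw, Deflater.BEST_COMPRESSION);
        long t1 = System.nanoTime();
        byte[] back = inflate(comp, raw.length);
        long t2 = System.nanoTime();

        System.out.printf("compress %d us, decompress %d us, ratio %.2f%n",
                (t1 - t0) / 1000, (t2 - t1) / 1000,
                (double) comp.length / raw.length);
        System.out.println(Arrays.equals(raw, back)); // round-trip sanity check
    }
}
```

On typical hardware the compress side comes out much slower than the decompress side, which is exactly the cost a response-serving server pays on every write.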

-+ Tatu +-
