cassandra-user mailing list archives

From "" <>
Subject Re: Re: compression
Date Fri, 02 Apr 2010 02:03:14 GMT
	Thanks to Rao and Tatu :)
	I will test them and let you know what I find.
	Cao Jiguang

From: Tatu Saloranta
Sent: 2010-04-02 01:08:52
Subject: Re: compression

On Thu, Apr 1, 2010 at 8:27 AM, Rao Venugopal <> wrote:
> To Cao Jiguang
> I was watching this presentation on bigtable yesterday
> and Jeff mentioned that they compared three different compression libraries
> BMDiff, LZO and gzip. Apparently, gzip was the most CPU intensive and they
> ended up going with BMDiff.
> I didn't find any Open source / Free implementation of BMDiff but I found
> LZO.

Another good alternative, IMO, is LZF -- it has characteristics similar
to LZO. Gzip (i.e. deflate) is a two-phase compressor: the usual
Lempel-Ziv pass first, then Huffman coding (the oldest statistical
encoding). LZO, LZF and most of the other newer, simpler but less
aggressively compressing variants usually do only the Lempel-Ziv pass.
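For reference, the two-phase deflate pipeline described above can be exercised directly with the JDK's built-in java.util.zip classes. To be clear, this is plain deflate rather than LZF, and the class and method names are just illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Round-trip through the JDK's Deflater/Inflater, which implement the
// deflate algorithm (an LZ77 dictionary pass followed by Huffman coding)
// that gzip wraps.
public class DeflateDemo {

    public static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            bos.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return bos.toByteArray();
    }

    // Caller supplies the original length, as a framing format typically would.
    public static byte[] decompress(byte[] input, int originalLength) {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(input);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                off += inflater.inflate(out, off, originalLength - off);
            }
            inflater.end();
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException("corrupt deflate stream", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "compress me, compress me, compress me".getBytes();
        byte[] packed = compress(data);
        System.out.println(data.length + " -> " + packed.length + " bytes");
        System.out.println(new String(decompress(packed, data.length)));
    }
}
```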
Why LZF? Because there are simple, free and open Java implementations:
H2 has a codec, I ported it to Voldemort, and I think there was talk of
generalizing the one from H2 as a stand-alone codec for reuse. Others
may have ported it to other libs/frameworks too (there were multiple
JIRA issues for adding some of these to Hadoop). The block format
itself is simple, and adjacent blocks can be decoded independently:
you can skip over encoded blocks without decoding them, which allows
some level of random access (seek to a block, decode it, access
something inside the block).
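To illustrate the skipping idea, here is a sketch of a length-prefixed block layout. The real LZF block format differs, the `[compressed length][uncompressed length][payload]` framing is an assumption made up for illustration, and the payload codec here is the JDK's deflate only because it ships with the JDK:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Each block is stored as [4-byte compressed length][4-byte uncompressed
// length][payload]. Reading block k skips k payloads without decompressing
// them -- the same trick that makes block-oriented formats randomly
// accessible.
public class BlockStore {

    private static byte[] deflate(byte[] in) {
        Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.toByteArray();
    }

    // Concatenate blocks, each with its own length header.
    public static byte[] pack(byte[][] blocks) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            byte[] c = deflate(block);
            byte[] header = ByteBuffer.allocate(8)
                    .putInt(c.length).putInt(block.length).array();
            bos.write(header, 0, 8);
            bos.write(c, 0, c.length);
        }
        return bos.toByteArray();
    }

    // Decode only block `index`, skipping earlier blocks via their headers.
    public static byte[] readBlock(byte[] packed, int index) {
        try {
            ByteBuffer bb = ByteBuffer.wrap(packed);
            for (int i = 0; i < index; i++) {
                int clen = bb.getInt();
                bb.getInt(); // uncompressed length, unused while skipping
                bb.position(bb.position() + clen); // skip without decoding
            }
            int clen = bb.getInt();
            int ulen = bb.getInt();
            byte[] c = new byte[clen];
            bb.get(c);
            Inflater inf = new Inflater();
            inf.setInput(c);
            byte[] out = new byte[ulen];
            int off = 0;
            while (off < ulen && !inf.finished()) {
                off += inf.inflate(out, off, ulen - off);
            }
            inf.end();
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException("corrupt block", e);
        }
    }
}
```

The point is that only the headers of the preceding blocks are touched; their payloads are never inflated.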

Performance-wise, the simpler codecs are fast enough to add less
overhead than even the fastest parsing of textual formats (JSON, XML);
more importantly, they are MUCH faster on the write side (again, not
much more overhead than the format encoding itself). It is compression
speed that really kills gzip, especially since it is often the server
that has to do the compressing, in small-request/large-response
workloads.
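A rough sketch of how compression effort trades against write speed, using the JDK Deflater's levels as a stand-in for "heavier vs lighter codec" (not a proper benchmark: no JIT warmup, single run, synthetic data):

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

// Compress the same buffer at Deflater's fastest and strongest settings
// and report size and elapsed time, to show that the compression (write)
// side is where the cost varies.
public class SpeedSketch {

    public static byte[] compress(byte[] in, int level) {
        Deflater d = new Deflater(level);
        d.setInput(in);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) bos.write(buf, 0, d.deflate(buf));
        d.end();
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // 1 MB of mildly repetitive (compressible) data
        byte[] data = new byte[1 << 20];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) ('a' + rnd.nextInt(4));
        }
        for (int level : new int[] { Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION }) {
            long t0 = System.nanoTime();
            byte[] out = compress(data, level);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("level " + level + ": " + out.length + " bytes in " + ms + " ms");
        }
    }
}
```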

-+ Tatu +-