hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: Compression
Date Fri, 16 Sep 2011 21:15:21 GMT
I run gzip in production, mostly because we have no requirement for random access, and the
improved compression ratio is a big win for our application.

The other day, I ran some tests between gzip and LZO to try and get some numbers about what
performance we might or might not be missing.  What I found is that performance is generally
a wash, so I'm happy to continue with the better compression ratio and simpler build of the
gzip native libs.

Details of the test:

I took 2 months of records from our test environment (13,499,320 to be exact, about 1kB each)
and copied them over to new tables. One table was compressed with gzip, the other with LZO.
Then I compacted each table. The gzip table wound up using 11 regions to store the data,
while the LZO table used 15 regions (region size and block size settings are all default).
I then ran a simple MapReduce job against each table. As you would expect, the job against
the gzip table finished more quickly because it had fewer maps to do (this is a 2-node test
cluster, so it had to run the maps in series).

Best times of several trials were:

LZO:  2 min 24 sec
Gzip: 2 min 21 sec
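
To put the figures above in proportion, here is a minimal Python sketch that only redoes the arithmetic on the reported numbers. The helper name relative_saving is made up for illustration, and region count is at best a coarse proxy for on-disk size, so treat the first percentage as approximate.

```python
# Figures reported in the message; nothing here is newly measured.
records = 13_499_320
regions = {"gzip": 11, "lzo": 15}
best_seconds = {"gzip": 2 * 60 + 21, "lzo": 2 * 60 + 24}


def relative_saving(smaller, larger):
    """Fraction by which `smaller` undercuts `larger` (hypothetical helper)."""
    return (larger - smaller) / larger


# Region count is only a rough stand-in for compressed size on disk.
region_saving = relative_saving(regions["gzip"], regions["lzo"])
time_saving = relative_saving(best_seconds["gzip"], best_seconds["lzo"])

# Scan throughput of the MapReduce job, in records per second.
records_per_second = {
    codec: records / secs for codec, secs in best_seconds.items()
}

print(f"gzip used {region_saving:.1%} fewer regions than LZO")
print(f"gzip job finished {time_saving:.1%} faster than LZO")
for codec, rate in records_per_second.items():
    print(f"{codec}: {rate:,.0f} records/s")
```

The region difference is large (4 of 15) while the timing difference is small (3 seconds out of 144), which is why the run times read as a wash.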

Obviously, this is a tiny dataset on a tiny cluster, but I don't see any reason why the results
wouldn't hold up in a real world setting, all else being equal (at least for my workload).
I'll probably continue to use gzip (with native libs - falling back to Java for gzip is terrible)
and maybe look at Snappy in the future if our requirements change.

Sandy


-----Original Message-----
From: Wayne [mailto:wav100@gmail.com] 
Sent: Wednesday, September 14, 2011 5:34 AM
To: user@hbase.apache.org
Subject: Compression

I wanted to do a poll on what compression libraries people are using and why. We currently
use lzo but are considering other alternatives for various reasons. We would like to move
to CDH3 but adding lzo ourselves is a hassle we are not looking to take on. It kind of defeats
the purpose of using CDH3 to begin with. We currently run 20.0 append.

I know there are a lot of variables that affect the best decision, but we are looking for
general trends in the community.

Is lzo still the most recommended? Is there benefit in using the lzo professional library
and does anyone use this?
Is snappy just as good as lzo and a lot easier to deal with in terms of node build/releases?
Does zlib/gzip have any traction?

Compression ratios are important but as always performance/speed is our biggest requirement.
What are people using and why? Where is the momentum going? Compression is a huge benefit
of hadoop/hbase and having high compression ratios with solid performance is a major benefit.

Any recommendations would be appreciated.

Thanks.
