hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: GZ better than LZO?
Date Thu, 18 Aug 2011 18:51:55 GMT
You're definitely going to want to use the native libraries for zlib and gzip.

http://hadoop.apache.org/common/docs/current/native_libraries.html

It's actually a fairly easy build, and it comes out of the box with CDH IIRC.  You can put
a symlink to hadoop/lib/native in hbase/lib and you're done.

When HBase falls back to the pure-Java implementations of gzip and zlib, performance definitely suffers =/

Sandy


> -----Original Message-----
> From: BlueDavy Lin [mailto:bluedavy@gmail.com]
> Sent: Wednesday, August 17, 2011 19:07
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> We tested gz as well, but when we use it, it seems to cause the process
> to run out of memory.
> 
> It may be because the gz codec does not use Deflater/Inflater correctly
> (it does not call the end() method explicitly).
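The leak described above is a real hazard with java.util.zip: Deflater and Inflater hold native zlib state off the Java heap, which is only released promptly by an explicit end(). A minimal pure-JDK sketch of the correct pattern (class and method names are illustrative, not HBase code):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflateDemo {
    // Compress a byte[] and always release the native zlib state via end().
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[4096];
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // frees off-heap zlib memory; skipping this leaks
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[10000]; // all zeros, compresses very well
        byte[] packed = compress(data);
        System.out.println("in=" + data.length + " out=" + packed.length);
    }
}
```

Relying on finalization to call end() is what produces the slow native-memory growth described above, since the collector has no pressure to finalize objects whose Java-heap footprint is tiny.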
> 
> 2011/8/18 Sandy Pratt <prattrs@adobe.com>:
> > I also switched from LZO to GZ a while back.  I didn't do any
> > micro-benchmarks, but I did note that the overall time of some MR jobs
> > on our small cluster (~2B records at the time, IIRC) went down slightly
> > after the change.
> >
> > The primary reason I switched was not performance, however, but
> > compression ratio and licensing/build issues.  AFAIK, the GZ code is
> > branched, tested and released along with Hadoop, whereas LZO wasn't
> > when I last used it (not an academic concern, it turned out).
> >
> > One speculation about where the discrepancy between micro-benchmarks
> > and actual use may arise: do the benchmarks include the cost of
> > marshaling the data (say, a 64MB-before-compression region) from disk?
> > If the benchmark starts with the data in memory (and how do you know
> > whether it does, given the layers of cache between you and the
> > platters), then it might not reflect real-world HBase scenarios.  GZ may
> > need to read only 20MB while LZO might need to read 32MB.  Does that
> > difference dominate the computational cost of decompression?
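The ratio-versus-CPU tradeoff in that speculation can be illustrated with zlib alone: java.util.zip.Deflater exposes compression levels, so comparing BEST_SPEED against BEST_COMPRESSION is a rough pure-JDK stand-in for the LZO-vs-GZ ratio gap (it is not LZO, and the data below is synthetic):

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

public class RatioDemo {
    // Compress input at the given zlib level and return the compressed size.
    static int compressedSize(byte[] input, int level) {
        Deflater d = new Deflater(level);
        try {
            d.setInput(input);
            d.finish();
            byte[] buf = new byte[4096];
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (!d.finished()) {
                out.write(buf, 0, d.deflate(buf));
            }
            return out.size();
        } finally {
            d.end(); // release native zlib state
        }
    }

    public static void main(String[] args) {
        // Structured, mildly compressible text with some random numbers,
        // loosely resembling key/value cell data.
        Random rnd = new Random(42);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20000; i++) {
            sb.append("temperature1_value=").append(rnd.nextInt(65536)).append('\n');
        }
        byte[] data = sb.toString().getBytes();
        int fast = compressedSize(data, Deflater.BEST_SPEED);
        int best = compressedSize(data, Deflater.BEST_COMPRESSION);
        System.out.println("input=" + data.length + " fast=" + fast + " best=" + best);
    }
}
```

A smaller on-disk file means fewer bytes read per region, which is exactly the term the micro-benchmarks may be omitting when they start with data already in memory.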
> >
> >
> > Sandy
> >
> >
> >> -----Original Message-----
> >> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
> >> Sent: Friday, July 29, 2011 08:44
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> For what it's worth, I had similar observations.
> >>
> >> I simulated heavy write load and I found that NO compression was the
> >> fastest, followed by GZ, followed by LZO.
> >> After the tests I did a major_compact of the tables, and I included
> >> that time in the total.
> >> Also, these tests were done with a single region server, in order to
> >> isolate compression performance better.
> >>
> >>
> >> So at least you're not the only one seeing this :) However, it seems
> >> that this heavily depends on the details of your setup (relative CPU
> >> vs IO performance, for example).
> >>
> >>
> >> ----- Original Message -----
> >> From: Steinmaurer Thomas <Thomas.Steinmaurer@scch.at>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Thursday, July 28, 2011 11:27 PM
> >> Subject: RE: GZ better than LZO?
> >>
> >> Hello,
> >>
> >> we simulated realistic data (as in our expected production system)
> >> with respect to row key, column families ...
> >>
> >> The test client (TDG) basically implements a three-part row key.
> >>
> >> vehicle-device-reversedtimestamp
> >>
> >> vehicle: 16 characters, left-padded with "0"
> >> device: 16 characters, left-padded with "0"
> >> reversedtimestamp: YYYYMMDDhhmmss
> >>
> >> There are four column families, although currently only one called
> >> "data_details" is filled by the TDG. The others are reserved for later use.
> >> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
> >>
> >> The qualifiers for "data_details" are based on an enum with 25
> >> members. Each member yields three qualifiers, formed by appending a
> >> different suffix to the member name.
> >>
> >> Let's say, there is an enum member called "temperature1", then there
> >> are the following qualifiers used:
> >>
> >> temperature1_value
> >> temperature1_unit
> >> temperature1_validity
> >>
> >> So, we end up with 25 * 3 = 75 qualifiers per row, each filled with a
> >> random value in the range [0, 65535].
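The key and qualifier scheme described above can be sketched in plain Java. This is a guess at the TDG's layout, assuming hyphen separators as written; the actual timestamp-reversal step is not shown because its exact scheme isn't specified in the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaDemo {
    // Left-pad a value with '0' to 16 characters, per the key spec above.
    static String pad16(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < 16; i++) sb.append('0');
        return sb.append(s).toString();
    }

    // Three-part row key: vehicle-device-reversedtimestamp.
    static String rowKey(String vehicle, String device, String reversedTs) {
        return pad16(vehicle) + "-" + pad16(device) + "-" + reversedTs;
    }

    // Expand one enum member into its three qualifier names.
    static List<String> qualifiers(String member) {
        List<String> q = new ArrayList<>();
        for (String suffix : new String[] {"_value", "_unit", "_validity"}) {
            q.add(member + suffix);
        }
        return q;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("7", "42", "20110728112700"));
        System.out.println(qualifiers("temperature1"));
    }
}
```

With 25 enum members this yields the 75 qualifiers per row mentioned above.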
> >>
> >> TDG lets you define the number of simulated clients (one thread per
> >> client) and run them in multi-threaded or single-threaded mode. Data
> >> volume is determined by the number of iterations of the set of
> >> simulated clients, the number of iterations per client, the number of
> >> devices per client, and the number of rows per device.
> >>
> >> After the test finished, 1,008,000 rows had been inserted and
> >> successfully replicated to our backup test cluster.
> >>
> >> Any further ideas?
> >>
> >> PS: We are currently running a test with ~4 million rows following the
> >> pattern above.
> >>
> >> Thanks,
> >> Thomas
> >>
> >>
> >>
> >> -----Original Message-----
> >> From: Chiku [mailto:hakisenin@gmail.com]
> >> Sent: Donnerstag, 28. Juli 2011 15:35
> >> To: user@hbase.apache.org
> >> Subject: Re: GZ better than LZO?
> >>
> >> Are you getting these results because of the nature of the test data generated?
> >>
> >> Would you mind sharing some details about the test client and the
> >> data it generates?
> >>
> >>
> >> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
> >> Thomas.Steinmaurer@scch.at> wrote:
> >>
> >> > Hello,
> >> >
> >> >
> >> >
> >> > we ran a test client generating data into a GZ- and an LZO-compressed
> >> > table, with equal data sets (1,008,000 rows and the same table
> >> > schema): ~7.78 GB of disk space uncompressed in HDFS. LZO is ~887 MB
> >> > whereas GZ is ~444 MB, so basically half of LZO.
> >> >
> >> >
> >> >
> >> > Execution time of the data-generating client was 1373 seconds into
> >> > the uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ.
> >> > The data-generation client is based on HTablePool and uses batch
> >> > operations.
> >> >
> >> >
> >> >
> >> > So in our (simple) test, GZ beats LZO in both, disk usage and
> >> > execution time of the client. We haven't tried reads yet.
> >> >
> >> >
> >> >
> >> > Is this an expected result? I thought LZO was the recommended
> >> > compression algorithm? Or does LZO outperform GZ with a growing
> >> > amount of data, or in read scenarios?
> >> >
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Thomas
> >> >
> >> >
> >> >
> >> >
> >
> >
> 
> 
> 
> --
> =============================
> |  BlueDavy                 |
> |  http://www.bluedavy.com  |
> =============================
