hbase-user mailing list archives

From "Steinmaurer Thomas" <Thomas.Steinmau...@scch.at>
Subject RE: GZ better than LZO?
Date Thu, 18 Aug 2011 13:02:43 GMT
Ah, sorry. 550 mio. rows and not billions.

Thomas

-----Original Message-----
From: Steinmaurer Thomas [mailto:Thomas.Steinmaurer@scch.at] 
Sent: Donnerstag, 18. August 2011 14:55
To: user@hbase.apache.org
Subject: RE: GZ better than LZO?

After our tests with ~550 bill. rows, we probably will go with Snappy. Our test showed better
write performance compared to GZ and LZO, with only slightly more disk usage compared to LZO.


Haven't looked at comparing read performance for our pattern, but performance of Snappy should
be sufficient here as well.

Regards,
Thomas

-----Original Message-----
From: BlueDavy Lin [mailto:bluedavy@gmail.com]
Sent: Donnerstag, 18. August 2011 04:06
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

We tested gz too, but when we use gz it seems to cause out-of-memory errors.

It may be because the gz codec does not use Deflater/Inflater correctly (it does not call the end() method explicitly).
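For context: java.util.zip.Deflater holds native zlib memory that is only released promptly by an explicit end() call; relying on finalization under heavy write load can look like a native memory leak. A minimal sketch of the safe pattern (plain java.util.zip, not the actual Hadoop codec code):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class DeflaterEndDemo {
    // Compresses input and releases the native zlib state promptly.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            // Without this call the native zlib buffers are only freed by
            // finalization, which under heavy load resembles a memory leak.
            deflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096]; // zeros compress extremely well
        System.out.println(compress(data).length < data.length); // true
    }
}
```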

2011/8/18 Sandy Pratt <prattrs@adobe.com>:
> I also switched from LZO to GZ a while back.  I didn't do any micro-benchmarks,
> but I did note that the overall time of some MR jobs on our small cluster (~2B
> records at the time, IIRC) went down slightly after the change.
>
> The primary reason I switched was not performance, however, but compression
> ratio and licensing/build issues.  AFAIK, the GZ code is branched, tested and
> released along with Hadoop, whereas LZO wasn't when I last used it (not an
> academic concern, it turned out).
>
> One speculation about where the discrepancy between micro-benchmarks and actual
> use may arise: do the benchmarks include the cost of marshaling the data (say a
> 64MB pre-compression region) from disk?  If a benchmark starts with the data in
> memory (and how do you know whether it does, given the layers of cache between
> you and the platters), then it might not reflect real-world HBase scenarios.
> GZ may need to read only 20MB while LZO might need to read 32MB.  Does that
> difference dominate the computational cost of decompression?
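Sandy's question can be put into a tiny back-of-envelope model. All throughput figures below are invented purely for illustration, not measurements from any benchmark:

```java
public class CodecCostModel {
    // time = bytes_read / disk_throughput + uncompressed_bytes / inflate_throughput
    static double seconds(double readMb, double diskMbps,
                          double outMb, double inflateMbps) {
        return readMb / diskMbps + outMb / inflateMbps;
    }

    public static void main(String[] args) {
        double outMb = 64;  // pre-compression region size, per the example above
        // Hypothetical figures, chosen only to show the tradeoff:
        double gz  = seconds(20, 100, outMb, 80);   // smaller read, slower inflate
        double lzo = seconds(32, 100, outMb, 400);  // larger read, faster inflate
        System.out.printf("gz=%.2fs lzo=%.2fs%n", gz, lzo);
    }
}
```

With these made-up numbers the faster LZO decompression wins despite the larger read; drop the assumed disk throughput far enough (around 10 MB/s in this model) and the smaller GZ read starts to dominate instead.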
>
>
> Sandy
>
>
>> -----Original Message-----
>> From: lars hofhansl [mailto:lhofhansl@yahoo.com]
>> Sent: Friday, July 29, 2011 08:44
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> For what it's worth, I had similar observations.
>>
>> I simulated heavy write load and I found that NO compression was the 
>> fastest, followed by GZ, followed by LZO.
>> After the tests I did a major_compact of the tables, and I included 
>> that time in the total.
>> Also these tests were done with a single region server, in order to 
>> isolate compression performance better.
>>
>>
>> So at least you're not the only one seeing this :) However, it seems 
>> that this heavily depends on the details of your setup (relative CPU 
>> vs IO performance, for example).
>>
>>
>> ----- Original Message -----
>> From: Steinmaurer Thomas <Thomas.Steinmaurer@scch.at>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Thursday, July 28, 2011 11:27 PM
>> Subject: RE: GZ better than LZO?
>>
>> Hello,
>>
>> we simulated realistic-looking data (as in our expected production
>> system) with respect to row key, column families, etc.
>>
>> The test client (TDG) basically implements a three-part row key.
>>
>> vehicle-device-reversedtimestamp
>>
>> vehicle: 16 characters, left-padded with "0"
>> device: 16 characters, left-padded with "0"
>> reversedtimestamp: YYYYMMDDhhmmss
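A minimal sketch of this key scheme in Java. Assumptions: the "-" separators are taken literally from the notation above, and the helper names are invented. Note also that the YYYYMMDDhhmmss format as given sorts chronologically; a truly reversed timestamp is usually computed as Long.MAX_VALUE minus the epoch time, but the sketch follows the format as stated:

```java
public class RowKeyBuilder {
    // Left-pads s with '0' to the given width, per the scheme in the mail.
    static String pad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(s).toString();
    }

    // vehicle-device-timestamp key; with fixed-width padded parts the
    // separators are not strictly needed for sort order.
    static String rowKey(String vehicle, String device, String yyyymmddhhmmss) {
        return pad(vehicle, 16) + "-" + pad(device, 16) + "-" + yyyymmddhhmmss;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("4711", "42", "20110728112700"));
        // 0000000000004711-0000000000000042-20110728112700
    }
}
```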
>>
>> There are four column families, although currently only one called 
>> "data_details" is filled by the TDG. The others are reserved for later use.
>> Replication (REPLICATION_SCOPE = 1) is enabled for all column families.
>>
>> The qualifiers for "data_details" are based on an enum with 25
>> members, and each member has three occurrences, defined by adding a 
>> different suffix to the qualifier name.
>>
>> Let's say, there is an enum member called "temperature1", then there 
>> are the following qualifiers used:
>>
>> temperature1_value
>> temperature1_unit
>> temperature1_validity
>>
>> So, we end up with 25 * 3 = 75 qualifiers per row, filled with random 
>> values in a range from [0, 65535] each.
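The qualifier expansion can be sketched as follows; the member names "sensor1" through "sensor25" are placeholders, since only "temperature1" is named in the thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class QualifierDemo {
    static final String[] SUFFIXES = {"_value", "_unit", "_validity"};

    // Expands each enum-member name into its three qualifier variants.
    static List<String> qualifiers(List<String> members) {
        List<String> out = new ArrayList<String>();
        for (String m : members) {
            for (String s : SUFFIXES) {
                out.add(m + s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Placeholder names for the 25 enum members.
        List<String> members = new ArrayList<String>();
        for (int i = 1; i <= 25; i++) {
            members.add("sensor" + i);
        }
        List<String> quals = qualifiers(members);
        System.out.println(quals.size()); // 25 * 3 = 75
        // Each qualifier would get a random value in [0, 65535]:
        int sample = new Random().nextInt(65536);
        System.out.println(sample >= 0 && sample <= 65535); // true
    }
}
```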
>>
>> TDG allows defining the number of simulated clients (one thread per
>> client) and can run them in multi-threaded or single-threaded mode.
>> Data volume is defined by the number of iterations of the set of
>> simulated clients, the number of iterations per client, the number of
>> devices per client and the number of rows per device.
>>
>> After the test has finished, 1.008.000 rows were inserted and 
>> successfully replicated to our backup test cluster.
>>
>> Any further ideas?
>>
>> PS: We are currently running a test with ~ 4mio rows following the 
>> pattern above.
>>
>> Thanks,
>> Thomas
>>
>>
>>
>> -----Original Message-----
>> From: Chiku [mailto:hakisenin@gmail.com]
>> Sent: Donnerstag, 28. Juli 2011 15:35
>> To: user@hbase.apache.org
>> Subject: Re: GZ better than LZO?
>>
>> Are you getting these results because of the nature of the generated test data?
>>
>> Would you mind sharing some details about the test client and the 
>> data it generates?
>>
>>
>> On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas < 
>> Thomas.Steinmaurer@scch.at> wrote:
>>
>> > Hello,
>> >
>> >
>> >
>> > we ran a test client generating data into GZ- and LZO-compressed
>> > tables. Equal data sets (number of rows: 1008000, and the same table
>> > schema), ~7.78 GB of uncompressed disk space in HDFS. LZO is ~887 MB
>> > whereas GZ is ~444 MB, so basically half of LZO.
>> >
>> >
>> >
>> > Execution time of the data-generating client was 1373 seconds into
>> > the uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ.
>> > The data generation client is based on HTablePool and uses batch
>> > operations.
>> >
>> >
>> >
>> > So in our (simple) test, GZ beats LZO in both disk usage and
>> > execution time of the client. We haven't tried reads yet.
>> >
>> >
>> >
>> > Is this an expected result? I thought LZO was the recommended
>> > compression algorithm? Or does LZO outperform GZ with a growing
>> > amount of data, or in read scenarios?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Thomas
>> >
>> >
>> >
>> >
>
>



--
=============================
| BlueDavy                  |
| http://www.bluedavy.com   |
=============================
