hbase-user mailing list archives

From "Steinmaurer Thomas" <Thomas.Steinmau...@scch.at>
Subject RE: GZ better than LZO?
Date Fri, 29 Jul 2011 06:27:42 GMT

We simulated realistic-looking data (modeled on our expected production
system) with respect to row key, column families, and so on.

The test client (TDG) basically implements a three-part row key:


vehicle: 16 characters, left-padded with "0"
device: 16 characters, left-padded with "0"
reversedtimestamp: YYYYMMDDhhmmss
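In Java (the HBase client language), the key construction above might look like the following sketch. The padding widths come from the description; the timestamp-reversal scheme (subtracting from the largest 14-digit value so newer rows sort first) and the class and method names are assumptions, since the post does not show the TDG's actual code:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RowKeyBuilder {
    // Assumption: the timestamp is "reversed" by subtracting it from the
    // largest 14-digit value, so that newer rows sort first. The original
    // post does not show how the TDG actually reverses it.
    private static final long MAX_TS = 99999999999999L; // fourteen nines

    static String leftPad(String s, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < width; i++) sb.append('0');
        return sb.append(s).toString();
    }

    static String rowKey(String vehicle, String device, LocalDateTime ts) {
        long plain = Long.parseLong(
                ts.format(DateTimeFormatter.ofPattern("yyyyMMddHHmmss")));
        return leftPad(vehicle, 16)                       // 16 chars
                + leftPad(device, 16)                     // 16 chars
                + String.format("%014d", MAX_TS - plain); // 14 chars
    }

    public static void main(String[] args) {
        // 16 + 16 + 14 = 46 characters in total
        System.out.println(rowKey("4711", "42",
                LocalDateTime.of(2011, 7, 29, 6, 27, 42)));
    }
}
```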

There are four column families, although currently only one, called
"data_details", is filled by the TDG. The others are reserved for later
use. Replication (REPLICATION_SCOPE = 1) is enabled for all column
families.

The qualifiers for "data_details" are based on an enum with 25 members,
and each member has three occurrences, produced by appending a
different suffix to the qualifier name.

For example, for an enum member called "temperature1", three suffixed
qualifiers are derived from it.

So we end up with 25 * 3 = 75 qualifiers per row, each filled with a
random value in the range [0, 65535].
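A minimal sketch of how one row's qualifier/value map could be generated. The enum members shown and the suffixes "_a", "_b", "_c" are placeholders, since the post names neither the full 25-member enum nor the actual suffixes:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class QualifierGenerator {
    // Placeholder enum standing in for the 25-member enum described above;
    // only three members are shown here. The suffixes "_a", "_b", "_c" are
    // likewise assumptions -- the post does not name the real ones.
    enum Measurement { TEMPERATURE1, TEMPERATURE2, PRESSURE1 /* ... 25 members */ }

    static final String[] SUFFIXES = {"_a", "_b", "_c"};

    // Builds one row's qualifier -> value map: every enum member appears
    // once per suffix, each cell holding a random value in [0, 65535].
    static Map<String, Integer> randomRow(Random rnd) {
        Map<String, Integer> cells = new LinkedHashMap<>();
        for (Measurement m : Measurement.values()) {
            for (String suffix : SUFFIXES) {
                cells.put(m.name().toLowerCase() + suffix, rnd.nextInt(65536));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // 3 shown members x 3 suffixes = 9 qualifiers here; with the full
        // 25-member enum the same loops would yield 75 per row.
        System.out.println(randomRow(new Random()).size());
    }
}
```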

TDG allows defining the number of simulated clients (one thread per
client) and running them in either multi-threaded or single-threaded
mode. Data volume is determined by the number of iterations of the set
of simulated clients, the number of iterations per client, the number
of devices per client, and the number of rows per device.
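The total row count is then simply the product of those knobs. In a sketch, with entirely hypothetical parameter values chosen only so that the product matches the 1,008,000 rows reported below:

```java
public class DataVolume {
    // Total rows = product of the TDG's volume parameters. All values
    // passed in below are hypothetical -- the post does not give the
    // actual TDG settings.
    static long totalRows(int setIterations, int clients,
                          int iterationsPerClient, int devicesPerClient,
                          long rowsPerDevice) {
        return (long) setIterations * clients * iterationsPerClient
                * devicesPerClient * rowsPerDevice;
    }

    public static void main(String[] args) {
        // 2 * 7 * 5 * 12 * 1200 = 1008000
        System.out.println(totalRows(2, 7, 5, 12, 1200));
    }
}
```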

After the test finished, 1,008,000 rows had been inserted and
successfully replicated to our backup test cluster.

Any further ideas?

PS: We are currently running a test with ~4 million rows following the
pattern above.


-----Original Message-----
From: Chiku [mailto:hakisenin@gmail.com] 
Sent: Donnerstag, 28. Juli 2011 15:35
To: user@hbase.apache.org
Subject: Re: GZ better than LZO?

Are you getting these results because of the nature of the test data?

Would you mind sharing some details about the test client and the data
it generates?

On Thu, Jul 28, 2011 at 7:01 PM, Steinmaurer Thomas <
Thomas.Steinmaurer@scch.at> wrote:

> Hello,
> we ran a test client generating data into GZ and LZO compressed table.
> Equal data sets (number of rows: 1008000 and the same table schema).
> ~7.78 GB disk space uncompressed in HDFS. LZO is ~887 MB whereas GZ is
> ~444 MB, so basically half of LZO.
> Execution time of the data generating client was 1373 seconds into the
> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
> generation client is based on HTablePool and using batch operations.
> So in our (simple) test, GZ beats LZO in both, disk usage and 
> execution time of the client. We haven't tried reads yet.
> Is this an expected result? I thought LZO was the recommended
> compression algorithm? Or does LZO outperform GZ with a growing
> amount of data or in read scenarios?
> Regards,
> Thomas
