hbase-user mailing list archives

From Chris Tarnas <...@email.com>
Subject Re: GZ better than LZO?
Date Fri, 29 Jul 2011 16:48:45 GMT
Your region distribution across the nodes is not great: in both cases most of your data is
going to one server. Spreading the regions out across multiple servers would be best.

How many different vehicle_ids are being used, and are they all sequential integers in your
tests? HBase performs better when inserts are not sequential. You could try reversing the
vehicle ids to get around that (see the many discussions on the list about using reversed
timestamps as a rowkey).
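
For illustration, a minimal sketch in Java (reverseId is a made-up helper
name, not anything from the HBase API):

    // Reverse the decimal digits of a sequential id so that consecutive
    // ids no longer sort next to each other (and pile onto one region).
    static String reverseId(long id) {
        return new StringBuilder(Long.toString(id)).reverse().toString();
    }
    // reverseId(12345) -> "54321", reverseId(12346) -> "64321":
    // adjacent inputs now start with different characters.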

Looking at your key construction, I would suggest (unless your app requires it) not left-padding
your ids with zeros, and instead using a delimiter between the key components. That will lead
to smaller keys. If you use a tab as your delimiter, that character sorts before all other
alphanumeric and punctuation characters (other than LF, CR, etc. - characters that should not
be in your IDs anyway), so the keys will sort the same as left-padded ones.
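
Something like this, assuming a vehicle id plus timestamp key (makeRowKey
and the component names are mine, just for illustration):

    import java.nio.charset.StandardCharsets;

    // Tab (0x09) sorts below every printable ASCII character, so
    // tab-delimited keys keep the same relative order as zero-padded
    // fixed-width ones, while staying shorter for small ids.
    static byte[] makeRowKey(String vehicleId, String timestamp) {
        return (vehicleId + "\t" + timestamp)
                .getBytes(StandardCharsets.UTF_8);
    }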

I've had good luck with converting sequential numeric IDs to base 64 and then reversing them
- that leads to very good key distribution across regions and shorter keys for any given number.
Another option, if you don't care whether your rowkeys are plain text, is to convert the IDs
to binary and then reverse the bytes - that would be the most compact. If you do that,
you would go back to not using delimiters and just use fixed offsets for each component.
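
A rough sketch of the base-64-and-reverse idea (the alphabet and the
method name are my own choices, not a standard API):

    // Emit the id in base 64, least-significant digit first. Emitting
    // the low digit first is what "reverses" the number, so adjacent
    // ids differ in their first character and spread across regions.
    static String base64Reversed(long id) {
        final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            + "abcdefghijklmnopqrstuvwxyz+/";
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(ALPHABET.charAt((int) (id & 63))); // low 6 bits
            id >>>= 6;
        } while (id != 0);
        return sb.toString(); // e.g. 12345 -> "v03", 12346 -> "w03"
    }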

Once you have a rowkey design, you can then go ahead and create your tables pre-split into
multiple empty regions. That should perform much better overall for inserts, especially when
the DB is new and empty to start.
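
A rough sketch with the old (0.90-era) client API; the table name, column
family and split points are placeholders that would come from your real
key distribution:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = new HTableDescriptor("vehicle_data");
    desc.addFamily(new HColumnDescriptor("d"));
    byte[][] splits = {
        Bytes.toBytes("2"), Bytes.toBytes("4"),
        Bytes.toBytes("6"), Bytes.toBytes("8")
    };
    admin.createTable(desc, splits); // 5 empty regions from the start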

How did the load with 4 million records perform?

-chris

On Jul 29, 2011, at 12:36 AM, Steinmaurer Thomas wrote:

> Hi Chris!
> 
> Your questions are somewhat hard for me to answer, because I'm not really
> in charge of the test cluster from an administration/setup POV.
> 
> Basically, when running:
> http://xxx:60010/master.jsp
> 
> I see 7 region servers, each with a "maxHeap" value of 995.
> 
> When clicking on the different tables depending on the compression type,
> I get the following information:
> 
> GZ compressed table: 3 regions hosted by one region server
> LZO compressed table: 8 regions hosted by two region servers, where the
> start region is hosted by one region server and the other 7 regions are
> hosted on the second region server
> 
> Regarding the insert pattern etc., please have a look at my reply to
> Chiku, where I describe the test data generator and the table layout
> a bit.
> 
> Thanks,
> Thomas
> 
> -----Original Message-----
> From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris
> Tarnas
> Sent: Thursday, 28 July 2011 19:43
> To: user@hbase.apache.org
> Subject: Re: GZ better than LZO?
> 
> During the load did you add enough data to trigger a flush or compaction?
> In our cluster that amount of data inserted would not necessarily be
> enough to actually flush store files. Performance really depends on how
> the table's regions are laid out, the insert pattern, the number of
> regionservers and the amount of RAM allocated to each regionserver. If
> you don't see any flushes or compactions in the log, try repeating that
> test, then flush the table and run a compaction (or add more data so it
> happens automatically), timing everything. It would be interesting to
> see if the GZ benefit holds up.
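> 
> For example, with the 0.90-era Java client - flush() and majorCompact()
> are existing HBaseAdmin methods, the table name is a placeholder:
> 
>   HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
>   admin.flush("your_table");        // force memstores to store files
>   admin.majorCompact("your_table"); // then merge the store files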
> 
> -chris
> 
> On Jul 28, 2011, at 6:31 AM, Steinmaurer Thomas wrote:
> 
>> Hello,
>> 
>> 
>> 
>> we ran a test client generating data into GZ and LZO compressed tables.
>> Equal data sets (number of rows: 1008000, same table schema); ~
>> 7.78 GB disk space uncompressed in HDFS. LZO is ~ 887 MB whereas GZ is
>> ~ 444 MB, so basically half of LZO.
>> 
>> 
>> 
>> Execution time of the data generating client was 1373 seconds into the
>> uncompressed table, 3374 sec. into LZO and 2198 sec. into GZ. The data
>> generation client is based on HTablePool and uses batch operations.
>> 
>> 
>> 
>> So in our (simple) test, GZ beats LZO in both disk usage and
>> execution time of the client. We haven't tried reads yet.
>> 
>> 
>> 
>> Is this an expected result? I thought LZO was the recommended
>> compression algorithm? Or does LZO outperform GZ with a growing
>> amount of data, or in read scenarios?
>> 
>> 
>> 
>> Regards,
>> 
>> Thomas
>> 
>> 
>> 
> 

