hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Performance at large number of regions/node
Date Sat, 29 May 2010 04:04:01 GMT
> What I wanted out of this discussion was to find out whether I am in the
> ballpark of what I can juice out of HBase or I am way off the mark.

I understand... but this is a distributed system we're talking about.
Unless I have the same code, hbase/hadoop version, configuration,
number of nodes, CPU, RAM, number of HDDs, OS, network equipment, data
set, etc., it's really hard to assess, right? For starters, I don't
think you specified how many drives you have per machine, and HBase is
mostly IO-bound.

FWIW, here's our experience. At StumbleUpon, we uploaded our main data
set, consisting of 13B*2 rows, onto 20 machines (2x i7, 24GB RAM (8 for
HBase), 4x 1TB JBOD) with MapReduce (using 8 maps per machine) pulling
from a MySQL cluster (we were selecting large ranges in batches),
inserting at an average rate of 150-200k rows per second, with peaks
at 1M. Our rows are a few bytes each, mostly integers and some text.
We did it at the time with HBase 0.20.3 plus the parallel-put patch we
wrote (available in trunk) and the configuration I pasted previously.
For that upload the WAL was disabled, ALL our tables are LZOed (can't
stress enough the importance of compressing your tables!), and the max
file size was 1GB.

My guess is yes you can juice it out more, first by using LZO ;)
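If it helps, here's roughly what that looks like. This is a sketch,
not our exact setup: the table and family names are placeholders, and
LZO needs the native libraries installed on every node before HBase
will accept it. In the shell:

```
create 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}
```

and the 1GB max file size I mentioned is the region split threshold in
hbase-site.xml:

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>
</property>
```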

Also, are your machines even stressed during the test? Do you monitor?
Could you increase the number of clients?

Sorry I can't give you a very clear answer, but without using a common
benchmark to compare numbers we're pretty much all in the dark. YCSB
is one, but IIRC it needs some patches to work efficiently (Todd
Lipcon from Cloudera has them in his github).

