hbase-user mailing list archives

From "Molinari, Guy" <Guy.Molin...@disney.com>
Subject Hbase and linear scaling with small write intensive clusters
Date Tue, 22 Sep 2009 22:14:11 GMT
Hello all,

I've been working with HBase for the past few months on a
proof-of-concept/technology adoption evaluation. I wanted to describe my
scenario to the user/development community to get some input on my
observations.

I've written an application that is composed of two tables. It models
a classic many-to-many relationship. One table stores "User" data and
the other represents an "Inbox" of items assigned to that user. The
key for the user is a string generated by the JDK's UUID.randomUUID()
method. The key for the "Inbox" is a monotonically increasing value.
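
A minimal sketch of the two key schemes, for illustration only. The
class and method names are hypothetical, and the zero-padded string
encoding of the "Inbox" counter is an assumption (the message does not
say how the value is encoded); padding is used here so that string
order matches numeric order.

```java
import java.util.Locale;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class KeySchemes {
    // Hypothetical in-process counter standing in for the "Inbox" sequence.
    private static final AtomicLong INBOX_SEQ = new AtomicLong();

    /** "User" key: a random UUID string, so keys are spread across the key space. */
    public static String newUserKey() {
        return UUID.randomUUID().toString();
    }

    /** "Inbox" key: monotonically increasing, zero-padded to 19 digits (assumed encoding). */
    public static String nextInboxKey() {
        return String.format(Locale.ROOT, "%019d", INBOX_SEQ.incrementAndGet());
    }

    public static void main(String[] args) {
        System.out.println("user key:  " + newUserKey());
        System.out.println("inbox key: " + nextInboxKey());
        System.out.println("inbox key: " + nextInboxKey());
    }
}
```

Because HBase stores rows sorted by key, every new "Inbox" key produced
this way sorts after all existing rows, while the UUID keys do not.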

It works just fine. I've reviewed the performance tuning info on the
HBase wiki page. The client application spins up 100 threads, each one
grabbing a range of keys (for the "Inbox"). The I/O mix is about
50/50 read/write. The test client inserts 1,000,000 "Inbox" items and
verifies the existence of a "User" (FK check). It uses column families
to maintain integrity of the relationships.
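
As an illustration of how a key range can be divided among the client
threads, here is a minimal sketch. The class and method names are
hypothetical, and contiguous, equally sized ranges are an assumption;
the message only says that each thread grabs a range of keys.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyRanges {
    /**
     * Splits [0, total) into `parts` contiguous [start, end) ranges.
     * Any remainder is spread one key at a time over the first ranges.
     */
    public static List<long[]> split(long total, int parts) {
        List<long[]> ranges = new ArrayList<>();
        long base = total / parts;
        long extra = total % parts;
        long start = 0;
        for (int i = 0; i < parts; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 1,000,000 "Inbox" items over 100 threads, as in the test described above.
        List<long[]> ranges = split(1_000_000L, 100);
        long[] first = ranges.get(0);
        long[] last = ranges.get(ranges.size() - 1);
        System.out.println("first: [" + first[0] + ", " + first[1] + ")");
        System.out.println("last:  [" + last[0] + ", " + last[1] + ")");
    }
}
```

With these numbers each thread is responsible for 10,000 keys.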

I'm running versions 0.19.3 and 0.20.0. The behavior is basically the
same. The cluster consists of 10 nodes. I'm running my namenode and
HBase master on one dedicated box. The other 9 run datanodes/region
servers.

I'm seeing about 1,000 "Inbox" transactions per second (dividing total
count inserted by total time for the batch). The problem is that I
get the same results with 5 nodes as with 10, which is not quite what
I was expecting.
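
The throughput figure is the item count divided by the elapsed batch
time. A minimal sketch of that arithmetic follows. The 1,000-second
elapsed time is not a measured value; it is only what 1,000,000 items
at roughly 1,000 per second would imply. The class and method names
are hypothetical.

```java
public class Throughput {
    /** Transactions per second: items processed divided by elapsed seconds. */
    public static double perSecond(long count, double elapsedSeconds) {
        return count / elapsedSeconds;
    }

    public static void main(String[] args) {
        long inserted = 1_000_000L;      // "Inbox" items in the batch
        double elapsedSeconds = 1000.0;  // assumed: implied by ~1,000 tx/sec, not measured
        System.out.println(perSecond(inserted, elapsedSeconds) + " tx/sec");
    }
}
```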

The bottleneck seems to be the splitting algorithms. I've set my
region size to 2M. I can see that as the process moves forward, HBase
pauses, splits regions, and redistributes the data. It does this
first for the "Inbox" table and then pauses again and redistributes the
"User" table. This pause can be quite long, often 2 minutes or more.

Can the key ranges be pre-defined somehow in advance to avoid this? I
would rather not burden application developers/DBAs with this.
Perhaps the splitting algorithms could be sped up? Any configuration
recommendations?

Thanks in advance,

Guy

