hbase-user mailing list archives

From Stan Barton <bartx...@gmail.com>
Subject Re: HTable.put hangs on bulk loading
Date Thu, 28 Apr 2011 13:54:43 GMT

Yes, these high limits are for the user running the hadoop/hbase processes.

The systems run on a cluster of 7 machines (1 master, 6 slaves), each with one
processor (two cores) and 3.5GB of memory. I am giving about 800MB to Hadoop
(version CDH3B2) and 2.1GB to HBase (version 0.90.2), and each machine has 6TB
across four disks. There are three ZooKeeper nodes. The database contains more
than 3500 regions, and the table being fed already spanned about 300 regions.
The table was fed incrementally using HTable.put(). The data are documents
ranging in size from a few bytes to megabytes, with the upper limit set to 10MB
per inserted document.
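For reference, a 10MB per-cell cap like the one described above would normally be enforced on the client side via hbase.client.keyvalue.maxsize. Assuming it is set explicitly rather than left at the default, the fragment would look roughly like this in hbase-site.xml (the value is inferred from the limit stated above, not taken from my actual config):

```xml
<!-- hbase-site.xml: cap the size of a single KeyValue sent by the client.
     10485760 bytes = 10MB, matching the per-document limit described above. -->
<property>
  <name>hbase.client.keyvalue.maxsize</name>
  <value>10485760</value>
</property>
```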

The configuration files:

hadoop/core-site.xml http://pastebin.ca/2051527
hadoop/hadoop-env.sh http://pastebin.ca/2051528
hadoop/hdfs-site.xml http://pastebin.ca/2051529

hbase/hbase-site.xml http://pastebin.ca/2051532
hbase/hbase-env.sh http://pastebin.ca/2051535

Because the nproc limit was high, I inspected the .out files of the region
servers and found one indicating that all the IPC handlers had OOMEd;
unfortunately I no longer have those files because they were overwritten after
a cluster restart. So the client side was OK. The funny thing is that all RS
processes were up and running; only the one with the OOMEd IPC handlers did not
actually communicate (after restarting the importing process, no inserts went
through). The cluster otherwise seemed OK: I was storing statistics that were
apparently served by another RS, and those were listed correctly. As I
mentioned, the log of the bad RS gave no indication that anything had gone
wrong.
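For anyone chasing the same symptoms: the limits actually in effect can be double-checked from a shell running as the daemon user (not as root). A minimal check along these lines, assuming a POSIX shell:

```shell
# Print the limits in effect for the current user; run this as the
# user that owns the Hadoop/HBase processes (e.g. via su - that-user).
nproc_limit=$(ulimit -u)    # max user processes ("nproc")
nofile_limit=$(ulimit -n)   # max open file descriptors ("nofile")
echo "nproc=${nproc_limit} nofile=${nofile_limit}"
```

Note that limits set in /etc/security/limits.conf only apply to sessions started after the change, so a daemon started earlier may still be running with the old values.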

My observation was: the regions were spread across all RSs, but the crashed RS
served the most of them, about half again as many as any other, and was
therefore accessed more than the others. I have already discussed load
balancing in HBase 0.90.2 with Ted Yu.

The balancer needs tuning, I guess, because when a table is created and loaded
from scratch its regions are not balanced equally (in terms of numbers) across
the cluster, and the RS that hosted the very first region ends up serving the
majority of the regions as they are split. That imposes a larger load on that
RS, which makes it more prone to failures (like my OOME) and kills performance.
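One way to keep the first region server from taking the brunt of the early splits is to pre-split the table at creation time, so regions land on several servers from the start. A sketch in the HBase shell (the table name, family, and split keys here are made up for illustration; on 0.90 the SPLITS option may not be available in the shell, in which case HBaseAdmin.createTable(desc, splitKeys) does the same from Java):

```
hbase> create 'docs', 'content', {SPLITS => ['doc1000', 'doc2000', 'doc3000']}
```

Choosing split keys that roughly match the distribution of the row keys being inserted is what spreads the write load; evenly spaced keys over a skewed keyspace would not help much.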

I resumed the process after rebalancing the regions beforehand, achieved a
higher data ingestion rate, and did not run into the OOME on one RS. Right now
I am trying to reproduce the incident.
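The manual rebalancing mentioned above can be triggered from the HBase shell; assuming the 0.90-era commands, roughly:

```
hbase> balance_switch true   # make sure the balancer is enabled
hbase> balancer              # ask the master to run a balancing pass
```

Individual regions can also be moved by hand with the shell's move command if the balancer's pass is not enough.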

I know that my scenario calls for better machines, but these are what I have
now, and I am running stress tests before going to production. Compared with
0.20.6, 0.90.2 is less stable with respect to insertion, but it scales
sub-linearly (v0.20.6 did not scale on my data) in terms of random access
queries (including multi-versioned data); I have done an extensive comparison
on this.


stack-3 wrote:
> On Wed, Apr 27, 2011 at 2:30 AM, Stan Barton <bartx007@gmail.com> wrote:
>> Hi,
>> what means increase? I checked on the client machines and the nproc limit
>> is
>> around 26k, that seems to be as sufficient. The same limit applies on the
>> db
>> machines...
> The nproc and ulimits are 26k for the user who is running the
> hadoop/hbase processes?
> You checked the .out files?   You can pastebin your configuration and
> we'll take a look at them.
> Sounds like the hang is in the client if you can still get to the
> cluster from a new shell....
> As Mike says, tell us more about your context.  How many regions on
> each server.  What is your payload like?
> Thanks,
> St.Ack

View this message in context: http://old.nabble.com/HTable.put-hangs-on-bulk-loading-tp31338874p31496726.html
Sent from the HBase User mailing list archive at Nabble.com.
