hbase-user mailing list archives

From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: dilemma of memory and CPU for hbase.
Date Fri, 02 Jul 2010 00:01:58 GMT
Hi, Jean:
  Thanks! I will run add_table.rb and see if it fixes the problem.
  Our namenode is backed up with HA and DRBD, and the hbase master is 
colocated with the namenode and job tracker, so we are not wasting resources.

  The region hole probably comes from our previous 0.20.4 hbase operation. 
That release was very unstable for us: many times the master said a region 
was not there while the region server said it was serving it.

I followed the instructions and ran a command like

bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
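
In hindsight, I probably should have kept a copy of the relevant HDFS 
directories before letting add_table.rb rewrite the .META. entries, so a 
bad run could be rolled back. A rough sketch (directory names follow the 
default 0.20 layout under /hbase; adjust to your setup):

```shell
# Back up .META. and the table directory on HDFS first
# (the backup path is just an example).
bin/hadoop fs -cp /hbase/.META. /hbase-backup/.META.
bin/hadoop fs -cp /hbase/table_name /hbase-backup/table_name

# Then re-add the table's regions to .META.:
bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
```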

After the execution, I found all my tables were corrupted and I couldn't 
use them any more. Restarting hbase didn't help either. I had to wipe out 
the whole /hbase directory and start from scratch.

It looks like add_table.rb can corrupt the whole hbase instance. Anyway, I 
am regenerating the data from scratch; let's see if it works out.
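
For anyone hitting the same WrongRegionException: one way to look for the 
hole J-D describes below is to scan .META. around the suspect row from the 
region server log, and check that each region's end key matches the next 
region's start key (0.20.x shell syntax; the row key here is the prefix 
from my logs):

```
hbase> scan '.META.', {COLUMNS => ['info:regioninfo'],
                       STARTROW => 'Spam_MsgEventTable,2010-06-28',
                       LIMIT => 10}
```

A gap between one region's end key and the next region's start key is the 
hole; writes to rows falling in that gap fail with WrongRegionException.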


From: "Jean-Daniel Cryans" <jdcryans@apache.org>
Sent: Thursday, July 01, 2010 2:17 PM
To: <user@hbase.apache.org>
Subject: Re: dilemma of memory and CPU for hbase.

> (taking the conversation back to the list after receiving logs and heap 
> dump)
> The issue here is actually much more nasty than it seems. But before I
> describe the problem, you said:
>>  I have 3 machines as hbase master (only 1 is active), 3 zookeepers. 8
>> regionservers.
> If those are all distinct machines, you are wasting a lot of hardware.
> Unless you have an HA Namenode (which I highly doubt), you already
> have a SPOF there, so you might as well put every service on that
> single node (1 master, 1 zookeeper). You might be afraid of using only 1 ZK
> node, but unless you share the zookeeper ensemble between clusters
> then losing the Namenode is as bad as losing ZK so might as well put
> them together. At StumbleUpon we have 2-3 clusters using the same
> ensembles, so it makes more sense to put them in a HA setup.
> That said, in your log I see:
> 2010-06-29 00:00:00,064 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
> interrupted at index=0 because:Requested row out of range for HRegion
> Spam_MsgEventTable,2010-06-28 11:34:02blah
> ...
> 2010-06-29 12:26:13,352 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
> interrupted at index=0 because:Requested row out of range for HRegion
> Spam_MsgEventTable,2010-06-28 11:34:02blah
> So for 12 hours (and probably more), the same row was requested almost
> every 100ms but it was always failing on a WrongRegionException
> (that's the name of what we see here). You probably use the write
> buffer since you want to import as fast as possible, so all these
> buffers are left unused after the clients terminate their RPC. That
> rate of failed insertion must have kept your garbage collector _very_
> busy, and at some point the JVM OOMEd. This is the stack from your
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
> at 
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
> This is where we deserialize client data, so it correlates with what I
> just described.
> Now, this means that you probably have a hole (or more) in your .META.
> table. It usually happens after a region server fails if it was
> carrying it (since data loss is possible with that version of HDFS) or
> if a bug in the master messes up the .META. region. Now 2 things:
> - It would be nice to know why you have a hole. Look at your .META.
> table around the row in your region server log, you should see that
> the start/end keys don't match. Then you can look in the master log
> from yesterday to search for what went wrong, maybe see some
> exceptions, or maybe a region server failed for any reason and it was
> hosting .META.
> - You probably want to fix your table. Use the bin/add_table.rb
> script (other people on this list used it in the past, search the
> archive for more info).
> Finally (whew!), if you are still developing your solution around
> HBase, you might want to try out one of our dev releases that work
> with a durable Hadoop release. See
> http://hbase.apache.org/docs/r0.89.20100621/ for more info. Cloudera's
> CDH3b2 also has everything you need.
> J-D
> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans <jdcryans@apache.org> 
> wrote:
>> 653 regions is very low, even if you had a total of 3 region servers I
>> wouldn't expect any problem.
>> So to me it seems to point towards either a configuration issue or a
>> usage issue. Can you:
>>  - Put the log of one region server that OOMEd on a public server.
>>  - Tell us more about your setup: # of nodes, hardware, configuration 
>> file
>>  - Tell us more about how you insert data into HBase
>> And BTW are you trying to do an initial import of your data set? If
>> so, have you considered using HFileOutputFormat?
>> Thx,
>> J-D
>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <jinsong_hu@hotmail.com> 
>> wrote:
>>> Hi, Sir:
>>>  I am using hbase 0.20.5 and this morning I found that 3 of my region
>>> servers had run out of memory.
>>> Each regionserver is given 6G of memory, and I have 653 regions in
>>> total; the max store size is 256M. I analyzed the heap dump and it
>>> shows that there are too many HRegion objects in memory.
>>>  Previously I set the max store size to 2G, but then I found the
>>> region servers constantly doing minor compactions with very high CPU
>>> usage, which also blocked the heavy client record insertion.
>>>  So now I am limited on one side by memory and on the other side by
>>> CPU. Is there any way to get out of this dilemma?
>>> Jimmy.
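
For completeness, the two knobs in my dilemma map to concrete settings in 
0.20.x. The values below are just the ones mentioned in this thread, shown 
for illustration, not recommendations:

```xml
<!-- hbase-site.xml: region split threshold discussed above
     (the 256M default vs. the 2G that caused constant compaction) -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>268435456</value> <!-- 256MB, in bytes -->
</property>
```

The region server heap (the 6G mentioned above) is set separately in 
conf/hbase-env.sh, e.g. `export HBASE_HEAPSIZE=6000` (value in MB).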
