hbase-user mailing list archives

From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: dilemma of memory and CPU for hbase.
Date Fri, 02 Jul 2010 00:23:00 GMT
After I ran add_table.rb, I refreshed the master's UI page and then clicked
on the table to show its regions. I expected all the regions to be there.
But I found that there are significantly fewer regions. Lots of regions
that were there before are gone.

I then restarted the hbase master and all the region servers, and now it is
even worse: the master UI page doesn't even load, saying the -ROOT- and
.META. regions are not served by any region server. The whole cluster is not
in a usable state.

That forced me to rename /hbase to /hbase-0.20.4, restart the hbase master
and all regionservers, recreate all the tables, etc., essentially starting
from scratch.
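
(For the record, the rename itself was just an HDFS move, roughly:

  hadoop fs -mv /hbase /hbase-0.20.4

before restarting the daemons.)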

Jimmy

--------------------------------------------------
From: "Jean-Daniel Cryans" <jdcryans@apache.org>
Sent: Thursday, July 01, 2010 5:10 PM
To: <user@hbase.apache.org>
Subject: Re: dilemma of memory and CPU for hbase.

> add_table.rb doesn't actually write much to the file system; all your
> data is still there. It just wipes all the .META. entries and replaces
> them with entries rebuilt from the .regioninfo files found in every
> region directory.
>
> Can you define what you mean by "corrupted"? It's really an
> overloaded term.
>
> J-D
>
> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <jinsong_hu@hotmail.com> wrote:
>> Hi, Jean:
>>  Thanks! I will run add_table.rb and see if it fixes the problem.
>>  Our namenode is backed up with HA and DRBD, and the hbase master is
>> colocated with the namenode and job tracker, so we are not wasting
>> resources.
>>
>>  The region hole probably comes from operating the previous 0.20.4 hbase.
>> The 0.20.4 hbase was very unstable during its operation: lots of times the
>> master said a region was not there, but the region server said it was
>> actually serving that region.
>>
>>
>> I followed the instructions and ran commands like
>>
>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>
>> After the execution, I found all my tables were corrupted and I can't use
>> them any more. Restarting hbase doesn't help either. I have to wipe out
>> the whole /hbase directory and start from scratch.
>>
>>
>> It looks like add_table.rb can corrupt the whole hbase. Anyway, I am
>> regenerating the data from scratch; let's see if it works out.
>>
>> Jimmy.
>>
>>
>> --------------------------------------------------
>> From: "Jean-Daniel Cryans" <jdcryans@apache.org>
>> Sent: Thursday, July 01, 2010 2:17 PM
>> To: <user@hbase.apache.org>
>> Subject: Re: dilemma of memory and CPU for hbase.
>>
>>> (taking the conversation back to the list after receiving logs and heap
>>> dump)
>>>
>>> The issue here is actually much more nasty than it seems. But before I
>>> describe the problem, you said:
>>>
>>>>  I have 3 machines as hbase master (only 1 is active), 3 zookeepers. 8
>>>> regionservers.
>>>
>>> If those are all distinct machines, you are wasting a lot of hardware.
>>> Unless you have an HA Namenode (which I highly doubt), you already have
>>> a SPOF there, so you might as well put every service on that single
>>> node (1 master, 1 zookeeper). You might be afraid of using only 1 ZK
>>> node, but unless you share the zookeeper ensemble between clusters,
>>> losing the Namenode is as bad as losing ZK, so you might as well put
>>> them together. At StumbleUpon we have 2-3 clusters using the same
>>> ensembles, so it makes more sense to put them in an HA setup.
>>>
>>> That said, in your log I see:
>>>
>>> 2010-06-29 00:00:00,064 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>> interrupted at index=0 because:Requested row out of range for HRegion
>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>> ...
>>> 2010-06-29 12:26:13,352 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>> interrupted at index=0 because:Requested row out of range for HRegion
>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>
>>> So for 12 hours (and probably more), the same row was requested almost
>>> every 100ms, and it always failed with a WrongRegionException (that's
>>> the exception we see here). You probably use the write buffer since you
>>> want to import as fast as possible, so all these buffers are left
>>> unused after the clients terminate their RPC. That rate of failed
>>> insertion must have kept your garbage collector _very_ busy, and at
>>> some point the JVM OOMEd. This is the stack from your OOME:
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>> at
>>> org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>> at
>>> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>> at
>>> org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>> at
>>> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>> at
>>> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>
>>> This is where we deserialize client data, so it correlates with what I
>>> just described.
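>>>
>>> (As an aside, by "write buffer" I mean the client-side buffering you
>>> enable on HTable. A rough sketch with the 0.20 client API; the table
>>> name is taken from your log, and the buffer size is just an example:
>>>
>>>   HTable table = new HTable(conf, "Spam_MsgEventTable");
>>>   table.setAutoFlush(false);                   // buffer puts on the client
>>>   table.setWriteBufferSize(12 * 1024 * 1024);  // flush roughly every 12MB
>>>   for (Put put : puts) {
>>>     table.put(put);                            // queued in the write buffer
>>>   }
>>>   table.flushCommits();                        // send whatever is left
>>>
>>> Every flush of that buffer is one big RPC that the server has to
>>> deserialize, which is the allocation you see in the stack above.)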
>>>
>>> Now, this means that you probably have a hole (or more) in your .META.
>>> table. It usually happens after a region server that was carrying
>>> .META. fails (since data loss is possible with that version of HDFS),
>>> or when a bug in the master messes up the .META. region. Now, two things:
>>>
>>> - It would be nice to know why you have a hole. Look at your .META.
>>> table around the row from your region server log; you should see that
>>> the start/end keys don't match (there's a shell sketch after these two
>>> points). Then you can look in yesterday's master log to search for
>>> what went wrong: maybe you'll see some exceptions, or maybe a region
>>> server that was hosting .META. failed for some reason.
>>>
>>> - You probably want to fix your table. Use the bin/add_table.rb
>>> script (other people on this list have used it in the past; search
>>> the archives for more info).
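>>>
>>> (To look around the suspect row from the shell, something along these
>>> lines should show the region entries and their start/end keys; the row
>>> key here is just the prefix from your log:
>>>
>>>   hbase> scan '.META.', {STARTROW => 'Spam_MsgEventTable,2010-06-28',
>>>     COLUMNS => ['info:regioninfo'], LIMIT => 5}
>>>
>>> )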
>>>
>>> Finally (whew!), if you are still developing your solution around
>>> HBase, you might want to try out one of our dev releases, which work
>>> with a durable Hadoop release. See
>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info. Cloudera's
>>> CDH3b2 also has everything you need.
>>>
>>> J-D
>>>
>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans 
>>> <jdcryans@apache.org>
>>> wrote:
>>>>
>>>> 653 regions is very low; even if you had a total of 3 region servers
>>>> I wouldn't expect any problem.
>>>>
>>>> So to me it seems to point towards either a configuration issue or a
>>>> usage issue. Can you:
>>>>
>>>>  - Put the log of one region server that OOMEd on a public server.
>>>>  - Tell us more about your setup: # of nodes, hardware, configuration
>>>> file
>>>>  - Tell us more about how you insert data into HBase
>>>>
>>>> And BTW are you trying to do an initial import of your data set? If
>>>> so, have you considered using HFileOutputFormat?
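>>>>
>>>> (Roughly, a bulk-load job is a normal MapReduce job that writes HFiles
>>>> through HFileOutputFormat instead of going through the region servers.
>>>> A sketch, where MyKeyValueMapper is a hypothetical mapper emitting
>>>> KeyValues and the output path is just an example:
>>>>
>>>>   Job job = new Job(conf, "hfile import");
>>>>   job.setMapperClass(MyKeyValueMapper.class);
>>>>   job.setMapOutputKeyClass(ImmutableBytesWritable.class);
>>>>   job.setMapOutputValueClass(KeyValue.class);
>>>>   job.setOutputFormatClass(HFileOutputFormat.class);
>>>>   FileOutputFormat.setOutputPath(job, new Path("/tmp/bulk_out"));
>>>>
>>>> The resulting files are then loaded into the table with the
>>>> loadtable.rb script that ships with 0.20.)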
>>>>
>>>> Thx,
>>>>
>>>> J-D
>>>>
>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu <jinsong_hu@hotmail.com>
>>>> wrote:
>>>>>
>>>>> Hi, Sir:
>>>>>  I am using hbase 0.20.5, and this morning I found that 3 of my region
>>>>> servers ran out of memory. Each regionserver is given 6G of memory, and
>>>>> I have 653 regions in total; the max store size is 256M. I analyzed the
>>>>> heap dump and it shows that there are too many HRegion objects in memory.
>>>>>
>>>>>  Previously I set the max store size to 2G, but then I found that the
>>>>> region server constantly does minor compactions and the CPU usage is
>>>>> very high. It also blocks the heavy client record insertion.
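>>>>>
>>>>> (The max store size here is the hbase.hregion.max.filesize setting in
>>>>> hbase-site.xml; the values below are just the two I tried:
>>>>>
>>>>>   <property>
>>>>>     <name>hbase.hregion.max.filesize</name>
>>>>>     <value>268435456</value>   <!-- 256M now; was 2147483648 (2G) -->
>>>>>   </property>
>>>>> )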
>>>>>
>>>>>  So now I am limited on one side by memory and on the other side by
>>>>> CPU. Is there any way to get out of this dilemma?
>>>>>
>>>>> Jimmy.
>>>>>
>>>>
>>>
>>
> 
