hbase-user mailing list archives

From "Jinsong Hu" <jinsong...@hotmail.com>
Subject Re: dilemma of memory and CPU for hbase.
Date Fri, 02 Jul 2010 00:49:58 GMT

I do have some errors, such as:

2010-07-01 22:53:30,187 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.110.8.85:50010
java.io.EOFException

2010-07-01 23:00:49,976 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection timed out
2010-07-01 23:04:13,356 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.net.ConnectException: Connection timed out


It seems they are all Hadoop datanode errors.

I searched around and people say I need to increase dfs.datanode.max.xcievers
to 2K and increase the ulimit to 32K (currently it is set at 16K).
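
For reference, a rough sketch of where those two settings usually live (values
as suggested above; the "hadoop" user name and file locations are assumptions
that depend on the install):

  <!-- hdfs-site.xml on every datanode: raise the transceiver limit -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2048</value>
  </property>

  # /etc/security/limits.conf: raise the open-file limit (soft and hard)
  # for the user that runs the datanode and regionserver daemons
  hadoop  -  nofile  32768

The datanodes (and, for the ulimit, the regionservers) need to be restarted to
pick these up.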

I will get that done and do more testing.

Jimmy.

--------------------------------------------------
From: "Jean-Daniel Cryans" <jdcryans@apache.org>
Sent: Thursday, July 01, 2010 5:41 PM
To: <user@hbase.apache.org>
Subject: Re: dilemma of memory and CPU for hbase.

> When I start HBase I usually just tail the master log; assigning -ROOT-
> actually takes just a few seconds, then another few seconds for .META.,
> then it starts assigning all the other regions.
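
For example, something like this (the log directory and file-name pattern are
the usual defaults, so treat it as a sketch):

  tail -f $HBASE_HOME/logs/hbase-*-master-*.log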
>
> Did you make sure your master log was clean of errors?
>
> J-D
>
> On Thu, Jul 1, 2010 at 5:40 PM, Jinsong Hu <jinsong_hu@hotmail.com> wrote:
>> Yes, it terminated correctly. There was no exception while running
>> add_table.
>>
>> Are you saying that after a restart I need to wait some time for -ROOT-
>> to be assigned? Usually how long do I need to wait?
>>
>> Jimmy
>>
>> --------------------------------------------------
>> From: "Jean-Daniel Cryans" <jdcryans@apache.org>
>> Sent: Thursday, July 01, 2010 5:27 PM
>> To: <user@hbase.apache.org>
>> Subject: Re: dilemma of memory and CPU for hbase.
>>
>>> Did you see any exception when you ran add_table? Did it even
>>> terminate correctly?
>>>
>>> After a restart, the regions aren't readily available. If something
>>> blocked the master from assigning -ROOT-, it should be pretty evident
>>> by looking at the master log.
>>>
>>> J-D
>>>
>>> On Thu, Jul 1, 2010 at 5:23 PM, Jinsong Hu <jinsong_hu@hotmail.com> 
>>> wrote:
>>>>
>>>> After I ran add_table.rb, I refreshed the master's UI page and then
>>>> clicked on the table to show the regions. I expected that all the
>>>> regions would be there. But I found that there were significantly fewer
>>>> regions; lots of regions that were there before were gone.
>>>>
>>>> I then restarted the HBase master and all the region servers, and now
>>>> it is even worse: the master UI page doesn't even load, saying the
>>>> -ROOT- and .META. regions are not served by any regionserver. The whole
>>>> cluster is not in a usable state.
>>>>
>>>> That forced me to rename /hbase to /hbase-0.20.4, restart all the HBase
>>>> masters and regionservers, recreate all the tables, etc., essentially
>>>> starting from scratch.
>>>>
>>>> Jimmy
>>>>
>>>> --------------------------------------------------
>>>> From: "Jean-Daniel Cryans" <jdcryans@apache.org>
>>>> Sent: Thursday, July 01, 2010 5:10 PM
>>>> To: <user@hbase.apache.org>
>>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>>
>>>>> add_table.rb doesn't actually write much in the file system, all your
>>>>> data is still there. It just wipes all the .META. entries and replaces
>>>>> them with the .regioninfo files found in every region directory.
>>>>>
>>>>> Can you define what you mean by "corrupted"? It's really an
>>>>> overloaded term.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Jul 1, 2010 at 5:01 PM, Jinsong Hu <jinsong_hu@hotmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi, Jean:
>>>>>>  Thanks! I will run the add_table.rb and see if it fixes the problem.
>>>>>>  Our namenode is backed up with HA and DRBD, and the hbase master
>>>>>> machine is colocated with the namenode and job tracker, so we are not
>>>>>> wasting resources.
>>>>>>
>>>>>>  The region hole probably comes from the earlier 0.20.4 hbase
>>>>>> operation; the 0.20.4 hbase was very unstable while it was running.
>>>>>> Lots of times the master said a region was not there while the region
>>>>>> server said it was actually serving it.
>>>>>>
>>>>>>
>>>>>> I followed the instructions and ran commands like:
>>>>>>
>>>>>> bin/hbase org.jruby.Main bin/add_table.rb /hbase/table_name
>>>>>>
>>>>>> After the execution, I found all my tables were corrupted and I
>>>>>> couldn't use them any more. Restarting hbase didn't help either; I had
>>>>>> to wipe out the whole /hbase directory and start from scratch.
>>>>>>
>>>>>>
>>>>>> It looks like add_table.rb can corrupt the whole hbase. Anyway, I am
>>>>>> regenerating the data from scratch, and let's see if it works out.
>>>>>>
>>>>>> Jimmy.
>>>>>>
>>>>>>
>>>>>> --------------------------------------------------
>>>>>> From: "Jean-Daniel Cryans" <jdcryans@apache.org>
>>>>>> Sent: Thursday, July 01, 2010 2:17 PM
>>>>>> To: <user@hbase.apache.org>
>>>>>> Subject: Re: dilemma of memory and CPU for hbase.
>>>>>>
>>>>>>> (taking the conversation back to the list after receiving logs and
>>>>>>> heap dump)
>>>>>>>
>>>>>>> The issue here is actually much nastier than it seems. But before I
>>>>>>> describe the problem, you said:
>>>>>>>
>>>>>>>>  I have 3 machines as hbase master (only 1 is active), 3 zookeepers,
>>>>>>>> 8 regionservers.
>>>>>>>
>>>>>>> If those are all distinct machines, you are wasting a lot of hardware.
>>>>>>> Unless you have an HA Namenode (which I highly doubt), you already have
>>>>>>> a SPOF there, so you might as well put every service on that single
>>>>>>> node (1 master, 1 zookeeper). You might be afraid of using only 1 ZK
>>>>>>> node, but unless you share the zookeeper ensemble between clusters,
>>>>>>> losing the Namenode is as bad as losing ZK, so you might as well put
>>>>>>> them together. At StumbleUpon we have 2-3 clusters using the same
>>>>>>> ensembles, so it makes more sense to put them in an HA setup.
>>>>>>>
>>>>>>> That said, in your log I see:
>>>>>>>
>>>>>>> 2010-06-29 00:00:00,064 DEBUG
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>>> ...
>>>>>>> 2010-06-29 12:26:13,352 DEBUG
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Batch puts
>>>>>>> interrupted at index=0 because:Requested row out of range for HRegion
>>>>>>> Spam_MsgEventTable,2010-06-28 11:34:02blah
>>>>>>>
>>>>>>> So for 12 hours (and probably more), the same row was requested almost
>>>>>>> every 100ms but it was always failing on a WrongRegionException
>>>>>>> (that's the name of what we see here). You probably use the write
>>>>>>> buffer since you want to import as fast as possible, so all these
>>>>>>> buffers are left unused after the clients terminate their RPC. That
>>>>>>> rate of failed insertion must have kept your garbage collector _very_
>>>>>>> busy, and at some point the JVM OOMEd. This is the stack from your
>>>>>>> OOME:
>>>>>>>
>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:175)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:835)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:419)
>>>>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:318)
>>>>>>>
>>>>>>> This is where we deserialize client data, so it correlates with what
>>>>>>> I just described.
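
For context, the client-side write buffer mentioned above is normally enabled
like this with the 0.20-era client API (a sketch only; the table name is taken
from the logs above and the buffer size is made up):

  import java.io.IOException;
  import java.util.List;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;

  public class BufferedImport {
    // Rough sketch: buffered writes that only hit the region servers
    // when the client-side buffer fills up or is flushed explicitly.
    public static void write(List<Put> puts) throws IOException {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "Spam_MsgEventTable");
      table.setAutoFlush(false);                  // keep puts in the client buffer
      table.setWriteBufferSize(12 * 1024 * 1024); // e.g. flush roughly every 12MB
      for (Put put : puts) {
        table.put(put);                           // queued, not yet sent
      }
      table.flushCommits();                       // send whatever is still buffered
    }
  }

Each flush ships the whole buffer as one batch, which is presumably where the
"Batch puts interrupted" lines in the log above come from.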
>>>>>>>
>>>>>>> Now, this means that you probably have a hole (or more) in your .META.
>>>>>>> table. It usually happens after a region server fails if it was
>>>>>>> carrying it (since data loss is possible with that version of HDFS) or
>>>>>>> if a bug in the master messes up the .META. region. Now 2 things:
>>>>>>>
>>>>>>> - It would be nice to know why you have a hole. Look at your .META.
>>>>>>> table around the row in your region server log (a shell sketch follows
>>>>>>> after this list); you should see that the start/end keys don't match.
>>>>>>> Then you can look in the master log from yesterday to search for what
>>>>>>> went wrong, maybe see some exceptions, or maybe a region server failed
>>>>>>> for some reason while it was hosting .META.
>>>>>>>
>>>>>>> - You probably want to fix your table. Use the bin/add_table.rb
>>>>>>> script (other people on this list used it in the past; search the
>>>>>>> archive for more info).
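
To look for the hole mentioned in the first point, a scan of .META. from the
HBase shell around the table's region rows is usually enough; the start row
and limit below are only examples, adjust them to the region name in the log:

  hbase(main):001:0> scan '.META.', {STARTROW => 'Spam_MsgEventTable,2010-06-28', LIMIT => 10}

Each row's info:regioninfo column prints the region's STARTKEY and ENDKEY; a
hole shows up as a region whose ENDKEY does not match the STARTKEY of the next
region of the same table.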
>>>>>>>
>>>>>>> Finally (whew!), if you are still developing your solution around
>>>>>>> HBase, you might want to try out one of our dev releases that does
>>>>>>> work with a durable Hadoop release. See
>>>>>>> http://hbase.apache.org/docs/r0.89.20100621/ for more info. Cloudera's
>>>>>>> CDH3b2 also has everything you need.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Thu, Jul 1, 2010 at 12:03 PM, Jean-Daniel Cryans
>>>>>>> <jdcryans@apache.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> 653 regions is very low, even if you had a total of 3 region servers
>>>>>>>> I wouldn't expect any problem.
>>>>>>>>
>>>>>>>> So to me it seems to point towards either a configuration issue or
>>>>>>>> a usage issue. Can you:
>>>>>>>>
>>>>>>>>  - Put the log of one region server that OOMEd on a public server.
>>>>>>>>  - Tell us more about your setup: # of nodes, hardware,
>>>>>>>>    configuration file
>>>>>>>>  - Tell us more about how you insert data into HBase
>>>>>>>>
>>>>>>>> And BTW are you trying to do an initial import of your data set? If
>>>>>>>> so, have you considered using HFileOutputFormat?
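
For what that could look like, here is a very rough outline of a job that
writes HFiles instead of doing live puts. This is only a sketch against the
0.20-era org.apache.hadoop.hbase.mapreduce API; the mapper, output path and
the exact load step are placeholders, so check the HFileOutputFormat
documentation for your release:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class BulkImportJob {
    // Outline only: the supplied mapper must emit the row key as an
    // ImmutableBytesWritable and each cell as a KeyValue, and the job output
    // must end up totally ordered so HFileOutputFormat can write valid HFiles.
    public static Job create(Configuration conf,
                             Class<? extends Mapper> mapperClass) throws Exception {
      Job job = new Job(conf, "bulk import");
      job.setJarByClass(BulkImportJob.class);
      job.setMapperClass(mapperClass);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(KeyValue.class);
      job.setOutputFormatClass(HFileOutputFormat.class);
      FileOutputFormat.setOutputPath(job, new Path("/tmp/hfile-output"));
      return job;
    }
  }

The generated HFiles are then handed to HBase with the bulk-load script that
ships with the release, after which the table serves them directly instead of
going through the normal write path.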
>>>>>>>>
>>>>>>>> Thx,
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Thu, Jul 1, 2010 at 11:52 AM, Jinsong Hu 
>>>>>>>> <jinsong_hu@hotmail.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi, Sir:
>>>>>>>>>  I am using hbase 0.20.5 and this morning I found that 3 of my region
>>>>>>>>> servers ran out of memory. Each regionserver is given 6G of memory,
>>>>>>>>> and I have 653 regions in total; the max store size is 256M. I
>>>>>>>>> analyzed the heap dump and it shows that there are too many HRegion
>>>>>>>>> objects in memory.
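
For reference, the two knobs being traded off here are normally set roughly
like this (a sketch using the values mentioned in this thread; file locations
assume a standard install):

  <!-- conf/hbase-site.xml: max store file size before a region splits -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value>
  </property>

  # conf/hbase-env.sh: heap given to each HBase daemon, in MB
  export HBASE_HEAPSIZE=6000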
>>>>>>>>>
>>>>>>>>>  Previously I set the max store size to 2G, but then I found that the
>>>>>>>>> region server constantly does minor compactions and the CPU usage is
>>>>>>>>> very high. It also blocks the heavy client record insertion.
>>>>>>>>>
>>>>>>>>>  So now I am limited on one side by memory and on the other side by
>>>>>>>>> CPU. Is there any way to get out of this dilemma?
>>>>>>>>>
>>>>>>>>> Jimmy.
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> 
