hbase-user mailing list archives

From: Andy Sautins <andy.saut...@returnpath.net>
Subject: RE: DFS stability running HBase and dfs.datanode.handler.count...
Date: Mon, 18 Apr 2011 16:33:15 GMT

   J-D, thanks for the response.  I suspect you are right that the error message was a red
herring.  However, I don't see "Client session timed out" in any of our logs when the region
server dies, so I'm not sure it is the situation you are thinking of.  Note that we are
going to enable GC logging to get an idea of how long our region servers pause in GC.
We also set hbase.hregion.memstore.mslab.enabled to true in hbase-site.xml, which seems to
have helped, although I don't have many data points yet to confirm.  Before we enabled
mslab we had region servers die every day or two; after enabling it we just went 4 days
without a region server dying, which is an improvement.
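
   For reference, the two changes look roughly like the following ( the GC flags are just one
common combination and the GC log path is only an example; adjust for your JVM and install ):

  <!-- hbase-site.xml -->
  <property>
    <name>hbase.hregion.memstore.mslab.enabled</name>
    <value>true</value>
  </property>

  # hbase-env.sh -- turn on GC logging for the region server JVMs
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"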

   Following up on the latest region server that died, it appears to have died because it
couldn't close the hlog, which sounds like the situation you mentioned in your response,
although I'm not sure I fully understand the IO fencing.  The stack traces are below.

   Are there any configuration parameters that could keep the region server from dying
when it encounters this situation?  We generally just restart the region server and it
recovers nicely, but it would be ideal not to have to log in and restart region servers
because of this.

   At this point it sounds like my best bet is to determine whether we are still seeing
big GC pauses even with mslab enabled, to confirm whether that is our problem.
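
   Once GC logging is in place, a quick pass like the one below should flag any long
stop-the-world pauses ( this assumes the usual -XX:+PrintGCDetails output with
"real=N.NN secs" timings; the 10-second threshold and log path are arbitrary ):

  # print every GC event whose wall-clock ("real") time exceeded 10 seconds
  grep -oE 'real=[0-9]+\.[0-9]+' /var/log/hbase/gc-hbase.log | awk -F= '$2 > 10'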

   Thanks for your help

   Andy



2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
Requesting close of hlog
java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because
recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010.
Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2477)
2011-04-18 02:04:12,862 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append.
Requesting close of hlog
java.io.IOException: Error Recovery for block blk_-4878958038737199123_30707596 failed  because
recovery from primary datanode 10.18.0.28:50010 failed 6 times.  Pipeline was 10.18.0.28:50010.
Aborting...

  The above exception seems to correlate with the following error on the datanode.

2011-04-18 02:04:08,773 WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol:
Failed to getBlockMetaDataInfo for block (=blk_-4878958038737199123_30707596) from datanode
(=10.18.0.28:50010)
java.io.IOException: Block blk_-4878958038737199123_30711904 length is 45855569 does not match
block file length 45853290
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.validateBlockMetadata(FSDataset.java:1661)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.startBlockRecovery(FSDataset.java:1921)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startBlockRecovery(DataNode.java:1658)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1772)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1930)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1415)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1411)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1409)
2011-04-18 02:04:08,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020,
call recoverBlock(blk_-4878958038737199123_30707596, false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@2cda6e31)
from 10.18.0.28:52651: error: java.io.IOException: All datanodes failed: block=blk_-4878958038737199123_30707596,
datanodeids=[10.18.0.28:50010]

  And the block does look like it is associated with a log file:

2011-04-18 01:37:17,661 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock:
/user/hbase/.logs/hd28.dfs.returnpath.net,60020,1302812828956/hd28.dfs.returnpath.net%3A60020.1303112237535.
blk_-4878958038737199123_30707596
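
  For anyone retracing this, I just followed the block id through the namenode log with
something along the lines of the command below ( the log path will vary by install ):

  grep 'blk_-4878958038737199123' /var/log/hadoop/hadoop-*-namenode-*.log*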




-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Thursday, April 14, 2011 11:20 AM
To: user@hbase.apache.org
Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...

This is probably a red herring.  For example, if the region server had a
big GC pause, the master could have already split the log and the
region server wouldn't be able to close it (that's our version of IO
fencing).  So from that exception, look back in the log and see if
there's anything like:

INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have
not heard from server in some_big_number ms
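
A grep along these lines should turn that up if it's there ( adjust the log path and file
names for your install ):

  grep "Client session timed out" /var/log/hbase/hbase-*-regionserver-*.log*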

J-D

On Thu, Apr 14, 2011 at 7:24 AM, Andy Sautins
<andy.sautins@returnpath.net> wrote:
>
>  Thanks for the response, Stack.  Yes, we tried increasing dfs.datanode.handler.count
to 8.  At this point I would say it didn't seem to resolve the issue we are seeing, but it
also doesn't seem to be hurting anything, so for now we're going to leave it at 8 while we
continue to debug ( a minimal hdfs-site.xml sketch of this setting is included at the end
of this message ).
>
>  In regard to the original error I posted ( Block 'x' is not valid ), we have chased
that down thanks to your suggestion of looking at the logs for the history of the block.  It
_looks_ like our 'is not valid' block errors are unrelated and are due to chmod'ing or deleting
mapreduce output directories directly after a run.  We are still isolating that, but it looks
like it's not HBase related, so I'll move that to another list.  Thank you very much for your
debugging suggestions.
>
>   The one issue we are still seeing is that we will occasionally have a region server
die with the following exception.  I need to chase it down a little more, but it seems similar
to a post from 2/13/2011 (http://www.mail-archive.com/user@hbase.apache.org/msg05550.html
) that I'm not sure was ever resolved.  If anyone has any insight on how to debug the
following error further, I would appreciate your thoughts.
>
> 2011-04-14 06:05:13,001 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file
/user/hbase/.logs/hd10.dfs.returnpath.net,60020,1302555127291/hd10.dfs.returnpath.net%3A60020.1302781635921
: java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed  because
recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
Aborting...
> java.io.IOException: Error Recovery for block blk_1315316969665710488_29842654 failed
 because recovery from primary datanode 10.18.0.16:50010 failed 6 times.  Pipeline was 10.18.0.16:50010.
Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2841)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2305)
>
> Other than the above exception occasionally causing a region server to die, everything
seems to be working well.
>
> Note we have now upgraded to Cloudera CDH Version 3 Update 0 ( hadoop 0.20.2+923.21 and
hbase 0.90.1+15.18 ) and still see the above exception.  We do have ulimit set ( memory unlimited
and files 32k ) for the user running hbase.
>
> Thanks again for your help
>
>  Andy
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Sunday, April 10, 2011 1:16 PM
> To: user@hbase.apache.org
> Cc: Andy Sautins
> Subject: Re: DFS stability running HBase and dfs.datanode.handler.count...
>
> Did you try upping it Andy?  Andrew Purtell's recommendation, though old, would have come
from experience.  The Intel article reads like sales but there is probably merit to its
suggestion.  The Cloudera article is less certain about the effect of upping handlers, though
it allows it "...could be set a bit higher."
>
> I just looked at our prod frontend and it's still set to 3.  I don't see your exceptions
in our DN log.
>
> What version of hadoop?  You say hbase 0.91.  You mean 0.90.1?
>
> ulimit and nproc are set sufficiently high for hadoop/hbase user?
>
> If you grep 163126943925471435_28809750 in namenode log, do you see a delete occur before
a later open?
>
> St.Ack
>
> On Sat, Apr 9, 2011 at 4:35 PM, Andy Sautins <andy.sautins@returnpath.net> wrote:
>>
>>    I ran across a mailing list posting from 1/4/2009 that seemed to indicate that increasing
dfs.datanode.handler.count could help improve DFS stability (http://mail-archives.apache.org/mod_mbox/hbase-user/200901.mbox/%3C49605FE0.9040509@duboce.net%3E
).  The posting seems to indicate the wiki was updated, but I don't see anything in the
wiki about increasing dfs.datanode.handler.count.  I have seen a few other examples of raising
dfs.datanode.handler.count, including one from an Intel article (http://software.intel.com/en-us/articles/hadoop-and-hbase-optimization-for-read-intensive-search-applications/
) and the Pro Hadoop book; other than that, the only other mention I see is a Cloudera post
that seems lukewarm on increasing dfs.datanode.handler.count (http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/
).
>>
>>    Given that the post is from 2009, I thought I'd ask whether anyone has had success
improving the stability of HBase/DFS by increasing dfs.datanode.handler.count.  The specific
error we are seeing fairly frequently ( a few hundred times per day ) in the datanode logs
is as follows:
>>
>> 2011-04-09 00:12:48,035 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.18.0.33:50010,
>> storageID=DS-1501576934-10.18.0.33-50010-1296248656454,
>> infoPort=50075, ipcPort=50020):DataXceiver
>> java.io.IOException: Block blk_-163126943925471435_28809750 is not valid.
>>
>>   The above seems to correspond to ClosedChannelExceptions in the hbase regionserver
logs, as well as some warnings about long writes to the hlog ( some taking 50+ seconds ).
>>
>>    The biggest end-user-facing issue we are seeing is that Task Trackers keep getting
blacklisted.  It's quite possible our problem is unrelated to HBase, but I thought it was
worth asking given what we've been seeing.
>>
>>   We are currently running 0.91 on an 18 node cluster with ~3k total regions and
each region server is running with 2G of memory.
>>
>>   Any insight would be appreciated.
>>
>>   Thanks
>>
>>    Andy
>>
>
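
For reference, the dfs.datanode.handler.count setting discussed above lives in hdfs-site.xml
on the datanodes; a minimal sketch, with 8 as the value being tried in this thread and 3 as
the default in this version of Hadoop:

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>8</value>
  </property>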
