hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Created: (HBASE-3379) Log splitting slowed by repeated attempts at connecting to downed datanode
Date Tue, 21 Dec 2010 17:50:01 GMT
Log splitting slowed by repeated attempts at connecting to downed datanode
--------------------------------------------------------------------------

                 Key: HBASE-3379
                 URL: https://issues.apache.org/jira/browse/HBASE-3379
             Project: HBase
          Issue Type: Bug
          Components: wal
            Reporter: stack
            Priority: Critical


In testing, if I kill the RS and DN on a node, log splitting takes longer because we doggedly
keep trying to connect to the downed DN to get WAL blocks.  Here's the cycle I see:

{code}
2010-12-21 17:34:48,239 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_900551257176291912_1203821 failed  because recovery from primary datanode 10.20.20.182:10010 failed 5 times.    Pipeline was 10.20.20.184:10010, 10.20.20.186:10010, 10.20.20.182:10010. Will retry...
2010-12-21 17:34:50,240 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 0 time(s).
2010-12-21 17:34:51,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 1 time(s).
2010-12-21 17:34:52,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 2 time(s).
2010-12-21 17:34:53,242 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 3 time(s).
2010-12-21 17:34:54,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 4 time(s).
2010-12-21 17:34:55,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 5 time(s).
2010-12-21 17:34:56,244 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 6 time(s).
2010-12-21 17:34:57,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 7 time(s).
2010-12-21 17:34:58,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 8 time(s).
2010-12-21 17:34:59,246 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.20.20.182:10020. Already tried 9 time(s).
2010-12-21 17:34:59,246 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #5 from primary datanode 10.20.20.182:10010
java.net.ConnectException: Call to /10.20.20.182:10020 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)
    at org.apache.hadoop.ipc.Client.call(Client.java:743)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
...
{code}

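The "recovery from primary datanode" step is attempted 5 times (hardcoded).  Within each of
those attempts, the IPC client retries the connect as many times as this setting allows
(default 10, which matches the "Already tried 0" through "Already tried 9" lines above):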
{code}
this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);
{code}

We should get the hardcoded 5 attempts fixed, and we should document ipc.client.connect.max.retries
as an important config.  We should recommend bringing it down from the default of 10.
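
To put a rough number on why lowering the setting helps, the two retry loops multiply. The sketch
below is only back-of-the-envelope arithmetic, not DFSClient code: the class and method names are
hypothetical, and the one second per connect retry is an assumption read off the log timestamps
above.  It ignores the pause between recovery attempts and the fact that the whole cycle may
repeat ("Will retry..."), so treat the result as a lower bound per cycle.

```java
// Hypothetical helper, not part of Hadoop or HBase.
final class RecoveryStallEstimate {
    // Number of "recovery from primary datanode" attempts, hardcoded per the report.
    static final int RECOVERY_ATTEMPTS = 5;

    // Assumed spacing between connect retries, taken from the log timestamps.
    static final long SECONDS_PER_CONNECT_RETRY = 1;

    // Approximate seconds spent on one block before one recovery cycle gives up.
    static long stallSeconds(int recoveryAttempts, int connectRetries, long secondsPerRetry) {
        return (long) recoveryAttempts * connectRetries * secondsPerRetry;
    }

    public static void main(String[] args) {
        // Default ipc.client.connect.max.retries of 10.
        System.out.println("max.retries=10: ~"
            + stallSeconds(RECOVERY_ATTEMPTS, 10, SECONDS_PER_CONNECT_RETRY) + "s per cycle");
        // An illustrative lower value; 3 is an example, not a tested recommendation.
        System.out.println("max.retries=3:  ~"
            + stallSeconds(RECOVERY_ATTEMPTS, 3, SECONDS_PER_CONNECT_RETRY) + "s per cycle");
    }
}
```

Under these assumptions the default works out to about 50 seconds per cycle for a single block,
and dropping the connect retries to 3 would cut that to about 15 seconds.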

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

