hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7009) Active NN and standby NN have different live nodes
Date Sat, 04 Oct 2014 02:15:34 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158881#comment-14158881
] 

Hadoop QA commented on HDFS-7009:
---------------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12672870/HDFS-7009-2.patch
  against trunk revision 7f6ed7f.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 1 new or modified
test files.

      {color:red}-1 javac{color}.  The applied patch generated 1280 javac compiler warnings
(more than the trunk's current 1266 warnings).

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:red}-1 findbugs{color}.  The patch appears to introduce 2 new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
                  org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover
                  org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits

    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
Javac warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/8320//console

This message is automatically generated.

> Active NN and standby NN have different live nodes
> --------------------------------------------------
>
>                 Key: HDFS-7009
>                 URL: https://issues.apache.org/jira/browse/HDFS-7009
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7009-2.patch, HDFS-7009.patch
>
>
> To follow up on https://issues.apache.org/jira/browse/HDFS-6478, in most cases, given
DN sends HB and BR to NN regularly, if a specific RPC call fails, it isn't a big deal.
> However, there are cases where DN fails to register with NN during initial handshake
due to exceptions not covered by RPC client's connection retry. When this happens, the DN
won't talk to that NN until the DN restarts.
> {noformat}
> BPServiceActor
>   public void run() {
>     LOG.info(this + " starting to offer service");
>     try {
>       // init stuff
>       try {
>         // setup storage
>         connectToNNAndHandshake();
>       } catch (IOException ioe) {
>         // Initial handshake, storage recovery or registration failed
>         // End BPOfferService thread
>         LOG.fatal("Initialization failed for block pool " + this, ioe);
>         return;
>       }
>       initialized = true; // bp is initialized;
>       
>       while (shouldRun()) {
>         try {
>           offerService();
>         } catch (Exception ex) {
>           LOG.error("Exception in BPOfferService for " + this, ex);
>           sleepAndLogInterrupts(5000, "offering service");
>         }
>       }
> ...
> {noformat}
> Here is an example of the call stack.
> {noformat}
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.;
Host Details : local host is: "xxx"; destination host is: "yyy":8030;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:761)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1239)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at com.sun.proxy.$Proxy9.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:146)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:623)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:225)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:949)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:844)
> {noformat}
> This will create discrepancy between active NN and standby NN in terms of live nodes.
>  
> Here is a possible scenario of missing blocks after failover.
> 1. DN A, B set up handshakes with active NN, but not with standby NN.
> 2. A block is replicated to DN A, B and C.
> 3. From standby NN's point of view, given A and B are dead nodes, the block is under
replicated.
> 4. DN C is down.
> 5. Before active NN detects DN C is down, it fails over.
> 6. The new active NN considers the block is missing. Even though there are two replicas
on DN A and B.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message