hadoop-common-issues mailing list archives

From "Stephen Chu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11328) ZKFailoverController.java does not log Exception and causes latent problems during failover
Date Mon, 24 Nov 2014 19:37:12 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223378#comment-14223378 ]

Stephen Chu commented on HADOOP-11328:
--------------------------------------

Thanks, Tianyin. I agree the log will be helpful. +1 (non-binding)

> ZKFailoverController.java does not log Exception and causes latent problems during failover
> -------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11328
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11328
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.5.1
>            Reporter: Tianyin Xu
>         Attachments: ZKFailoverController.log.exception.1.patch
>
>
> In _ZKFailoverController.java_, the _Exception_ caught by the _run()_ method is never logged. This causes latent problems that only manifest during failover.
> h5. The problem we encountered
> An _Exception_ is thrown from the _doRun()_ method during _initHM()_ (caused by a configuration error). To reproduce it, set "_ha.health-monitor.connect-retry-interval.ms_" to any nonsensical value.
> {code:title=ZKFailoverController.java|borderStyle=solid}
>   private int doRun(String[] args) ... {
>     ...
>     initRPC();
>     initHM();   // throws here when the configuration value is invalid
>     startRPC();
>     ...
>   }
> {code}
> The Exception is caught in the _run()_ method, as follows:
> {code:title=ZKFailoverController.java|borderStyle=solid}
>   public int run(final String[] args) throws Exception {
>     ...
>     try {
>       ...
>         @Override
>         public Integer run() {
>           try {
>             return doRun(args);
>           } catch (Exception t) {
>             throw new RuntimeException(t);
>           } finally {
>             if (elector != null) {
>               elector.terminateConnection();
>             }
>           }
>         }
>       });
>     } catch (RuntimeException rte) {
>       throw (Exception)rte.getCause();
>     }
>   }
> {code}
> Unfortunately, the Exception (which causes the shutdown of the process) is *not logged at all*. This causes latent errors that only manifest during failover (because ZKFC is dead). The tricky part is that everything looks perfectly fine: the _jps_ command shows a running DFSZKFailoverController process, and the two NameNodes (active and standby) work fine.

> h5. Patch
> We strongly suggest adding an error log entry for the caught exception, such as:
> --- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (revision 1641307)
> +++ hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ha/ZKFailoverController.java (working copy)
> {code:title=@@ -178,6 +178,7 @@|borderStyle=solid}
>          }
>        });
>      } catch (RuntimeException rte) {
> +      LOG.fatal("The failover controller encounters runtime error: " + rte);
>        throw (Exception)rte.getCause();
>      }
>    }
> {code}
> Thanks!
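
For readers who want to try the pattern outside Hadoop, the following is a minimal standalone sketch of the wrap, log, and rethrow shape described above. It is not the actual ZKFailoverController code: the class name WrapAndLogSketch, the simplified doRun(String) signature, and the use of java.util.logging in place of Hadoop's own LOG are all assumptions made for illustration. Unlike the proposed one-line patch, the sketch passes the cause to the logger so that the stack trace is recorded as well.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

public class WrapAndLogSketch {
    // Assumption: java.util.logging stands in for Hadoop's logger.
    private static final Logger LOG =
        Logger.getLogger(WrapAndLogSketch.class.getName());

    // Hypothetical stand-in for doRun(): a nonsensical interval makes startup fail.
    static int doRun(String intervalMs) throws Exception {
        int parsed = Integer.parseInt(intervalMs.trim()); // NumberFormatException if not a number
        if (parsed <= 0) {
            throw new IllegalArgumentException("interval must be positive: " + parsed);
        }
        return 0;
    }

    // Same shape as run() in the report: wrap inside the callback, unwrap outside,
    // but log the failure before rethrowing so it shows up in the daemon log.
    static int run(final String intervalMs) throws Exception {
        try {
            Supplier<Integer> action = () -> {
                try {
                    return doRun(intervalMs);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            };
            return action.get();
        } catch (RuntimeException rte) {
            // Every failure reaching this point was wrapped above, so getCause() is non-null.
            LOG.log(Level.SEVERE, "Startup failed; rethrowing the original cause", rte.getCause());
            throw (Exception) rte.getCause();
        }
    }

    public static void main(String[] args) {
        try {
            run("not-a-number");
        } catch (Exception e) {
            System.out.println("rethrown: " + e.getClass().getSimpleName());
        }
    }
}
```

Running main writes a SEVERE entry with the stack trace to the log and then prints the type of the rethrown exception, so the failure is visible even if the caller swallows it.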



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
