hadoop-hdfs-issues mailing list archives

From "Vincent Sheffer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5677) Need error checking for HA cluster configuration
Date Thu, 02 Jan 2014 17:31:52 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860369#comment-13860369 ]

Vincent Sheffer commented on HDFS-5677:

I've found a good place and condition to check in *DFSUtil* where a warning message could be added.  The
check would run at startup, which is good if someone is monitoring the logs at startup.

For concreteness, here is the relevant fragment from my hdfs-site.xml file:
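(The fragment itself did not survive the archive.  To illustrate the shape of the problem, an hdfs-site.xml along these lines would reproduce it; hostnames here are placeholders, and the key detail is *vince2* in the namenodes list versus *vince-2* in the address keys:)

```xml
<property>
  <name>dfs.nameservices</name>
  <value>myCluster</value>
</property>
<property>
  <!-- Typo: "vince2" here should be "vince-2" -->
  <name>dfs.ha.namenodes.myCluster</name>
  <value>vince-1,vince2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myCluster.vince-1</name>
  <value>vince-1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myCluster.vince-2</name>
  <value>vince-2.example.com:8020</value>
</property>
```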

The relevant portion of the DFSUtil code is:
{code}
  private static Map<String, InetSocketAddress> getAddressesForNameserviceId(
      Configuration conf, String nsId, String defaultValue,
      String[] keys) {
    Collection<String> nnIds = getNameNodeIds(conf, nsId);
    Map<String, InetSocketAddress> ret = Maps.newHashMap();
    for (String nnId : emptyAsSingletonNull(nnIds)) {
      String suffix = concatSuffixes(nsId, nnId);
      String address = getConfValue(defaultValue, suffix, conf, keys);
      if (address != null) {
        InetSocketAddress isa = NetUtils.createSocketAddr(address);
        ret.put(nnId, isa);
      }
    }
    return ret;
  }
{code}

For my node with the missing hyphen (vince2), the resulting entry in the map for the InetSocketAddress
of vince2 will be *myCluster:8020*, which will never resolve.  And even though I do have valid
properties for vince-2, those are ignored due to the typo.

My question is: why pass *myCluster:8020* as the default value to *getConfValue* (null might
be better here) when it can never be a valid hostname in this case and will never resolve?
 My hunch is that this code works fine in the non-HA case, which may make changing it a bit
tricky.  If, on the other hand, this code path isn't taken in the non-HA case, then it may
be pretty easy to provide better configuration validation.  I'm new to Hadoop development,
so I don't have a good sense of what sort of hornet's nest I may be kicking in trying
to make the configuration validation absolutely bulletproof.
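To make the idea concrete, here is a hypothetical sketch (the names and placement are mine, not the actual patch) of the kind of check that could run after the map is built, using plain java.net types:

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HaConfigCheck {
  /**
   * Hypothetical helper: given the nnId -> address map built by
   * getAddressesForNameserviceId, return the IDs whose address never
   * resolved -- e.g. when the nameservice ID itself leaked in as the
   * hostname (myCluster:8020).
   */
  static List<String> unresolvedNameNodes(Map<String, InetSocketAddress> addrs) {
    List<String> bad = new ArrayList<>();
    for (Map.Entry<String, InetSocketAddress> e : addrs.entrySet()) {
      if (e.getValue().isUnresolved()) {
        bad.add(e.getKey());
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    Map<String, InetSocketAddress> addrs = new LinkedHashMap<>();
    // vince-1 is configured correctly; vince2 (the typo) fell back to the
    // nameservice default and can never resolve.
    addrs.put("vince-1", new InetSocketAddress("127.0.0.1", 8020));
    addrs.put("vince2", InetSocketAddress.createUnresolved("myCluster", 8020));
    for (String nnId : unresolvedNameNodes(addrs)) {
      System.err.println("WARN: NameNode " + nnId
          + " has an unresolvable address; check dfs.namenode.rpc-address"
          + " and dfs.ha.namenodes for a typo");
    }
  }
}
```

Something like this would warn once, at startup, for exactly the vince2 case above.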

The bottom line for now: I have a simple patch that will, at least, log the problem with
the unresolved entry on startup.

That message is at least a minor improvement, in that somewhere in the logs there will be information
useful to someone doing troubleshooting.  If the problem doesn't manifest itself until the
primary NN goes down, however, then this fix won't be as useful, since the more informative
message might be buried in the log file.

A slightly better fix may be to tweak the ongoing message (the one in the original
description of this Jira is recurring) to better reflect the condition being reported and
to direct the engineer to where in the configuration the likely culprit is.

> Need error checking for HA cluster configuration
> ------------------------------------------------
>                 Key: HDFS-5677
>                 URL: https://issues.apache.org/jira/browse/HDFS-5677
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, ha
>    Affects Versions: 2.0.6-alpha
>         Environment: centos6.5, oracle jdk6 45, 
>            Reporter: Vincent Sheffer
>            Assignee: Vincent Sheffer
>            Priority: Minor
> If a node is declared in the *dfs.ha.namenodes.myCluster* but is _not_ later defined in
> subsequent *dfs.namenode.servicerpc-address.myCluster.nodename* or *dfs.namenode.rpc-address.myCluster.XXX*
> properties, no error or warning message is provided to indicate that.
> The only indication of a problem is a log message like the following:
> {code}
> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: myCluster:8020
> {code}
> Another way to look at this is that no error or warning is provided when a servicerpc-address/rpc-address
> property is defined for a node without a corresponding node declared in *dfs.ha.namenodes.myCluster*.
> This arose when I had a typo in the *dfs.ha.namenodes.myCluster* property for one of
> my node names.  It would be very helpful to have at least a warning message on startup if
> there is a configuration problem like this.

This message was sent by Atlassian JIRA
