hadoop-hdfs-issues mailing list archives

From "Nikola Vujic (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-5846) Assigning DEFAULT_RACK in resolveNetworkLocation method can break data resiliency
Date Wed, 29 Jan 2014 18:38:11 GMT
Nikola Vujic created HDFS-5846:
----------------------------------

             Summary: Assigning DEFAULT_RACK in resolveNetworkLocation method can break data resiliency
                 Key: HDFS-5846
                 URL: https://issues.apache.org/jira/browse/HDFS-5846
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Nikola Vujic
            Assignee: Nikola Vujic


Method CachedDNSToSwitchMapping::resolve() can return null, which requires careful handling.
Null can be returned in two cases:
• An error occurs during topology script execution (for example, the script crashes).
• The script returns a number of values other than the number expected.

The critical handling is in the DN registration code, which is responsible for assigning
proper topology paths to all registered datanodes. The existing code handles this null
return value in the following way ({{resolveNetworkLocation}} method):
{code}
    // resolve its network location
    List<String> rName = dnsToSwitchMapping.resolve(names);
    String networkLocation;
    if (rName == null) {
      LOG.error("The resolve call returned null! Using " + 
          NetworkTopology.DEFAULT_RACK + " for host " + names);
      networkLocation = NetworkTopology.DEFAULT_RACK;
    } else {
      networkLocation = rName.get(0);
    }
    return networkLocation;
{code}

The line of code that assigns the default rack:
{code} networkLocation = NetworkTopology.DEFAULT_RACK; {code}
can cause a serious problem. If resolve() returns null for any reason, the default rack is
assigned as the DN's network location and the DN's registration finishes successfully. Under
these circumstances, data can be loaded into a cluster that is working with a wrong
topology. A wrong topology means that fault domains are not honored.
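
For illustration, one possible stricter handling is to fail the registration instead of falling back to the default rack. The following is only a sketch under stated assumptions, not a proposed patch: the {{UnresolvedLocationException}} type, the {{rejectUnresolved}} flag and the {{StrictLocationResolver}} class are hypothetical names, and the {{Function}} parameter stands in for {{dnsToSwitchMapping.resolve(names)}}.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical exception type; it does not appear in the code quoted above.
class UnresolvedLocationException extends RuntimeException {
  UnresolvedLocationException(String message) {
    super(message);
  }
}

class StrictLocationResolver {
  // Assumed to mirror the value of NetworkTopology.DEFAULT_RACK.
  static final String DEFAULT_RACK = "/default-rack";

  // 'resolver' stands in for dnsToSwitchMapping.resolve(names).
  // 'rejectUnresolved' is a hypothetical switch between the old and the strict behavior.
  static String resolveNetworkLocation(
      Function<List<String>, List<String>> resolver,
      List<String> names,
      boolean rejectUnresolved) {
    List<String> rName = resolver.apply(names);
    if (rName == null || rName.isEmpty()) {
      if (rejectUnresolved) {
        // Fail loudly so the datanode is not registered with a wrong topology path.
        throw new UnresolvedLocationException(
            "Unresolved topology mapping for host " + names);
      }
      // Old behavior: silently fall back to the default rack.
      return DEFAULT_RACK;
    }
    return rName.get(0);
  }
}
```

With {{rejectUnresolved}} set to true, a null (or empty) result makes registration fail visibly; with false, the sketch behaves like the existing code.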

For the end user, this means that two data replicas can end up in the same fault domain, so
a single failure can cause the loss of two or more replicas. The cluster would be in an
inconsistent state without being aware of it, and everything would appear to work
normally. Practically the only way to notice that something went wrong is to look in the
log for the error:
{code}
LOG.error("The resolve call returned null! Using " + 
NetworkTopology.DEFAULT_RACK + " for host " + names);
{code}
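
The fault domain problem described above can be shown with a small toy model. This is a sketch under assumptions, not HDFS code: the host names, rack names and the {{DefaultRackFallbackDemo}} class are made up for the example, and "placement" is reduced to counting distinct racks.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class DefaultRackFallbackDemo {
  // Assumed to mirror the value of NetworkTopology.DEFAULT_RACK.
  static final String DEFAULT_RACK = "/default-rack";

  // Hypothetical physical layout: dn1 and dn2 share a rack, dn3 is elsewhere.
  static Map<String, String> physicalTopology() {
    Map<String, String> rackOf = new HashMap<>();
    rackOf.put("dn1", "/rackA");
    rackOf.put("dn2", "/rackA");
    rackOf.put("dn3", "/rackB");
    return rackOf;
  }

  // Topology as registered when resolve() returned null for 'failedHost'
  // and the default rack was assigned instead.
  static Map<String, String> registeredTopology(
      Map<String, String> physical, String failedHost) {
    Map<String, String> rackOf = new HashMap<>(physical);
    rackOf.put(failedHost, DEFAULT_RACK);
    return rackOf;
  }

  // Number of distinct racks the given hosts span according to 'rackOf'.
  static int distinctRacks(Map<String, String> rackOf, String... hosts) {
    Set<String> racks = new HashSet<>();
    for (String host : hosts) {
      racks.add(rackOf.get(host));
    }
    return racks.size();
  }

  public static void main(String[] args) {
    Map<String, String> physical = physicalTopology();
    Map<String, String> registered = registeredTopology(physical, "dn2");
    // The registered view reports two racks for replicas on dn1 and dn2 ...
    System.out.println("registered racks: " + distinctRacks(registered, "dn1", "dn2"));
    // ... but physically both replicas sit in the same rack.
    System.out.println("physical racks:   " + distinctRacks(physical, "dn1", "dn2"));
  }
}
```

In this model, replicas on dn1 and dn2 look rack-diverse to anything that trusts the registered topology, while a single failure of /rackA takes out both.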
 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
