hadoop-hdfs-issues mailing list archives

From "Nikola Vujic (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-5846) Assigning DEFAULT_RACK in resolveNetworkLocation method can break data resiliency
Date Fri, 31 Jan 2014 11:04:09 GMT

     [ https://issues.apache.org/jira/browse/HDFS-5846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikola Vujic updated HDFS-5846:
-------------------------------

    Attachment: hdfs-5846.patch

I'm attaching a patch. I have changed the {{resolveNetworkLocation}} method to throw an exception and
added a new method, {{resolveNetworkLocationWithFallBackToDefaultLocation}}, for callers that want
to keep the default location when the DNS-to-switch mapping fails.
I'm also adding a configuration property named {{dfs.namenode.reject-unresolved-dn-topology-mapping}}
(default value: {{false}}). The {{registerDatanode}} method uses this property to decide
whether to call {{resolveNetworkLocation}} or {{resolveNetworkLocationWithFallBackToDefaultLocation}}.
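
For readers without the attachment, the behavior described above could look roughly like the following self-contained sketch. It is not the patch itself: {{SwitchMapping}} is a hypothetical stand-in for the DNS-to-switch mapping, {{UnresolvedTopologyException}} and {{locationForRegistration}} are names assumed for illustration, and the {{rejectUnresolved}} flag stands in for the value of {{dfs.namenode.reject-unresolved-dn-topology-mapping}}.

```java
import java.util.List;

final class TopologySketch {
    // Same value as NetworkTopology.DEFAULT_RACK in Hadoop.
    static final String DEFAULT_RACK = "/default-rack";

    /** Hypothetical stand-in for the DNS-to-switch mapping; returns null on failure. */
    interface SwitchMapping {
        List<String> resolve(List<String> names);
    }

    /** Hypothetical exception type; unchecked only to keep the sketch short. */
    static final class UnresolvedTopologyException extends RuntimeException {
        UnresolvedTopologyException(String message) {
            super(message);
        }
    }

    /** Strict variant: a failed resolution is reported, never silently defaulted. */
    static String resolveNetworkLocation(SwitchMapping mapping, List<String> names) {
        List<String> rName = mapping.resolve(names);
        if (rName == null) {
            throw new UnresolvedTopologyException(
                "Unresolved topology mapping for host " + names);
        }
        return rName.get(0);
    }

    /** Lenient variant: keeps the old behavior of falling back to the default rack. */
    static String resolveNetworkLocationWithFallBackToDefaultLocation(
            SwitchMapping mapping, List<String> names) {
        try {
            return resolveNetworkLocation(mapping, names);
        } catch (UnresolvedTopologyException e) {
            return DEFAULT_RACK;
        }
    }

    /** Mimics the choice made during datanode registration (assumed shape). */
    static String locationForRegistration(
            SwitchMapping mapping, List<String> names, boolean rejectUnresolved) {
        return rejectUnresolved
            ? resolveNetworkLocation(mapping, names)
            : resolveNetworkLocationWithFallBackToDefaultLocation(mapping, names);
    }
}
```

With the flag left at {{false}}, a mapping that returns null still yields {{/default-rack}}, as before; with the flag set to {{true}}, the same call throws, so the registration can be rejected instead of completing with a wrong location.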

> Assigning DEFAULT_RACK in resolveNetworkLocation method can break data resiliency
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-5846
>                 URL: https://issues.apache.org/jira/browse/HDFS-5846
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Nikola Vujic
>            Assignee: Nikola Vujic
>         Attachments: hdfs-5846.patch
>
>
> Method CachedDNSToSwitchMapping::resolve() can return NULL, which requires careful handling.
NULL can be returned in two cases:
> • An error occurred during topology script execution (the script crashes).
> • The script returns the wrong number of values (other than expected).
> The critical handling is in the DN registration code, which is responsible
for assigning proper topology paths to all registered datanodes. The existing code handles this
NULL value in the following way ({{resolveNetworkLocation}} method):
> {code}
>     // resolve its network location
>     List<String> rName = dnsToSwitchMapping.resolve(names);
>     String networkLocation;
>     if (rName == null) {
>       LOG.error("The resolve call returned null! Using " + 
>           NetworkTopology.DEFAULT_RACK + " for host " + names);
>       networkLocation = NetworkTopology.DEFAULT_RACK;
>     } else {
>       networkLocation = rName.get(0);
>     }
>     return networkLocation;
> {code}
> The line of code that assigns the default rack:
> {code} networkLocation = NetworkTopology.DEFAULT_RACK; {code} 
> can cause a serious problem. If resolve() returns NULL for any reason, the default rack
is assigned as the DN's network location and the DN's registration finishes successfully.
Under these circumstances, we are able to load data into a cluster that is working with
the wrong topology. A wrong topology means that fault domains are not honored.
> For the end user, it means that two data replicas can end up in the same fault domain
and a single failure can cause the loss of two or more replicas. The cluster would be in an inconsistent
state without being aware of it, and everything would appear to work
fine. The only practical way to notice that something went wrong is to look in the log for the
error:
> {code}
> LOG.error("The resolve call returned null! Using " + 
> NetworkTopology.DEFAULT_RACK + " for host " + names);
> {code}
>  
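
The fault-domain problem in the description can be illustrated with a small self-contained sketch. The class, method, and host names here ({{FaultDomainSketch}}, {{distinctRacks}}, {{dn1}}, {{dn2}}) are hypothetical and only compare how many racks a set of replicas spans under two different host-to-rack views; this is not HDFS placement code.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FaultDomainSketch {
    /** Counts the distinct racks covered by the given replica hosts under one view. */
    static int distinctRacks(List<String> replicaHosts, Map<String, String> rackByHost) {
        Set<String> racks = new HashSet<>();
        for (String host : replicaHosts) {
            racks.add(rackByHost.get(host));
        }
        return racks.size();
    }
}
```

Suppose {{dn1}} and {{dn2}} both sit in the physical rack {{/rack1}}, but resolution failed for {{dn2}} and it was registered under {{/default-rack}}. For replicas on {{dn1}} and {{dn2}}, the view the namenode holds spans two racks, while the physical view spans only one, so a single rack failure would take out both replicas.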



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
