hadoop-yarn-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-1226) Inconsistent hostname leads to low data locality on IPv6 hosts
Date Tue, 24 Feb 2015 21:08:05 GMT

     [ https://issues.apache.org/jira/browse/YARN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer updated YARN-1226:
    Labels: ipv6  (was: )

> Inconsistent hostname leads to low data locality on IPv6 hosts
> --------------------------------------------------------------
>                 Key: YARN-1226
>                 URL: https://issues.apache.org/jira/browse/YARN-1226
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>    Affects Versions: 0.23.3, 2.0.0-alpha, 2.1.0-beta
>         Environment: Linux, IPv6
>            Reporter: Kaibo Zhou
>              Labels: ipv6
> When I run a MapReduce job that uses TableInputFormat to scan an HBase table on a YARN cluster
with 140+ nodes, I consistently get very low data locality, around 0~10%.
> The scheduler is the Capacity Scheduler. HBase and Hadoop are co-located in the cluster:
NodeManager, DataNode and HRegionServer run on the same node.
> The reason for the low data locality is that most machines in the cluster use IPv6 and only a few
use IPv4. NodeManager uses "InetAddress.getLocalHost().getHostName()" to get the host name,
but the result of this call depends on whether IPv4 or IPv6 is in use; see ["InetAddress.getLocalHost().getHostName()
returns FQDN"|http://bugs.sun.com/view_bug.do?bug_id=7166687].
> On machines with IPv4, NodeManager gets the hostname as: search042097.sqa.cm4.site.net
> But on machines with IPv6, NodeManager gets the hostname as: search042097.sqa.cm4
> If the JVM is run with IPv6 disabled (-Djava.net.preferIPv4Stack=true), it returns search042097.sqa.cm4.site.net.
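The difference described above can be checked on a single node with a small probe. This is only a sketch: the class name HostnameProbe and its two helper methods are hypothetical and are not part of Hadoop. It prints what InetAddress.getLocalHost() reports as the host name and as the canonical host name, together with the current value of java.net.preferIPv4Stack. The actual output depends on the host's resolver configuration, so no expected output is shown.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical probe class, not part of Hadoop or HBase.
class HostnameProbe {

    // Same call the report attributes to NodeManager.
    // Falls back to "unknown" if the local host cannot be resolved.
    static String localHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    // Canonical (fully qualified, where the resolver can supply it) name,
    // shown for comparison. Falls back to "unknown" on resolution failure.
    static String canonicalHostName() {
        try {
            return InetAddress.getLocalHost().getCanonicalHostName();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        // null means the property was not set on the command line.
        System.out.println("java.net.preferIPv4Stack = "
                + System.getProperty("java.net.preferIPv4Stack"));
        System.out.println("getHostName()            = " + localHostName());
        System.out.println("getCanonicalHostName()   = " + canonicalHostName());
    }
}
```

Running the compiled class once as `java HostnameProbe` and once as `java -Djava.net.preferIPv4Stack=true HostnameProbe` on an affected node should show whether the two modes report different names there.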
> ----
> For the MapReduce job that scans the HBase table, the InputSplit contains node locations as
[FQDNs|http://en.wikipedia.org/wiki/FQDN], e.g. search042097.sqa.cm4.site.net, because in HBase
the RegionServers' hostnames are assigned by the HMaster. The HMaster communicates with the RegionServers
and gets each region server's hostname using Java NIO: clientChannel.socket().getInetAddress().getHostName().
> Also see the startup log of region server:
> 13:06:21,200 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us
hostname to use. Was=search042024.sqa.cm4, Now=search042024.sqa.cm4.site.net
> ----
> As you can see, most machines in the YARN cluster, which use IPv6, get the short hostname, but
HBase always gets the full hostname, so the hosts cannot be matched (see RMContainerAllocator::assignToMap). This
can lead to poor locality.
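The mismatch can be illustrated with a minimal sketch. It assumes that node-local assignment looks up pending map tasks by the exact host string; the class LocalityLookup, its methods, and the task id are hypothetical and are not the actual RMContainerAllocator code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of exact-string host matching; not Hadoop code.
class LocalityLookup {

    // Pending map tasks keyed by the host string carried in the InputSplit.
    private final Map<String, List<String>> tasksByHost = new HashMap<>();

    void addTask(String splitHost, String taskId) {
        tasksByHost.computeIfAbsent(splitHost, h -> new ArrayList<>()).add(taskId);
    }

    // Exact lookup by the host name the node reported; empty list if no match.
    List<String> localTasks(String nodeHost) {
        return tasksByHost.getOrDefault(nodeHost, Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        LocalityLookup lookup = new LocalityLookup();
        // HBase supplies the FQDN as the split location.
        lookup.addTask("search042097.sqa.cm4.site.net", "task_0001");

        // Node reporting the FQDN: the task is found.
        System.out.println(lookup.localTasks("search042097.sqa.cm4.site.net")); // prints [task_0001]
        // Node reporting the short name: no match, so no node-local assignment.
        System.out.println(lookup.localTasks("search042097.sqa.cm4"));          // prints []
    }
}
```

Under this assumption, a node that reports the short name never matches a split location given as an FQDN, even though both strings refer to the same machine.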
> After I used java.net.preferIPv4Stack to force IPv4 in YARN, I got 70+% data locality
in the cluster.
> Thanks,
> Kaibo

This message was sent by Atlassian JIRA
