hadoop-yarn-issues mailing list archives

From "Huangkaixuan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6289) Fail to achieve data locality when running MapReduce and Spark on HDFS
Date Fri, 10 Mar 2017 02:41:38 GMT

    [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904296#comment-15904296 ]

Huangkaixuan commented on YARN-6289:
------------------------------------

Thanks [~leftnoteasy]
1. MR can get the locations of a block through FileSystem.getFileBlockLocations. MR applications
normally use FileSystem.getFileBlockLocations to compute splits (see the sketch after this
list), but I haven't seen the default YARN scheduling policy (FIFO) make use of that information.
2. All nodes in the experiment are in the same rack, and all tasks are rack-local, so rack
awareness does not affect the experimental results.
3. The tasks failed to achieve data locality even though no other job was running on the
cluster at the same time. It seems that YARN did not attempt to allocate containers with
data locality in its default scheduling mode.
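
For reference, below is a minimal sketch of the block-location lookup mentioned in point 1,
assuming a hypothetical input path and a Hadoop client Configuration picked up from the
classpath; it is the same FileSystem.getFileBlockLocations call that split computation
relies on.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDump {
  public static void main(String[] args) throws Exception {
    // Hypothetical path to the single-block wordcount input file.
    Path input = new Path("/user/test/wordcount-input.txt");

    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(input);

    // One BlockLocation per block; with replication factor = 2,
    // getHosts() returns the two datanodes holding the replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
    fs.close();
  }
}

Run against the experiment's single-block input, this should print the two datanodes holding
the replicas; a node-local container would have to land on one of them.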


> Fail to achieve data locality when running MapReduce and Spark on HDFS
> ----------------------------------------------------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-scheduling
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz / 15M cache, 6-core 12-thread
> Memory: 128 GB (16 x 8 GB) 1600 MHz
> Disk: 2 x 600 GB 3.5-inch, RAID-1
> Network bandwidth: 968 Mb/s
> Software configuration
> Spark-1.6.2, Hadoop-2.7.1
>            Reporter: Huangkaixuan
>         Attachments: Hadoop_Spark_Conf.zip, YARN-DataLocality.docx
>
>
> When running a simple wordcount experiment on YARN, I noticed that the tasks failed to
> achieve data locality, even though no other job was running on the cluster at the same
> time. The experiment was done on a 7-node cluster (1 master, 6 data nodes/node managers),
> and the input of the wordcount job (both Spark and MapReduce) is a single-block file in
> HDFS which is two-way replicated (replication factor = 2). I ran wordcount on YARN 10
> times. The results show that only 30% of tasks achieved data locality, which looks like
> the result of random task placement. The experiment details are in the attachment; feel
> free to reproduce the experiments.
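
As a companion to the discussion above, here is a hedged sketch of how an ApplicationMaster
expresses locality preferences to the ResourceManager through AMRMClient. The hostnames are
hypothetical, and the fragment only does real work inside a registered AM, but it shows the
node/rack lists and the relaxLocality flag that the scheduler is free to honor or relax.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LocalityRequestSketch {
  public static void main(String[] args) {
    // Hypothetical hostnames, e.g. taken from getFileBlockLocations() above.
    String[] nodes = {"datanode-1", "datanode-2"};
    String[] racks = {"/default-rack"};

    Resource capability = Resource.newInstance(1024, 1); // 1 GB, 1 vcore
    Priority priority = Priority.newInstance(0);

    // relaxLocality = true (the default) allows the scheduler to fall back
    // to rack-local or off-switch containers when the preferred nodes are
    // busy, so a node-local request does not guarantee node-local placement.
    ContainerRequest request =
        new ContainerRequest(capability, nodes, racks, priority, true);

    AMRMClient<ContainerRequest> client = AMRMClient.createAMRMClient();
    client.init(new YarnConfiguration());
    client.start();
    // A real AM would call registerApplicationMaster(...) first and then
    // drive an allocate() heartbeat loop to receive the containers.
    client.addContainerRequest(request);
    client.stop();
  }
}

If the cluster runs the CapacityScheduler, yarn.scheduler.capacity.node-locality-delay
(number of scheduling opportunities missed before a node-local request is relaxed to
rack-local) is one knob worth checking when node-local rates look random.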



