hadoop-yarn-issues mailing list archives

From "Huangkaixuan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6289) Fail to achieve data locality when running MapReduce and Spark on HDFS
Date Sat, 11 Mar 2017 05:21:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906088#comment-15906088 ]

Huangkaixuan commented on YARN-6289:
------------------------------------

Thanks [~leftnoteasy]

For #1 - Can you explain a little more? The answer is not clear; it should state conclusively
that MR is using FileSystem.getFileBlockLocations but YARN is not honoring locality in its
default scheduling mode.
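
For context, here is how we are checking block placement on our side. This is a minimal
sketch against the public FileSystem API (the class name and argument handling are my own,
not from Hadoop):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical probe, not part of Hadoop: prints where each block of a file
// lives, using the same FileSystem.getFileBlockLocations call that MR input
// formats rely on when computing split locations.
public class BlockLocationProbe {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]); // e.g. the single-block wordcount input
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(input);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
  }
}
{code}

If this prints the expected two datanodes for our file, the locations are reaching the
splits, and the question stays on the scheduler side.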

For #2 - Since the data is all rack-local, we do not expect this experiment to help. Is
there a reason you think it might?

For #3 - There were no other jobs running on the cluster at the same time, so we expected
to get 100% locality every time. Can you please explain how to achieve data locality
here?
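
For #3, we will also double-check the scheduler's delay-scheduling settings on our side.
Assuming the CapacityScheduler is in use, I believe the relevant knob is
yarn.scheduler.capacity.node-locality-delay; a sketch of the capacity-scheduler.xml entry
(the value shown is just the documented default, not a recommendation):

{code:xml}
<!-- Number of missed scheduling opportunities the CapacityScheduler
     tolerates before relaxing a node-local request to rack-local.
     40 is the default I see documented for Hadoop 2.7.x. -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <value>40</value>
</property>
{code}

On the Spark side, I assume spark.locality.wait plays the analogous role, controlling how
long the scheduler waits for a node-local slot before falling back to rack-local.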

> Fail to achieve data locality when running MapReduce and Spark on HDFS
> ----------------------------------------------------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-scheduling
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz / 15M cache, 6-core/12-thread
> Memory: 128GB (16 x 8GB) 1600MHz
> Disk: 2 x 600GB 3.5-inch, RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2, Hadoop-2.7.1
>            Reporter: Huangkaixuan
>         Attachments: Hadoop_Spark_Conf.zip, YARN-DataLocality.docx
>
>
> When running a simple wordcount experiment on YARN, I noticed that tasks failed to
achieve data locality even though no other job was running on the cluster at the same time.
The experiment was done on a 7-node cluster (1 master, 6 datanodes/node managers), and the
input of the wordcount job (both Spark and MapReduce) is a single-block file in HDFS that
is two-way replicated (replication factor = 2). I ran wordcount on YARN 10 times. The results
show that only 30% of tasks achieved data locality, which is roughly what random task
placement would yield (2 replicas across 6 nodes ≈ 33%). The experiment details are in the
attachment; feel free to reproduce the experiments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

