Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 10 Mar 2017 02:41:38 +0000 (UTC)
From: "Huangkaixuan (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.13048455.1488786629000.29275.1489113698013@Atlassian.JIRA>
In-Reply-To: <JIRA.13048455.1488786629000@Atlassian.JIRA>
References: <JIRA.13048455.1488786629000@Atlassian.JIRA> <JIRA.13048455.1488786629759@jira-lw-us.apache.org>
Subject: [jira] [Commented] (YARN-6289) Fail to achieve data locality when
 runing MapReduce and Spark on HDFS
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Fri, 10 Mar 2017 02:41:42 -0000


    [ https://issues.apache.org/jira/browse/YARN-6289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904296#comment-15904296 ]

Huangkaixuan commented on YARN-6289:
------------------------------------

Thanks [~leftnoteasy]
1. MR can get the locations of a block through FileSystem.getFileBlockLocations. MR applications usually call FileSystem.getFileBlockLocations to compute splits, but I haven't seen it used in the default YARN scheduling policy (FIFO).
2. All nodes in the experiment are in the same rack, so all tasks are rack-local. Rack awareness will not affect the experimental results.
3. The task failed to achieve data locality even though no other job was running on the cluster at the same time. It seems that YARN didn't attempt to allocate containers with data locality in the default scheduling mode.
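
To make the three locality levels in points 2 and 3 concrete, here is a minimal sketch of how a placement could be classified and how a locality-aware choice would differ from an arbitrary one. This is an illustrative model only, not YARN's scheduler code: the node names (dn1..dn6), the replica placement, and the helper functions `locality_level` and `pick_node` are all hypothetical. The only facts taken from the experiment are that there are six data nodes, that they share one rack, and that the block has two replicas.

```python
from typing import Dict, List, Optional

NODE_LOCAL, RACK_LOCAL, OFF_SWITCH = "NODE_LOCAL", "RACK_LOCAL", "OFF_SWITCH"
# Lower rank means a better placement.
PREFERENCE = {NODE_LOCAL: 0, RACK_LOCAL: 1, OFF_SWITCH: 2}


def locality_level(task_node: str, replica_nodes: List[str],
                   rack_of: Dict[str, str]) -> str:
    """Classify where a task runs relative to the replicas of its input block."""
    if task_node in replica_nodes:
        return NODE_LOCAL
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return RACK_LOCAL
    return OFF_SWITCH


def pick_node(free_nodes: List[str], replica_nodes: List[str],
              rack_of: Dict[str, str]) -> Optional[str]:
    """Hypothetical locality-aware choice: node-local, then rack-local, then any."""
    if not free_nodes:
        return None
    return min(free_nodes,
               key=lambda n: PREFERENCE[locality_level(n, replica_nodes, rack_of)])


if __name__ == "__main__":
    # Assumed layout: six data nodes in a single rack, block replicated on two.
    rack_of = {f"dn{i}": "/default-rack" for i in range(1, 7)}
    replicas = ["dn2", "dn5"]  # hypothetical replica placement
    print(locality_level("dn1", replicas, rack_of))       # RACK_LOCAL
    print(pick_node(["dn1", "dn2", "dn3"], replicas, rack_of))  # dn2
```

With every node in one rack, the worst possible outcome in this model is RACK_LOCAL, which is why rack awareness cannot explain the result; the open question is only whether the node holding a replica is preferred when it is free.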


> Fail to achieve data locality when runing MapReduce and Spark on HDFS
> ---------------------------------------------------------------------
>
>                 Key: YARN-6289
>                 URL: https://issues.apache.org/jira/browse/YARN-6289
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: distributed-scheduling
>         Environment: Hardware configuration
> CPU: 2 x Intel(R) Xeon(R) E5-2620 v2 @ 2.10GHz /15M Cache 6-Core 12-Thread
> Memory: 128GB Memory (16x8GB) 1600MHz
> Disk: 600GBx2 3.5-inch with RAID-1
> Network bandwidth: 968Mb/s
> Software configuration
> Spark-1.6.2    Hadoop-2.7.1
>            Reporter: Huangkaixuan
>         Attachments: Hadoop_Spark_Conf.zip, YARN-DataLocality.docx
>
>
> When running a simple wordcount experiment on YARN, I noticed that the task failed to achieve data locality, even though no other job was running on the cluster at the same time. The experiment was done on a 7-node cluster (1 master, 6 data nodes/node managers), and the input of the wordcount job (both Spark and MapReduce) is a single-block file in HDFS that is two-way replicated (replication factor = 2). I ran wordcount on YARN 10 times. The results show that only 30% of tasks achieved data locality, which looks like the result of random task placement. The experiment details are in the attachment; feel free to reproduce the experiments.
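
The "random placement" reading can be checked with a little arithmetic: with 6 data nodes and 2 replicas of the single block, a task placed uniformly at random lands on a replica-holding node with probability 2/6, about 33%, which is close to the observed 30%. The sketch below computes that figure and confirms it with a small simulation. It assumes uniform random placement of both replicas and tasks, which is a simplification for illustration and not a statement about how HDFS or YARN actually place them; the function names are hypothetical.

```python
import random
from fractions import Fraction

DATA_NODES = 6    # data nodes / node managers, from the report
REPLICATION = 2   # HDFS replication factor, from the report


def expected_local_fraction(nodes: int, replicas: int) -> Fraction:
    """Chance that a uniformly random node holds one of the block's replicas."""
    return Fraction(replicas, nodes)


def simulate_random_placement(nodes: int, replicas: int,
                              trials: int, seed: int = 0) -> float:
    """Place one task per trial on a random node and count node-local hits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        replica_nodes = set(rng.sample(range(nodes), replicas))  # where the block lives
        task_node = rng.randrange(nodes)                         # where the task lands
        hits += task_node in replica_nodes
    return hits / trials


if __name__ == "__main__":
    print(expected_local_fraction(DATA_NODES, REPLICATION))  # 1/3
    print(simulate_random_placement(DATA_NODES, REPLICATION, 100_000))
```

A scheduler that honored node locality on an otherwise idle cluster would be expected to land well above one third, so a measured rate near 1/3 is consistent with locality being ignored.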


