Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Sun, 9 Jun 2013 00:39:20 +0000 (UTC)
From: "Ivan Mitic (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12649621.1369721727521.90241.1370738360402@arcas>
In-Reply-To: <JIRA.12649621.1369721727521@arcas>
References: <JIRA.12649621.1369721727521@arcas>
Subject: [jira] [Updated] (MAPREDUCE-5278) Distributed cache is broken when
 JT staging dir is not on the default FS
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Mitic updated MAPREDUCE-5278:
----------------------------------

    Summary: Distributed cache is broken when JT staging dir is not on the default FS  (was: Perf: Distributed cache is broken when JT staging dir is not on the default FS)

> Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1-win
>         Environment: Windows
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>             Fix For: 1-win
>
>         Attachments: MAPREDUCE-5278.patch
>
>
> Today, the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") is set to point to HDFS, even when another file system (e.g. the Amazon S3 file system or the Windows ASV file system) is the default file system.
> For ASV, this configuration was chosen for a few reasons:
> 1. To prevent leaking the storage account credentials into the user's storage account;
> 2. It uses HDFS for the transient job files, which is good for two reasons: a) it does not flood the user's storage account with irrelevant data/files; b) it leverages HDFS locality for small files.
> However, this approach conflicts with how distributed cache caching works and completely negates the feature's functionality.
> When files are added to the distributed cache (through the files/archives/libjars Hadoop generic options), they are copied to the JobTracker staging dir only if they reside on a file system different from the JobTracker's. Later on, this path is used as a "key" to cache the files locally on the TaskTracker's machine and to avoid localization (download/unzip) of distributed cache files that are already localized.
> In this configuration the caching is completely disabled: we always end up copying the distributed cache files to the JobTracker's staging dir first and then localizing them on the TaskTracker machine.
> This is especially bad for Oozie scenarios, as Oozie uses the distributed cache to distribute Hive/Pig jars throughout the cluster.
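
The interaction described above can be illustrated with a small, self-contained Java sketch. It is a simplified model and not the Hadoop code: the class DistCacheSketch, its methods, the per-job staging layout (<staging root>/<job id>/files/<name>) and the example URIs are all hypothetical and chosen only for illustration. It assumes that two paths are on the same file system when their URI scheme and authority match, and that a tasktracker keys its local cache on the resolved path.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the behaviour described in the issue; not Hadoop source code.
class DistCacheSketch {

    // Assumption: same file system means same URI scheme and authority (case-insensitive).
    public static boolean sameFileSystem(URI a, URI b) {
        return eq(a.getScheme(), b.getScheme()) && eq(a.getAuthority(), b.getAuthority());
    }

    private static boolean eq(String x, String y) {
        return x == null ? y == null : x.equalsIgnoreCase(y);
    }

    // Returns the path tasks localize from: the original file if it is already on the
    // JobTracker's file system, otherwise a per-job copy under the staging root.
    // The "<root>/<jobId>/files/<name>" layout is an illustrative assumption.
    public static URI resolveCachePath(URI file, URI jobTrackerFs, URI stagingRoot, String jobId) {
        if (sameFileSystem(file, jobTrackerFs)) {
            return file;
        }
        String path = file.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        return URI.create(stagingRoot + "/" + jobId + "/files/" + name);
    }

    // Counts how many times one tasktracker localizes the file across several jobs
    // when its local cache is keyed on the resolved path.
    public static int countLocalizations(URI file, URI jobTrackerFs, URI stagingRoot, String... jobIds) {
        Set<URI> localCache = new HashSet<>();
        int localizations = 0;
        for (String jobId : jobIds) {
            URI key = resolveCachePath(file, jobTrackerFs, stagingRoot, jobId);
            if (localCache.add(key)) {
                localizations++; // key not seen before: download/unzip again
            }
        }
        return localizations;
    }

    public static void main(String[] args) {
        URI jtFs = URI.create("hdfs://namenode:8020");
        URI staging = URI.create("hdfs://namenode:8020/staging");
        URI onDefaultFs = URI.create("asv://container/libs/hive-exec.jar");
        URI onHdfs = URI.create("hdfs://namenode:8020/libs/hive-exec.jar");

        // File on the default FS (ASV) while staging is on HDFS: a new key per job.
        System.out.println(countLocalizations(onDefaultFs, jtFs, staging, "job_001", "job_002"));
        // File already on the JobTracker's FS: the key is stable across jobs.
        System.out.println(countLocalizations(onHdfs, jtFs, staging, "job_001", "job_002"));
    }
}
```

In this model the file on the default (non-HDFS) file system is localized once per job, while the file that already lives on the JobTracker's file system is localized only once. That is the effect the description reports for the staging-dir-on-HDFS configuration.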

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira