hadoop-mapreduce-issues mailing list archives

From "Xi Fang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5278) Distributed cache is broken when JT staging dir is not on the default FS
Date Mon, 17 Jun 2013 21:54:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13686060#comment-13686060
] 

Xi Fang commented on MAPREDUCE-5278:
------------------------------------

Thanks Bikas for your comments. For your question: "Is the following code (marked below)
continuing to copy stuff to the default fs (fs) when the newPath points to a different filesystem?"
No. Basically, the original code does this: if the JT staging dir is not on the default FS
(for example, in our context the default FS is ASV), copyRemoteFiles() copies the files from
ASV to the JT staging dir. Note that these files are specified using generic options.
After our change, when ASV is marked as "accessible" by specifying "mapreduce.client.accessible.remote.schemes",
copyRemoteFiles() no longer copies the files from ASV to the JobTracker. It directly returns
the path of the file, denoted by "newPath". In addition, no copy operation happens
in addArchiveToClassPath().
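
To illustrate, here is a minimal self-contained sketch of the decision described above. The class and method names are hypothetical (this is not the actual JobClient code); it only models the check: a remote file is copied to the staging dir only when its scheme is neither the staging dir's scheme nor listed in "mapreduce.client.accessible.remote.schemes".

```java
import java.net.URI;
import java.util.Set;

// Hypothetical sketch of the copy-vs-reuse decision; names are illustrative.
public class AccessibleSchemeCheck {
    static boolean shouldCopyToStaging(URI file, URI stagingDir,
                                       Set<String> accessibleSchemes) {
        String scheme = file.getScheme();
        if (scheme == null || scheme.equals(stagingDir.getScheme())) {
            return false; // already on the staging dir's file system
        }
        // New behavior: schemes listed as accessible are reachable from the
        // cluster, so the original path is returned and used as-is.
        return !accessibleSchemes.contains(scheme);
    }

    public static void main(String[] args) {
        Set<String> accessible = Set.of("asv");
        URI staging = URI.create("hdfs://nn:8020/tmp/staging");
        // ASV is accessible: no copy, original path is kept as the cache key
        System.out.println(shouldCopyToStaging(
            URI.create("asv://container/path/lib.jar"), staging, accessible)); // false
        // An unlisted scheme still triggers the copy to the staging dir
        System.out.println(shouldCopyToStaging(
            URI.create("s3://bucket/lib.jar"), staging, accessible));          // true
    }
}
```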
                
> Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1-win
>         Environment: Windows
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>             Fix For: 1-win
>
>         Attachments: MAPREDUCE-5278.2.patch, MAPREDUCE-5278.patch
>
>
> Today, the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") is set to
point to HDFS, even when another file system (e.g. the Amazon S3 file system or the Windows ASV
file system) is the default file system.
> For ASV, this configuration was chosen for a few reasons:
> 1. It prevents leaking the storage account credentials to the user's storage account;
> 2. It uses HDFS for the transient job files, which is good for two reasons: a) it does
not flood the user's storage account with irrelevant data/files, and b) it leverages HDFS locality
for small files.
> However, this approach conflicts with how distributed cache caching works, completely
negating the feature's functionality.
> When files are added to the distributed cache (through the -files/-archives/-libjars Hadoop
generic options), they are copied to the JobTracker staging dir only if they reside on a file
system different from the JobTracker's. Later on, this path is used as a "key" to cache the files
locally on the TaskTracker's machine, and to avoid localization (download/unzip) of the distributed
cache files if they are already localized.
> In this configuration the caching is completely disabled, and we always end up copying
dist cache files to the JobTracker's staging dir first and localizing them on the TaskTracker
machine second.
> This is especially bad for Oozie scenarios, as Oozie uses the distributed cache to populate
Hive/Pig jars throughout the cluster.
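
The caching conflict in the quoted description can be sketched as follows. This is an illustrative toy (the names are hypothetical, not Hadoop code): because the localized-cache key is the file's path, copying a file into a per-job staging dir gives it a different key on every job submission, so the TaskTracker never gets a cache hit.

```java
import java.net.URI;

// Hypothetical sketch of the cache-key problem; names are illustrative.
public class CacheKeyDemo {
    // Copying into a per-job staging dir yields a job-specific path,
    // which then serves as the localization cache key.
    static URI stagePath(URI original, String jobId) {
        String name = original.getPath()
                .substring(original.getPath().lastIndexOf('/') + 1);
        return URI.create("hdfs://nn:8020/staging/" + jobId + "/" + name);
    }

    public static void main(String[] args) {
        URI jar = URI.create("asv://container/libs/hive-exec.jar");
        URI key1 = stagePath(jar, "job_001");
        URI key2 = stagePath(jar, "job_002");
        // Different key per job: every submission localizes the file again.
        System.out.println(key1.equals(key2)); // false
        // Using the original ASV path directly would keep the key stable
        // across jobs, allowing the TaskTracker cache to be reused.
    }
}
```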

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
