hadoop-mapreduce-issues mailing list archives

From "Xi Fang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-5278) Perf: Distributed cache is broken when JT staging dir is not on the default FS
Date Tue, 28 May 2013 06:22:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668117#comment-13668117 ]

Xi Fang commented on MAPREDUCE-5278:
------------------------------------

Basically, if a remote file system is reachable from the tasktrackers, we don't have to copy
files on that file system to the job tracker's staging directory (see JobClient#copyRemoteFiles()).

For example, in HDInsight, user storage would be ASV, which is different from HDFS. So by default
these files would be copied to the JT. However, since ASV is supposed to be reachable from the
tasktrackers, these copy operations are unnecessary, and they also disable the dist cache. A proposal
is to add a configuration property (e.g. "mapred.tasktracker.scheme.accessible"). If we specify
a scheme in this property, we won't do the copy operation even if the scheme is not equal
to the scheme of the job tracker's staging dir. For example, in this context, mapred.tasktracker.scheme.accessible=ASV.
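The proposed check could look something like the sketch below. This is a hypothetical standalone helper, not actual JobClient code; the method name needsCopyToStaging and the comma-separated handling of the property value are assumptions for illustration:

```java
import java.net.URI;

/**
 * Sketch of the proposed behavior: skip copying a dist-cache file to the
 * JT staging dir when its URI scheme is listed in the (proposed) property
 * "mapred.tasktracker.scheme.accessible".
 */
public class SchemeCheck {

    // accessibleSchemes is the raw value of the proposed property,
    // e.g. "asv" (a comma-separated list is an assumption here).
    static boolean needsCopyToStaging(URI file, URI stagingDir,
                                      String accessibleSchemes) {
        String fileScheme = file.getScheme();
        // Same file system as the staging dir: no copy (existing behavior).
        if (fileScheme == null
                || fileScheme.equalsIgnoreCase(stagingDir.getScheme())) {
            return false;
        }
        // Proposed: schemes declared reachable from the tasktrackers
        // also skip the copy, so dist-cache caching keeps working.
        if (accessibleSchemes != null) {
            for (String s : accessibleSchemes.split(",")) {
                if (fileScheme.equalsIgnoreCase(s.trim())) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        URI staging = URI.create("hdfs://namenode/staging");
        URI asvFile = URI.create("asv://container@account/libs/job.jar");
        // Without the new property the ASV file gets copied; with it, not.
        System.out.println(needsCopyToStaging(asvFile, staging, null));  // true
        System.out.println(needsCopyToStaging(asvFile, staging, "asv")); // false
    }
}
```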
                
> Perf: Distributed cache is broken when JT staging dir is not on the default FS
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5278
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5278
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1-win
>         Environment: Windows
>            Reporter: Xi Fang
>            Assignee: Xi Fang
>
> Today, we set the JobTracker staging dir ("mapreduce.jobtracker.staging.root.dir") to
point to HDFS even though ASV is the default file system. There are a few reasons why this
config was chosen:
> 1. To prevent leaking the storage account creds to the user's storage account (IOW, keep
job.xml in the cluster).
> 2. It uses HDFS for the transient job files, which is good for two reasons – a) it does
not flood the user's storage account with irrelevant data/files b) it leverages HDFS locality
for small files
> However, this approach conflicts with how distributed cache caching works, completely
negating the feature's functionality.
> When files are added to the distributed cache (through the files/archives/libjars hadoop generic
options), they are copied to the job tracker staging dir only if they reside on a file system
different from the jobtracker's. Later on, this path is used as a "key" to cache the files
locally on the tasktracker's machine and to avoid localization (download/unzip) of the distributed
cache files if they are already localized.
> In our configuration the caching is completely disabled and we always end up copying
dist cache files to the JT staging dir first and localizing them on the tasktracker machine
second.
> This is especially bad for Oozie scenarios, as Oozie uses the dist cache to distribute
Hive/Pig jars throughout the cluster.
> An easy workaround is to configure mapreduce.jobtracker.staging.root.dir in mapred-site.xml
to be on the default FS.
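The workaround above would look something like this in mapred-site.xml; the ASV container and account names in the value are placeholders, not real paths from the issue:

```xml
<!-- Hypothetical mapred-site.xml entry: point the JT staging dir at the
     default FS (ASV here) so dist-cache keys match and caching works. -->
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>asv://CONTAINER@ACCOUNT.blob.core.windows.net/mapred/staging</value>
</property>
```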

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
