hadoop-mapreduce-issues mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
Date Fri, 13 Aug 2010 19:26:23 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898376#action_12898376 ]

Joydeep Sen Sarma commented on MAPREDUCE-1901:
----------------------------------------------

> if the jar changed resulting in different tasks of the same job getting different data, rendering debugging impossible!

prior to the security-related functionality (and even with it, if the user is capricious enough)
- i think this was very much possible. the only contract the TTs seem to follow today is that
the launched task sees the latest possible version of a required resource (jar/file etc.).
there's no contract that says each task sees an identical version of each resource. the mtime
check only seems to reinforce this notion - that resources could be changing underneath while
the job is running.
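
(illustrative only, not part of the original comment: a minimal sketch of the kind of mtime
lookup described above, using the public Hadoop FileSystem API; the staged resource path is
hypothetical)

// Illustrative sketch only: reading a staged resource's mtime from HDFS
// with the public FileSystem API. The path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ResourceMtimeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical location of a job resource staged in HDFS.
    Path resource = new Path("/user/joe/.staging/job_201008130001/job.jar");

    // A TT-style freshness check compares this timestamp against the
    // locally cached copy; a newer mtime means the local copy is stale.
    FileStatus status = fs.getFileStatus(resource);
    long mtime = status.getModificationTime();
    System.out.println(resource + " last modified at " + mtime);
  }
}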

that said - there will likely be security holes in our current scheme. we will hopefully post
a spec and trunk patch later today (or over the weekend).

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
>
>
> Currently each Hadoop job uploads the required resources (jars/files/archives) to a new location in HDFS. The map-reduce nodes involved in executing the job then download these resources to local disk.
> In an environment where most users are using a standard set of jars and files (because they are using a framework like Hive/Pig), the same jars keep getting uploaded and downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency) is significant when:
> - the jobs are small (and, conversely, large in number)
> - the Namenode is under load (meaning HDFS latencies are high and made worse, in part, by this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same files over and over again. Identifying and caching execution resources by a content signature (md5/sha) would be a good alternative to have available.
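
(illustrative only, not part of the issue text above: a minimal sketch of the content-signature
idea - hash the local jar, then upload only if no byte-identical copy already exists under a
shared, hash-keyed cache directory; the cache root and paths are hypothetical)

// Illustrative sketch only: cache a job jar in HDFS keyed by its MD5
// content signature, uploading only when no identical copy exists.
// The cache root "/shared/jobjar-cache" is hypothetical.
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ContentAddressedJarCache {

  static String md5Hex(String localFile) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    byte[] buf = new byte[8192];
    try (InputStream in = new FileInputStream(localFile)) {
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String localJar = args[0];                       // e.g. a framework jar
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // The cache location is derived from the content signature, so two jobs
    // submitting byte-identical jars resolve to the same HDFS path.
    String signature = md5Hex(localJar);
    Path cached = new Path("/shared/jobjar-cache/" + signature + "/job.jar");

    if (!fs.exists(cached)) {
      // First submitter pays the upload cost; later jobs reuse the copy.
      fs.copyFromLocalFile(new Path(localJar), cached);
    }
    System.out.println("using cached jar at " + cached);
  }
}

(a real implementation would also have to handle concurrent uploads and permissions, which is
where the security discussion in the comment above comes in)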

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

