hadoop-mapreduce-issues mailing list archives

From "M. C. Srivas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
Date Mon, 16 Aug 2010 15:00:24 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898956#action_12898956 ]

M. C. Srivas commented on MAPREDUCE-1901:

Content-addressing is one way to solve this problem, but it seems like an extremely heavy-weight
one:
   1. more processing to do whenever a file is added to the file-system
   2. reliability issues in keeping the signature consistent with the contents across failures/re-replication/etc.
   3. a repository of signatures in HDFS is yet another single point of failure, and yet another
database that needs to be maintained (recovery code to guarantee no data corruption on a reboot,
scaling as more files are added, backup/restore, HA, etc.)

Looks like there are a variety of simpler approaches possible; a few that come to mind immediately
are listed below in increasing order of complexity.

  1. Use distcp or something similar to copy the files onto local disk whenever a new version
of Hive is released, and set pathnames to that. That is, different versions of a set of files
are kept in different directories, and pathnames are used to distinguish them. For example,
we do not do an md5 check of "/bin/ls" every time we need to run it; we set our pathname
appropriately. If there is a different version of "ls" we prefer to use, say in
"/my/local/bin", then we get it by putting /my/local/bin ahead of other paths in our pathname.
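The version-per-directory idea is essentially $PATH resolution applied to jars. A minimal sketch in Python (the directory layout and jar names here are hypothetical, for illustration only):

```python
import os

# Hypothetical layout: each framework release lives in its own directory,
# e.g. /opt/hive/0.5.0/lib and /opt/hive/0.6.0/lib. The ordered search
# list plays the role of $PATH: the first directory holding the jar wins,
# so a local override directory listed first shadows the shared release.
def resolve_jar(jar_name, search_dirs):
    """Return the first path in search_dirs that contains jar_name."""
    for d in search_dirs:
        candidate = os.path.join(d, jar_name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(jar_name)
```

No signatures are computed anywhere: the version is encoded entirely in the directory name, just as with "/bin/ls".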

  2. Instead of implementing a bulk "getSignatures" call to replace several "get_mtime" calls,
why not implement a bulk "get_mtime" instead?
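The batching idea can be sketched as below. `get_mtime_bulk` and `needs_upload` are hypothetical names; in a real implementation the former would be a single NameNode RPC rather than local stat calls:

```python
import os

def get_mtime_bulk(paths):
    """One logical round trip: return {path: mtime} for every path.
    Stands in for a hypothetical batched NameNode call; HDFS has no
    such RPC today, so plain local stats are used here."""
    return {p: os.path.getmtime(p) for p in paths}

def needs_upload(cached, current):
    """Paths whose mtime is new or changed since the cached snapshot,
    i.e. the only files the job client would have to re-upload."""
    return [p for p, m in current.items() if cached.get(p) != m]
```

The point is purely round-trip count: N files cost one request instead of N, without any signature repository.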

  3. Use a model like AFS with callbacks to implement an on-disk cache that survives reboots
(Dhruba knows AFS very well). In other words, the client acquires a callback from the name-node
for each file it has cached, and HDFS guarantees it will notify the client when the file is
deleted or changed (at which point the callback is revoked and the client must re-fetch the
file). The callback lasts for, say, 1 week, and can be persisted on disk. On a name-node
reboot, the client is responsible for re-establishing the callbacks it already holds (akin to
a block-report). The client can also choose to return callbacks, in order to keep the memory
requirements on the name-node to a minimum. No repository of signatures is needed.
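A toy model of the callback scheme, assuming invented `NameNode`/`Client` classes (not real HDFS code): the client keeps a cached version per file, the name-node revokes the callback when the file changes, and the client re-fetches on its next access.

```python
class NameNode:
    """Illustrative stand-in for the name-node side of the protocol."""
    def __init__(self):
        self.files = {}        # path -> current version of the file
        self.callbacks = {}    # path -> set of clients holding a promise

    def grant_callback(self, client, path):
        # Promise to notify this client if the file changes.
        self.callbacks.setdefault(path, set()).add(client)
        return self.files[path]

    def change_file(self, path):
        self.files[path] = self.files.get(path, 0) + 1
        # Revoke: notify every client that cached this file.
        for client in self.callbacks.pop(path, set()):
            client.revoke(path)

class Client:
    def __init__(self, namenode):
        self.nn = namenode
        self.cache = {}        # path -> version; persisted across reboots

    def fetch(self, path):
        if path in self.cache:           # callback held: no re-fetch needed
            return self.cache[path]
        version = self.nn.grant_callback(self, path)
        self.cache[path] = version
        return version

    def revoke(self, path):
        self.cache.pop(path, None)       # must re-fetch on next access

    def reestablish(self):
        # After a name-node reboot, re-register the callbacks already
        # held on disk, akin to a block-report.
        for path in list(self.cache):
            self.nn.grant_callback(self, path)
```

The name-node's memory cost is one set entry per cached file per client, which clients can bound by returning callbacks they no longer need.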

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
> Currently each Hadoop job uploads the required resources (jars/files/archives) to a new
location in HDFS. Map-reduce nodes involved in executing this job would then download these
resources into local disk.
> In an environment where most of the users are using a standard set of jars and files
(because they are using a framework like Hive/Pig) - the same jars keep getting uploaded and
downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency)
is significant when:
> - the jobs are small (and, conversely, large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in part, by
this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the
same files over and over again. Identifying and caching execution resources by a content signature
(md5/sha) would be a good alternative to have available.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
