hadoop-mapreduce-dev mailing list archives

From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
Date Wed, 30 Jun 2010 19:22:50 GMT
Jobs should not submit the same jar files over and over again

                 Key: MAPREDUCE-1901
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Joydeep Sen Sarma

Currently, each Hadoop job uploads its required resources (jars/files/archives) to a new location
in HDFS. The map-reduce nodes executing the job then download these resources to local disk.

In an environment where most users share a standard set of jars and files (because they are
using a framework like Hive/Pig), the same jars are uploaded and downloaded repeatedly. The
overhead of this protocol (primarily in terms of end-user latency) is significant when:
- the jobs are small (and, consequently, large in number)
- the Namenode is under load (meaning HDFS latencies are high and made worse, in part, by this)

Hadoop should provide a way for jobs in a cooperative environment to avoid submitting the same
files over and over again. Identifying and caching execution resources by a content signature
(md5/sha) would be a good alternative to have available.
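The content-signature idea above can be sketched as follows. This is a minimal illustration, not the proposed implementation: the `ResourceCache` class, the `/cache/` path layout, and the in-memory `Map` standing in for a shared HDFS cache directory are all assumptions made for the example. The point is that a resource's cache location is derived from a hash of its bytes, so a second job submitting an identical jar finds it already cached and skips the upload.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Sketch of content-addressed caching of job resources. Assumption: a real
// implementation would use a shared HDFS directory; a Map stands in for it here.
public class ResourceCache {
    private final Map<String, byte[]> store = new HashMap<>(); // stand-in for the shared cache dir
    public int uploads = 0; // counts actual transfers, for illustration

    // Content signature: hex-encoded SHA-256 of the resource bytes.
    static String signature(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // SHA-256 is available on every JVM
        }
    }

    // Submit a resource: upload only if no resource with this signature is
    // already cached; every submitter of identical bytes gets the same path.
    public String submit(byte[] jarContent) {
        String sig = signature(jarContent);
        if (!store.containsKey(sig)) { // only the first submitter pays the upload cost
            store.put(sig, jarContent);
            uploads++;
        }
        return "/cache/" + sig; // hypothetical cache path layout
    }
}
```

With this scheme, a thousand Hive jobs shipping the same framework jar would trigger one upload instead of a thousand, and task nodes could likewise keep a local copy keyed by the same signature.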

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
