hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1901) Jobs should not submit the same jar files over and over again
Date Mon, 16 Aug 2010 09:34:22 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898835#action_12898835

Joydeep Sen Sarma commented on MAPREDUCE-1901:

the proposal is below - rephrases some of the discussions above, addresses some of the comments
around race conditions and points out limitations. Junjie will post a patch tomorrow (which
probably needs some more work).

h4. Background

Hadoop map-reduce jobs commonly require jars, executables, archives and other resources for
task execution on hadoop cluster nodes. A common deployment pattern for Hadoop applications
is that the required resources are deployed centrally by administrators (either on a shared
file system or deployed on standard local file system paths by package management tools).
Users launch Hadoop jobs from these installation points. Applications use apis (-libjars/files/archives)
provided by Hadoop to upload resources (from the installation point) so that they are made
available for task execution. This behavior makes deployment of Hadoop applications very easy
(just use standard package management tools).

As an example, Facebook has a few different Hive installations (of different versions) deployed
on NFS filer. Each has a multitude of jar files - with only some differing across different
Hive versions. Users also maintain a repository of map-reduce scripts and jar files contain
Hive extensions (user defined functions) on a NFS filer. Any installation of Hive can be used
to execute jobs against any of multiple map-reduce clusters. Most of the jar files are also
required locally (by the Hive client) - either for query compilation or for local execution
(either in hadoop local mode or for some special types of queries).

h4. Problems

With the above arrangement - each (non local-mode) Hadoop job will upload all the required
jar files into HDFS. TaskTrackers will download these jars from HDFS (at most once per job)
and check modification times of downloaded files (second task onwards) The following overheads
are observed:

- Job submission latency is impacted because of the need to serially upload multiple jar files
into HDFS. At Facebook - we typically see 5-6 seconds of pause in this stage (depends on how
responsive DFS is on a given day)
- There is some latency in setting up the first task as resources must be downloaded from
HDFS. We have typically observed this to be around 2-3 seconds at Facebook. 
- For subsequent tasks - the latency impact is not as high - but the mtime check adds to general
Namenode pressure.

h4. Observations

- jars and other resources are shared across different jobs and users. there are, in fact,
hardly any resources that are not shared.
- these resources are meant to be immutable 

We would like to use these properties to solve some of the overheads in the current protocol
while retaining the simplicity of the deployment model that exists today.

h4. General Approach

We would like to introduce (for lack of a better term) the notion of Content Addressible Resources
(CAR) that are stored in a central repository in HDFS:

# CAR jars/files/archives are be identified by their content (for example - named using their
md5 checksum).\\
    This allows different jobs to share resources. Each Job can find out whether the resources
required by it are already available in HDFS (by comparing the md5 signatures of their resources
against the contents in the CAR repository).\\
# Content Addressible resources (once uploaded) are immutable. They can only be garbage collected
(by server side daemons).
    This allows TaskTrackers to skip mtime checks on such resources.

The CAR functionality is exposed to clients in two ways:

- a boolean configuration option (defaulting to false) to indicate that resources added via
-libjars/files/archives options are content addressible
- enhancing the Distributed Cache api to mark specific files/archives as CAR (similar to how
specific files can be marked public)

h4. Protocol

Assume a jobclient has a CAR file _F_ on local disk to be uploaded for task execution. Here's
approximately a trace of what happens from the beginning of the job to it's end:

# Client computes the md5 signature of F (= _md5_F_)
#- One can additionally provide an option to skip this step - the md5 can be precomputed and
stored in a file named _F.md5_ stored alongside _F_. The client can look for and use the contents
of this file as the md5 sum.\\
# The client fetches (in a single filesystem call) the list of md5 signatures (and their 'atime'
attribute among other things) of the CAR repository\\
# If the _md5_F_ already exists in the CAR repository - then the client simply uses the URI
of the existing copy as the resource to be downloaded on the TaskTrackers\\
#- If the atime of _md5_F_ is older than 1 day, then the client updates the atime (See #6)\\
# If _md5_F_ does not exist in the CAR repository then Client uploads it to the CAR repository
using _md5_F_ as the name\\
# The TaskTracker, on being requested to run a task requiring CAR resource _md5_F_ checks
whether _md5_F_ is localized.\\
#- If _md5_F_ is already localized - then nothing more needs to be done. the localized version
is used by the Task\\
#- If _md5_F_ is not localized - then its fetched from the CAR repository\\
#  A garbage collector (running on the server side - preferably the JT) scans the CAR repository
periodically looking for and deleting resources whose atime is older than N days. This is
similar to the TrashEmptier in the Namenode.\\
# The number _N_ is configurable. The protocol guarantees that no job less than _N-1_ days
in length will have it's resources garbage collected before it finishes (because of the update
atime step in #3). In practice, the total size of the CAR repository is likely to be very
small (relative to other contents in HDFS) and _N_ can be set to a very high number. 

In this protocol - assuming that most jobs are using the same resources - the vast majority
of job submissions make only one file system call (to list the CAR repository on the job client).
Most task executions do not require any calls to the file system (for purposes of localization).
Note that uploads to the CAR repository will also be rare (in steady state).

h4. Notes

# The garbage collection of localized resources on TaskTrackers happens the same as today
(for resource downloaded via distributed cache).  In particular, no synchronization is required
between garbage collection of localized resources and those of the backing URIs in hdfs.\\
# In step #4 - in the v1 implementation, the client is responsible for computing the md5.
If the client is malicious - it can spoof the md5 (of important jars) and upload malicious
code thereby affecting the execution of other clients.\\
# In the v1 implementation - the CAR repository is implemented as a fixed directory in HDFS.
The clients must have write permission to the CAR directory (to upload new resources into
it). A malicious client can then delete or modify resources before they are eligible for garbage
collection - potentially affecting running jobs.

The latter two issues can be solved by having a server side agent control the addition and
deletion of resources to the CAR repository. However this has not been implemented in v1.
The initial implementation only suffices for environments that can make the assumption of
non-malicious clients - but can be extended to cover more security conscious use cases in
the future (with the attendant burden of more server side apis).

> Jobs should not submit the same jar files over and over again
> -------------------------------------------------------------
>                 Key: MAPREDUCE-1901
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Joydeep Sen Sarma
>         Attachments: 1901.PATCH
> Currently each Hadoop job uploads the required resources (jars/files/archives) to a new
location in HDFS. Map-reduce nodes involved in executing this job would then download these
resources into local disk.
> In an environment where most of the users are using a standard set of jars and files
(because they are using a framework like Hive/Pig) - the same jars keep getting uploaded and
downloaded repeatedly. The overhead of this protocol (primarily in terms of end-user latency)
is significant when:
> - the jobs are small (and conversantly - large in number)
> - Namenode is under load (meaning hdfs latencies are high and made worse, in part, by
this protocol)
> Hadoop should provide a way for jobs in a cooperative environment to not submit the same
files over and again. Identifying and caching execution resources by a content signature (md5/sha)
would be a good alternative to have available.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message