hadoop-mapreduce-issues mailing list archives

From "Ravi Gummadi (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-2407) Make Gridmix emulate usage of Distributed Cache files
Date Thu, 12 May 2011 08:19:47 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Ravi Gummadi updated MAPREDUCE-2407:

    Attachment: 2407.patch

Attaching patch that adds emulation of distributed cache load in gridmix simulated jobs.

High level details of what this patch does are:

(1) A new gridmix configuration property, "gridmix.distributed-cache-emulation.enable", is added;
its default value is true. Setting it to false disables emulation of distributed cache load.
Irrespective of this property's setting, with the -generate option, distributed cache files
are generated on HDFS by gridmix.
Distributed Cache Emulation is disabled for the case of '-' as input trace (i.e. the stdin stream
instead of a file).
Distributed Cache Emulation is disabled for the case where <iopath> is on local file
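
The conditions in (1) can be sketched as a single check. This is only an illustration, not code from the patch: the class name, the method `shouldEmulate`, and the use of a plain `Map` in place of a Hadoop configuration object are hypothetical, and the `ioPathIsLocal` flag is one reading of the last condition above.

```java
import java.util.HashMap;
import java.util.Map;

public class DistCacheEmulationGate {
    // Property name and default value (true) are taken from the message above.
    static final String ENABLE_KEY = "gridmix.distributed-cache-emulation.enable";

    // Hypothetical helper: returns true only if emulation is enabled in the
    // configuration, the trace is a file rather than stdin ("-"), and
    // <iopath> is not local.
    static boolean shouldEmulate(Map<String, String> conf,
                                 String tracePath,
                                 boolean ioPathIsLocal) {
        boolean enabled =
            Boolean.parseBoolean(conf.getOrDefault(ENABLE_KEY, "true"));
        boolean traceFromStdin = "-".equals(tracePath);
        return enabled && !traceFromStdin && !ioPathIsLocal;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // Property unset: the default applies.
        System.out.println(shouldEmulate(conf, "trace.json", false));
        // Explicitly disabled.
        conf.put(ENABLE_KEY, "false");
        System.out.println(shouldEmulate(conf, "trace.json", false));
    }
}
```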

(2) Behavior of the option -generate is changed. -generate option means (a) generate input
data in the directory <iopath>/input/ and (b) generate distributed cache data needed for
emulation of distributed cache load of this trace file in the directory
<iopath>/distributedCache/.
For (a), same old GenerateData MR job is used.
For (b), a new MR job GenerateDistCacheData is added, which is run after GenerateData and
before submission of simulated jobs.

With -generate option, (a) it is an error if the <iopath>/input/ directory already exists,
as in the current behavior, and
(b) it is not an error if the <iopath>/distributedCache/ directory already exists; in that
case only the missing distributed cache files for the specific trace file are generated
under <iopath>/distributedCache/. If all the needed distributed cache files are already
there, then submission of the GenerateDistCacheData job is skipped.

Without -generate option, if emulation of distributed cache load is enabled, then gridmix
checks if all the needed distributed cache files are available under <iopath>/distributedCache/
and emits an error if any of the expected files are missing.
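
The two cases above (with and without -generate) can be sketched as follows. This is a hypothetical illustration, not the patch's code: the class and method names are made up, and an in-memory `Set` stands in for the HDFS existence checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DistCachePrecheck {
    // Returns the needed files that do not exist yet, preserving order.
    static List<String> missingFiles(List<String> needed, Set<String> existing) {
        List<String> missing = new ArrayList<>();
        for (String path : needed) {
            if (!existing.contains(path)) {
                missing.add(path);
            }
        }
        return missing;
    }

    // Returns the files the generation job would have to create.
    // With -generate: an empty result means the job can be skipped.
    // Without -generate: any missing file is reported as an error.
    static List<String> plan(boolean generate,
                             List<String> needed,
                             Set<String> existing) {
        List<String> missing = missingFiles(needed, existing);
        if (!generate && !missing.isEmpty()) {
            throw new IllegalStateException(
                "Missing distributed cache files: " + missing);
        }
        return missing;
    }
}
```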

(3) setupDistCacheEmulation : Read the trace file and build a list of distributed cache file
paths and their file sizes. The file paths are the mapped paths on the simulated cluster,
mapped from the original cluster's paths to the simulated cluster's paths using
{code}MD5Hash(filePath+timestamp){code} for public distributed cache files
{code}MD5Hash(filePath+timestamp+username){code} for private distributed cache files.

This list of mapped file paths along with the file sizes is written to a special file
<iopath>/distributedCache/_distCacheFiles.txt and the file name can be configured using

So this means all distributed cache files in the gridmix simulated jobs are public distributed
cache files but for each private distributed cache file of a user of the original cluster
(i.e. from trace file), there will be a different public distributed cache file on gridmix
simulated cluster.
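
The path mapping in (3) might look roughly like the sketch below. It uses the JDK's `MessageDigest` rather than Hadoop's own MD5 utility, and the exact string concatenation, character encoding, and hex output format are assumptions; `DistCachePathMapper`, `md5Hex` and `mappedName` are hypothetical names.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DistCachePathMapper {
    // Returns the MD5 digest of s as a 32-character lowercase hex string.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Public files hash filePath+timestamp; private files also append the
    // user name, so each user's private file maps to a distinct name.
    static String mappedName(String filePath, long timestamp,
                             String user, boolean isPublic) {
        String key = filePath + timestamp + (isPublic ? "" : user);
        return md5Hex(key);
    }
}
```

With this scheme, the same private file referenced by two different users yields two different mapped names, while a public file yields one name regardless of user.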

(4) GenerateDistCacheData : The MR job (launched by gridmix if -generate option is seen) that
generates distributed cache data files on HDFS. Input to this job is the special file _distCacheFiles.txt
that contains the distributed cache file paths and their sizes.
Each map() call generates one distributed cache file.

(5) configureDistCacheFiles : The mapped distributed cache files' paths are set in
the simulated jobs' configurations so that the MapReduce framework takes care of adding the actual
distributed cache load equivalent to the original cluster's distributed cache load.

> Make Gridmix emulate usage of Distributed Cache files
> -----------------------------------------------------
>                 Key: MAPREDUCE-2407
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2407
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/gridmix
>            Reporter: Ravi Gummadi
>            Assignee: Ravi Gummadi
>         Attachments: 2407.patch
> Currently Gridmix emulates disk IO load only. This JIRA is to make Gridmix emulate Distributed
> Cache load as defined by the job-trace.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
