Subject: svn commit: r1036409 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/gridmix.xml
From: vinodkv@apache.org
To: mapreduce-commits@hadoop.apache.org
Date: Thu, 18 Nov 2010 11:12:48 -0000
Message-Id: <20101118111248.ACF9723888A6@eris.apache.org>

Author: vinodkv
Date: Thu Nov 18 11:12:48 2010
New Revision: 1036409

URL: http://svn.apache.org/viewvc?rev=1036409&view=rev
Log:
MAPREDUCE-1931. Gridmix forrest documentation. Contributed by Ranjit Mathew.

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=1036409&r1=1036408&r2=1036409&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Thu Nov 18 11:12:48 2010
@@ -186,6 +186,8 @@ Release 0.22.0 - Unreleased
     MAPREDUCE-2167. Faster directory traversal for raid node. (Ramkumar
    Vadali via schen)

+    MAPREDUCE-1931. Gridmix forrest documentation. (Ranjit Mathew via vinodkv)
+
   OPTIMIZATIONS

     MAPREDUCE-1354. Enhancements to JobTracker for better performance and

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml?rev=1036409&r1=1036408&r2=1036409&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml Thu Nov 18 11:12:48 2010
@@ -15,150 +15,542 @@
 See the License for the specific language governing permissions and
 limitations under the License.
-->

Gridmix

Overview

Gridmix is a benchmark for live clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

There exist three versions of the Gridmix tool. This document discusses the third (checked into contrib), distinct from the two checked into the benchmarks subdirectory. While the first two versions of the tool included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. In support of a broader range of deployments and finer-tuned job mixes, this version of the tool will attempt to model the resource profiles of production jobs to identify bottlenecks, guide development, and serve as a replacement for the existing gridmix benchmarks.

Usage

To run Gridmix, one requires a job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see the related documentation). Gridmix also requires input data from which the synthetic jobs will draw bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If one is running on a new cluster, an optional step generating input data may precede the run.


Basic command line usage:

bin/mapred org.apache.hadoop.mapred.gridmix.Gridmix [-generate <MiB>] <iopath> <trace>

The -generate parameter accepts standard size suffixes, e.g. 100g will generate 100 * 2^30 bytes (i.e. 100 GiB). The <iopath> parameter is the destination directory for generated data and/or the directory from which input data will be read. The <trace> parameter is a path to a job trace. The following configuration parameters are also accepted in the standard idiom, before other Gridmix parameters.

Configuration parameters

Parameter: gridmix.output.directory
Description: The directory into which output will be written. If specified, the iopath will be relative to this parameter.
Notes: The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run.

Parameter: gridmix.client.submit.threads
Description: The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace.
Notes: Splits are pregenerated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so one should raise this cautiously.

Parameter: gridmix.client.pending.queue.depth
Description: The depth of the queue of job descriptions awaiting split generation.
Notes: The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this.

Parameter: gridmix.min.key.length
Description: The key size for jobs submitted to the cluster.
Notes: While this is clearly a job-specific, even task-specific property, no data on key length is currently available. Since the intermediate data are random, memcomparable data, not even the sort is likely affected. It exists as a tunable as no default value is appropriate, but future versions will likely replace it with trace data.

Simplifying Assumptions

Gridmix will be developed in stages, incorporating feedback and patches from the community. Currently, its intent is to evaluate Map/Reduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and subproject space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in Gridmix:

CPU usage: We have no data for per-task CPU usage, so we cannot attempt even an approximation. Gridmix tasks are never CPU bound independent of I/O, though this surely happens in practice.

Filesystem properties: No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate, or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily used parts of the system (the compression libraries, text processing, streaming, etc.) cannot be meaningfully tested with the current implementation.

I/O rates: The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.

Memory profile: No data on tasks' memory usage over time is available, though the max heap size is retained.

Skew: The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.

Job failure: User code is assumed to be correct.

Job independence: The output or outcome of one job does not affect when or whether a subsequent job will run.

Appendix

Issues tracking the implementations of gridmix1, gridmix2, and gridmix3 can be found on the Map/Reduce JIRA. Other issues tracking the development of Gridmix can be found by searching the Map/Reduce JIRA.

GridMix

Overview

GridMix is a benchmark for Hadoop clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

There exist three versions of the GridMix tool. This document discusses the third (checked into src/contrib), distinct from the two checked into the src/benchmarks sub-directory. While the first two versions of the tool included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. In support of a broader range of deployments and finer-tuned job mixes, this version of the tool will attempt to model the resource profiles of production jobs to identify bottlenecks, guide development, and serve as a replacement for the existing GridMix benchmarks.

To run GridMix, you need a MapReduce job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see the Rumen documentation). GridMix also requires input data from which the synthetic jobs will read bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If you are running on a new cluster, an optional step generating input data may precede the run.

In order to emulate the load of production jobs from a given cluster on the same or another cluster, follow these steps (a command-line sketch follows the list):

  1. Locate the job history files on the production cluster. This location is specified by the mapreduce.jobtracker.jobhistory.completed.location configuration property of the cluster.
  2. Run Rumen to build a job trace in JSON format for all or selected jobs.
  3. (Optional) Use Rumen to fold this job trace to scale the load.
  4. Use GridMix with the job trace on the benchmark cluster.
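
As a rough command-line sketch of these four steps (all paths and the rumen.jar name below are hypothetical; the Rumen class and option names are those from the Rumen documentation and may differ across Hadoop versions):

# 1. Fetch the completed job history files from the production cluster;
#    the source path is whatever
#    mapreduce.jobtracker.jobhistory.completed.location points to there.
hadoop fs -get /mapred/history/done /tmp/history

# 2. Build a JSON job trace (plus a cluster topology file) with Rumen.
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json file:///tmp/topology.json file:///tmp/history

# 3. (Optional) Fold the trace down to a one-hour load, sampling in
#    20-minute input cycles.
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.Folder \
  -input-cycle 20m -output-duration 1h \
  file:///tmp/job-trace.json file:///tmp/folded-trace.json

# 4. Run GridMix on the benchmark cluster with the (folded) trace.
hadoop jar contrib/gridmix/hadoop-$VERSION-gridmix.jar \
  org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 10g /gridmix/io file:///tmp/folded-trace.json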

Jobs submitted by GridMix have names of the form "GRIDMIXnnnnn", where "nnnnn" is a sequence number padded with leading zeroes.

Usage

Basic command-line usage without configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>

Basic command-line usage with configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
  [-generate <size>] [-users <users-list>] <iopath> <trace>

Note: Configuration parameters like -Dgridmix.client.submit.threads=10 and -Dgridmix.output.directory=foo, as given above, should be used before other GridMix parameters.

The -generate option is used to generate input data for the synthetic jobs. It accepts size suffixes, e.g. 100g will generate 100 * 2^30 bytes, i.e. 100 GiB.

The -users option is used to point to a users-list file (see Emulating Users and Queues).

The <iopath> parameter is the destination directory for generated output and/or the directory from which input data will be read. Note that this can be either on the local file-system or on HDFS, but it is highly recommended that it be the same as for the original job mix, so that GridMix puts the same load on the local file-system and HDFS respectively.

The <trace> parameter is a path to a job trace generated by Rumen. This trace can be compressed (it must be readable using one of the compression codecs supported by the cluster) or uncompressed. Use "-" as the value of this parameter if you want to pass an uncompressed trace via the standard input-stream of GridMix.

The class org.apache.hadoop.mapred.gridmix.Gridmix can be found in the JAR contrib/gridmix/hadoop-$VERSION-gridmix.jar inside your Hadoop installation, where $VERSION corresponds to the version of Hadoop installed. A simple way of ensuring that this class and all its dependencies are loaded correctly is to use the hadoop wrapper script that ships with Hadoop:

hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  [-generate <size>] [-users <users-list>] <iopath> <trace>
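
For instance, combining the pieces above, a gzipped trace can be decompressed on the fly and fed to GridMix on its standard input-stream by passing "-" as the <trace> argument. This is only a sketch; the JAR name and all paths are hypothetical:

# Generate ~10 GiB of input data under /gridmix/io, then run the
# synthetic jobs from an uncompressed trace streamed in on stdin.
gunzip -c /tmp/job-trace.json.gz | \
  hadoop jar contrib/gridmix/hadoop-$VERSION-gridmix.jar \
  org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 10g /gridmix/io -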

The supported configuration parameters are explained in the following sections.


General Configuration Parameters

gridmix.output.directory
    The directory into which output will be written. If specified, iopath will be relative to this parameter. The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run. The default is "gridmix".

gridmix.client.submit.threads
    The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace. Splits are pre-generated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so you should raise this cautiously. The default is 1 for the SERIAL job-submission policy (see Job Submission Policies) and one more than the number of processors on the client machine for the other policies.

gridmix.submit.multiplier
    The multiplier to accelerate or decelerate the submission of jobs. The time separating two jobs is multiplied by this factor. The default value is 1.0. This is a crude mechanism to size a job trace to a cluster.

gridmix.client.pending.queue.depth
    The depth of the queue of job descriptions awaiting split generation. The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this. The default is 5.

gridmix.gen.blocksize
    The block-size of generated data. The default value is 256 MiB.

gridmix.gen.bytes.per.file
    The maximum bytes written per file. The default value is 1 GiB.

gridmix.min.file.size
    The minimum size of the input files. The default limit is 128 MiB. Tweak this parameter if you see an error-message like "Found no satisfactory file" while testing GridMix with a relatively-small input data-set.

gridmix.max.total.scan
    The maximum size of the input files. The default limit is 100 TiB.
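
As a sketch of how the data-generation parameters combine with -generate (the values are illustrative, and it is assumed here, not confirmed by this document, that the two gridmix.gen.* parameters take raw byte counts):

# Generate 50 GiB of input as (at most) 512 MiB files with 128 MiB
# blocks, instead of the default 1 GiB files with 256 MiB blocks.
# The byte-count units for these two parameters are an assumption.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.gen.bytes.per.file=536870912 \
  -Dgridmix.gen.blocksize=134217728 \
  -generate 50g /gridmix/io file:///tmp/job-trace.json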

Job Types

GridMix takes as input a job trace, essentially a stream of JSON-encoded job descriptions. For each job description, the submission client obtains the original job submission time and, for each task in that job, the byte and record counts read and written. Given this data, it constructs a synthetic job with the same byte and record patterns as recorded in the trace. It constructs jobs of two types:

LOADJOB
    A synthetic job that emulates the workload mentioned in the Rumen trace. The current version emulates the I/O workload on the benchmark cluster. It does so by embedding the detailed I/O information for every map and reduce task, such as the number of bytes and records read and written, into each job's input splits. The map tasks further relay the I/O patterns of reduce tasks through the intermediate map output data.

SLEEPJOB
    A synthetic job where each task does nothing but sleep for a certain duration as observed in the production trace. The scalability of the JobTracker is often limited by how many heartbeats it can handle every second. (Heartbeats are periodic messages sent from TaskTrackers to update their status and grab new tasks from the JobTracker.) Since a benchmark cluster is typically a fraction of the size of a production cluster, the heartbeat traffic generated by the slave nodes is well below the level of the production cluster. One possible solution is to run multiple TaskTrackers on each slave node, but this leads to the obvious problem that the I/O workload generated by the synthetic jobs would thrash the slave nodes. Hence the need for a job that sleeps instead of doing I/O.

The following configuration parameters affect the job type:

gridmix.job.type
    The value for this key can be one of LOADJOB or SLEEPJOB. The default value is LOADJOB.

gridmix.key.fraction
    For a LOADJOB type of job, the fraction of a record used for the data for the key. The default value is 0.1.

gridmix.sleep.maptask-only
    For a SLEEPJOB type of job, whether to ignore the reduce tasks for the job. The default is false.

gridmix.sleep.fake-locations
    For a SLEEPJOB type of job, the number of fake locations for map tasks for the job. The default is 0.

gridmix.sleep.max-map-time
    For a SLEEPJOB type of job, the maximum runtime for map tasks for the job in milliseconds. The default is unlimited.

gridmix.sleep.max-reduce-time
    For a SLEEPJOB type of job, the maximum runtime for reduce tasks for the job in milliseconds. The default is unlimited.
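
Putting the sleep-job parameters together, a map-only sleep replay capped at one minute per map task might look like the following sketch (JAR and trace paths hypothetical):

# Replay the trace as sleep jobs, ignore reduce tasks, and cap each
# map task's sleep at 60 seconds (60000 ms).
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job.type=SLEEPJOB \
  -Dgridmix.sleep.maptask-only=true \
  -Dgridmix.sleep.max-map-time=60000 \
  /gridmix/io file:///tmp/job-trace.json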

Job Submission Policies

GridMix controls the rate of job submission. This control can be based on the trace information or on statistics GridMix gathers from the JobTracker. Based on the submission policy the user defines, GridMix uses the respective algorithm to control job submission. There are currently three types of policies:

STRESS
    Keep submitting jobs so that the cluster remains under stress. In this mode we control the rate of job submission by monitoring the real-time load of the cluster, so that we can maintain a stable stress level of workload on the cluster. Based on the statistics we gather, we decide whether a cluster is underloaded or overloaded. We consider a cluster underloaded if and only if all three of the following conditions are true:

      1. the number of pending and running jobs is under a threshold TJ
      2. the number of pending and running maps is under a threshold TM
      3. the number of pending and running reduces is under a threshold TR

    The thresholds TJ, TM and TR are proportional to the size of the cluster and to the map and reduce slot capacities, respectively. When the cluster is overloaded, we throttle job submission. In the actual calculation we also weigh each running task by its remaining work; for example, a 90%-complete task is counted as only 0.1 of a task. Finally, to avoid a very large job blocking other jobs, we limit the number of pending/waiting tasks each job can contribute.

REPLAY
    In this mode we replay the job traces faithfully. This mode exactly follows the time intervals given in the actual job trace.

SERIAL
    In this mode we submit the next job only once the job submitted earlier has completed.

The following configuration parameters affect the job submission policy:

gridmix.job-submission.policy
    The value for this key is one of STRESS, REPLAY or SERIAL. In most cases the value would be STRESS or REPLAY. The default value is STRESS.

gridmix.throttle.jobs-to-tracker-ratio
    In STRESS mode, the minimum ratio of running jobs to TaskTrackers in a cluster for the cluster to be considered overloaded. This is the threshold TJ referred to earlier. The default is 1.0.

gridmix.throttle.maps.task-to-slot-ratio
    In STRESS mode, the minimum ratio of pending and running map tasks (i.e. incomplete map tasks) to the number of map slots for a cluster for the cluster to be considered overloaded. This is the threshold TM referred to earlier. Running map tasks are counted partially; for example, a 40%-complete map task is counted as 0.6 of a map task. The default is 2.0.

gridmix.throttle.reduces.task-to-slot-ratio
    In STRESS mode, the minimum ratio of pending and running reduce tasks (i.e. incomplete reduce tasks) to the number of reduce slots for a cluster for the cluster to be considered overloaded. This is the threshold TR referred to earlier. Running reduce tasks are counted partially; for example, a 30%-complete reduce task is counted as 0.7 of a reduce task. The default is 2.5.

gridmix.throttle.maps.max-slot-share-per-job
    In STRESS mode, the maximum share of a cluster's map-slot capacity that can be counted toward a job's incomplete map tasks in the overload calculation. The default is 0.1.

gridmix.throttle.reducess.max-slot-share-per-job
    In STRESS mode, the maximum share of a cluster's reduce-slot capacity that can be counted toward a job's incomplete reduce tasks in the overload calculation. The default is 0.1.
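
For example, a faithful replay that also halves the original gaps between job submissions could be sketched as follows (JAR and trace paths hypothetical; gridmix.submit.multiplier is described under General Configuration Parameters):

# Follow the trace's submission intervals (REPLAY) but multiply the
# time between consecutive jobs by 0.5, i.e. submit twice as fast.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job-submission.policy=REPLAY \
  -Dgridmix.submit.multiplier=0.5 \
  /gridmix/io file:///tmp/job-trace.json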

Emulating Users and Queues

Typical production clusters are often shared by different users, and the cluster capacity is divided among different departments through job queues. Ensuring fairness among jobs from all users, honoring queue capacity allocation policies, and preventing an ill-behaved job from taking over the cluster add significant complexity to the Hadoop software. To be able to sufficiently test and discover bugs in these areas, GridMix must emulate the contention among jobs from different users and/or submitted to different queues.

Emulating multiple queues is easy: we simply set up the benchmark cluster with the same queue configuration as the production cluster and configure the synthetic jobs so that they get submitted to the same queues as recorded in the trace. However, not all users shown in the trace have accounts on the benchmark cluster. Instead, we set up a number of testing user accounts and associate each unique user in the trace with one of the testing users in a round-robin fashion.

The following configuration parameters affect the emulation of users and queues:

gridmix.job-submission.use-queue-in-trace
    When set to true, GridMix uses exactly the same set of queues as those mentioned in the trace. The default value is false.

gridmix.job-submission.default-queue
    Specifies the default queue to which all the jobs will be submitted. If this parameter is not specified, GridMix uses the default queue defined for the submitting user on the cluster.

gridmix.user.resolve.class
    Specifies which UserResolver implementation to use. There are currently three implementations:

      1. org.apache.hadoop.mapred.gridmix.EchoUserResolver - submits a job as the user who submitted the original job. All the users of the production cluster identified in the job trace must also have accounts on the benchmark cluster in this case.
      2. org.apache.hadoop.mapred.gridmix.SubmitterUserResolver - submits all the jobs as the current GridMix user. In this case we simply map all the users in the trace to the current GridMix user and submit the jobs.
      3. org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver - maps trace users to test users in a round-robin fashion. In this case we set up a number of testing user accounts and associate each unique user in the trace with one of the testing users in round-robin fashion.

    The default is org.apache.hadoop.mapred.gridmix.SubmitterUserResolver.

If the parameter gridmix.user.resolve.class is set to org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver, we need to define a users-list file with a list of test users and groups. This is specified using the -users option to GridMix.

Note: Specifying a users-list file using the -users option is mandatory when using the round-robin user-resolver. Other user-resolvers ignore this option.

A users-list file has one user-group-information (UGI) entry per line, each UGI in the format:

<username>,<group>[,group]*

For example:

user1,group1
user2,group2,group3
user3,group3,group4

In the above example we have defined three users, user1, user2 and user3, with their respective groups. Each unique user in the trace is then associated with one of the above users in round-robin fashion. For example, if the trace's users are tuser1, tuser2, tuser3, tuser4 and tuser5, then the mappings would be:

tuser1 -> user1
tuser2 -> user2
tuser3 -> user3
tuser4 -> user1
tuser5 -> user2
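
A complete round-robin run could then be sketched as follows (the users file and the other paths are hypothetical):

# Map trace users onto the test users from users.txt in round-robin
# fashion, reusing the queues recorded in the trace.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
  -Dgridmix.job-submission.use-queue-in-trace=true \
  -users file:///tmp/users.txt \
  /gridmix/io file:///tmp/job-trace.json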

Simplifying Assumptions

GridMix will be developed in stages, incorporating feedback and patches from the community. Currently, its intent is to evaluate MapReduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and sub-project space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in GridMix:

  • CPU Usage - We have no data for per-task CPU usage, so we cannot even attempt an approximation. GridMix tasks are never CPU-bound independent of I/O, though this surely happens in practice.

  • Filesystem Properties - No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily-used parts of the system (the compression libraries, text processing, streaming, etc.) cannot be meaningfully tested with the current implementation.

  • I/O Rates - The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.

  • Memory Profile - No data on tasks' memory usage over time is available, though the max heap-size is retained.

  • Skew - The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.

  • Job Failure - User code is assumed to be correct.

  • Job Independence - The output or outcome of one job does not affect when or whether a subsequent job will run.

Appendix

Issues tracking the original implementations of GridMix1, GridMix2, and GridMix3 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking the current development of GridMix can be found by searching the Apache Hadoop MapReduce JIRA.
