Subject: svn commit: r823229 - in /hadoop/mapreduce/branches/branch-0.21: CHANGES.txt src/contrib/gridmix/README src/docs/src/documentation/content/xdocs/gridmix.xml src/docs/src/documentation/content/xdocs/site.xml
Date: Thu, 08 Oct 2009 16:41:57 -0000
To: mapreduce-commits@hadoop.apache.org
From: cdouglas@apache.org
Reply-To: mapreduce-dev@hadoop.apache.org
Message-Id: <20091008164157.524F9238890E@eris.apache.org>

Author: cdouglas
Date: Thu Oct 8 16:41:56 2009
New Revision: 823229

URL: http://svn.apache.org/viewvc?rev=823229&view=rev
Log:
MAPREDUCE-1063.
Document gridmix benchmark.

Added:
    hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README
    hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml
Modified:
    hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
    hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml

Modified: hadoop/mapreduce/branches/branch-0.21/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/CHANGES.txt?rev=823229&r1=823228&r2=823229&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/CHANGES.txt (original)
+++ hadoop/mapreduce/branches/branch-0.21/CHANGES.txt Thu Oct 8 16:41:56 2009
@@ -414,6 +414,8 @@
     MAPREDUCE-639. Change Terasort example to reflect the 2009 updates. (omalley)
 
+    MAPREDUCE-1063. Document gridmix benchmark. (cdouglas)
+
   BUG FIXES
 
     MAPREDUCE-878. Rename fair scheduler design doc to

Added: hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README?rev=823229&view=auto
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README (added)
+++ hadoop/mapreduce/branches/branch-0.21/src/contrib/gridmix/README Thu Oct 8 16:41:56 2009
@@ -0,0 +1,22 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+This project implements the third version of Gridmix, a benchmark for live
+clusters. Given a description of jobs (a "trace") annotated with information
+about I/O, memory, etc. a synthetic mix of jobs will be generated and submitted
+to the cluster.
+
+Documentation of usage and configuration properties in forrest is available in
+src/docs/src/documentation/content/xdocs/gridmix.xml

Added: hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml?rev=823229&view=auto
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml (added)
+++ hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/gridmix.xml Thu Oct 8 16:41:56 2009
@@ -0,0 +1,164 @@
+ Gridmix +
+ + + +
+ Overview + +

Gridmix is a benchmark for live clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

+ +

There are three versions of the Gridmix tool. This document discusses the third (checked into contrib), which is distinct from the two checked into the benchmarks subdirectory. Although the first two versions included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. To support a broader range of deployments and finer-tuned job mixes, this version attempts to model the resource profiles of production jobs in order to identify bottlenecks, guide development, and serve as a replacement for the existing gridmix benchmarks.

+ +
+ +
+ + Usage + +

To run Gridmix, one needs a job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see related documentation). Gridmix also requires input data from which the synthetic jobs will draw bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. On a new cluster, an optional step that generates input data may precede the run.

+ +

Basic command line usage:

bin/mapred org.apache.hadoop.mapred.gridmix.Gridmix [-generate <MiB>] <iopath> <trace>

The -generate parameter accepts standard units, e.g. 100g will generate 100 * 2^30 bytes. The <iopath> parameter is the destination directory for generated data and/or the directory from which input data will be read. The <trace> parameter is a path to a job trace. The following configuration parameters are also accepted in the standard idiom, before other Gridmix parameters.
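The unit arithmetic for -generate can be checked with a small sketch. This is not Gridmix code: the function name size_to_bytes and the suffix set (k, m, g, t) are assumptions made for illustration, and only the "g" case is confirmed by the text above. Values without a suffix are rejected here because the document does not say how they are interpreted.

```python
# Illustrative only: binary (power-of-two) multipliers for assumed suffixes.
BINARY_UNITS = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}


def size_to_bytes(text):
    """Convert a suffixed size such as '100g' to a byte count."""
    value = text.strip().lower()
    if len(value) < 2 or value[-1] not in BINARY_UNITS:
        # Unsuffixed input is not covered by the source text, so refuse it.
        raise ValueError(f"expected a number followed by k, m, g or t: {text!r}")
    return int(value[:-1]) * BINARY_UNITS[value[-1]]


if __name__ == "__main__":
    print(size_to_bytes("100g"))  # 100 * 2**30
```

With this reading, 100g is a little over 107 billion bytes, which is worth comparing against any quota on the target directory before generating data.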

+ +
+ Configuration parameters +

+ + + + + + + + + + + + + + +
Parameter | Description | Notes
gridmix.output.directory | The directory into which output will be written. If specified, the iopath will be relative to this parameter. | The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run.
gridmix.client.submit.threads | The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace. | Splits are pregenerated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so one should raise this cautiously.
gridmix.client.pending.queue.depth | The depth of the queue of job descriptions awaiting split generation. | The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this.
gridmix.min.key.length | The key size for jobs submitted to the cluster. | While this is clearly a job-specific, even task-specific property, no data on key length is currently available. Since the intermediate data are random, memcomparable data, not even the sort is likely affected. It exists as a tunable as no default value is appropriate, but future versions will likely replace it with trace data.
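As a rough illustration of passing the parameters in the table before the other Gridmix arguments, the sketch below assembles an argument list. It assumes that "the standard idiom" refers to Hadoop's generic -D property=value options; the helper name gridmix_command and the example paths and values are hypothetical.

```python
def gridmix_command(iopath, trace, generate=None, properties=None):
    """Build an argv list: -D options first, then -generate, then the paths."""
    argv = ["bin/mapred", "org.apache.hadoop.mapred.gridmix.Gridmix"]
    for key, value in (properties or {}).items():
        # Assumption: configuration is passed as generic "-D key=value" options.
        argv += ["-D", f"{key}={value}"]
    if generate is not None:
        argv += ["-generate", generate]
    argv += [iopath, trace]
    return argv


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the argument order.
    cmd = gridmix_command(
        "gridmix-io",
        "trace.json",
        generate="100g",
        properties={"gridmix.client.submit.threads": 2},
    )
    print(" ".join(cmd))
```

Keeping the properties in a dictionary makes it easy to raise gridmix.client.submit.threads for one run without touching the positional arguments.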
+ +
+
+ +
+ + Simplifying Assumptions + +

Gridmix will be developed in stages, incorporating feedback and patches from the community. Currently, its intent is to evaluate Map/Reduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and subproject space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in Gridmix.

+ + + + + + + + + + +
Property | Notes
CPU usage | We have no data for per-task CPU usage, so we cannot attempt even an approximation. Gridmix tasks are never CPU bound independent of I/O, though this surely happens in practice.
Filesystem properties | No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate, or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily used parts of the system (the compression libraries, text processing, streaming, etc.) cannot be meaningfully tested with the current implementation.
I/O rates | The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.
Memory profile | No data on tasks' memory usage over time is available, though the max heap size is retained.
Skew | The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.
Job failure | User code is assumed to be correct.
Job independence | The output or outcome of one job does not affect when or whether a subsequent job will run.
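The skew assumption can be made concrete with a small sketch. This is one possible reading of "a proportional percentage of data for each reduce", not the Gridmix implementation: every map is assumed to divide its output across reduces using the same ratios, taken from each reduce's share of the total reduce input. The function name and the byte counts are hypothetical.

```python
def split_map_output(map_output_bytes, reduce_input_bytes):
    """Divide one map's output across reduces by each reduce's overall share."""
    total = sum(reduce_input_bytes)
    if total == 0:
        return [0.0] * len(reduce_input_bytes)
    return [map_output_bytes * r / total for r in reduce_input_bytes]


if __name__ == "__main__":
    # Hypothetical job: two maps emitting 100 and 300 bytes, and two reduces
    # that consumed 300 and 100 bytes in the trace.
    reduce_inputs = [300, 100]
    for map_bytes in (100, 300):
        print(map_bytes, split_map_output(map_bytes, reduce_inputs))
```

Under this reading each reduce still receives its recorded total, but every map contributes in the same 3:1 ratio, so a map that originally sent all of its output to a single reduce is no longer reproduced.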
+ +
+ +
+ + Appendix + +

Issues tracking the implementations of gridmix1, gridmix2, and gridmix3. Other issues tracking the development of Gridmix can be found by searching the Map/Reduce JIRA.

+ +
+ + + +
Modified: hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml?rev=823229&r1=823228&r2=823229&view=diff
==============================================================================
--- hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml (original)
+++ hadoop/mapreduce/branches/branch-0.21/src/docs/src/documentation/content/xdocs/site.xml Thu Oct 8 16:41:56 2009
@@ -43,6 +43,7 @@
+