Subject: svn commit: r1036409 - in /hadoop/mapreduce/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/gridmix.xml
From: vinodkv@apache.org
To: mapreduce-commits@hadoop.apache.org
Date: Thu, 18 Nov 2010 11:12:48 -0000
Message-Id: <20101118111248.ACF9723888A6@eris.apache.org>

Author: vinodkv
Date: Thu Nov 18 11:12:48 2010
New Revision: 1036409

URL: http://svn.apache.org/viewvc?rev=1036409&view=rev
Log:
MAPREDUCE-1931. Gridmix forrest documentation. Contributed by Ranjit Mathew.

Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=1036409&r1=1036408&r2=1036409&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Thu Nov 18 11:12:48 2010
@@ -186,6 +186,8 @@ Release 0.22.0 - Unreleased
     MAPREDUCE-2167. Faster directory traversal for raid node. (Ramkumar
    Vadali via schen)

+    MAPREDUCE-1931. Gridmix forrest documentation. (Ranjit Mathew via vinodkv)
+
   OPTIMIZATIONS

     MAPREDUCE-1354. Enhancements to JobTracker for better performance and

Modified: hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml?rev=1036409&r1=1036408&r2=1036409&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml (original)
+++ hadoop/mapreduce/trunk/src/docs/src/documentation/content/xdocs/gridmix.xml Thu Nov 18 11:12:48 2010
@@ -15,150 +15,542 @@
 See the License for the specific language governing permissions and
 limitations under the License.
-->

Gridmix

Overview

Gridmix is a benchmark for live clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

There exist three versions of the Gridmix tool. This document discusses the third (checked into contrib), distinct from the two checked into the benchmarks subdirectory. While the first two versions of the tool included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. In support of a broader range of deployments and finer-tuned job mixes, this version of the tool will attempt to model the resource profiles of production jobs to identify bottlenecks, guide development, and serve as a replacement for the existing gridmix benchmarks.

Usage

To run Gridmix, one requires a job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see the related documentation). Gridmix also requires input data from which the synthetic jobs will draw bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If one is running on a new cluster, an optional step generating input data may precede the run.


Basic command line usage:

bin/mapred org.apache.hadoop.mapred.gridmix.Gridmix [-generate <MiB>] <iopath> <trace>

The -generate parameter accepts standard size suffixes, e.g. 100g will generate 100 * 2^30 bytes (i.e. 100 GiB). The <iopath> parameter is the destination directory for generated data and/or the directory from which input data will be read. The <trace> parameter is a path to a job trace. The following configuration parameters are also accepted in the standard idiom, before other Gridmix parameters.

Configuration parameters

Parameter: gridmix.output.directory
Description: The directory into which output will be written. If specified, the iopath will be relative to this parameter.
Notes: The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run.

Parameter: gridmix.client.submit.threads
Description: The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace.
Notes: Splits are pregenerated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so one should raise this cautiously.

Parameter: gridmix.client.pending.queue.depth
Description: The depth of the queue of job descriptions awaiting split generation.
Notes: The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this.

Parameter: gridmix.min.key.length
Description: The key size for jobs submitted to the cluster.
Notes: While this is clearly a job-specific, even task-specific property, no data on key length is currently available. Since the intermediate data are random, memcomparable data, not even the sort is likely affected. It exists as a tunable as no default value is appropriate, but future versions will likely replace it with trace data.

Simplifying Assumptions

Gridmix will be developed in stages, incorporating feedback and patches from the community. Currently, its intent is to evaluate Map/Reduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and subproject space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in Gridmix:

CPU usage: We have no data for per-task CPU usage, so we cannot attempt even an approximation. Gridmix tasks are never CPU bound independent of I/O, though this surely happens in practice.

Filesystem properties: No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate, or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily used parts of the system (the compression libraries, text processing, streaming, etc.) cannot be meaningfully tested with the current implementation.

I/O rates: The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.

Memory profile: No data on tasks' memory usage over time is available, though the max heap size is retained.

Skew: The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.

Job failure: User code is assumed to be correct.

Job independence: The output or outcome of one job does not affect when or whether a subsequent job will run.

Appendix

Issues tracking the implementations of gridmix1, gridmix2, and gridmix3 can be found on the Map/Reduce JIRA. Other issues tracking the development of Gridmix can be found by searching the Map/Reduce JIRA.

GridMix

Overview

GridMix is a benchmark for Hadoop clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads.

There exist three versions of the GridMix tool. This document discusses the third (checked into src/contrib), distinct from the two checked into the src/benchmarks sub-directory. While the first two versions of the tool included stripped-down versions of common jobs, both were principally saturation tools for stressing the framework at scale. In support of a broader range of deployments and finer-tuned job mixes, this version of the tool will attempt to model the resource profiles of production jobs to identify bottlenecks, guide development, and serve as a replacement for the existing GridMix benchmarks.

To run GridMix, you need a MapReduce job trace describing the job mix for a given cluster. Such traces are typically generated by Rumen (see the Rumen documentation). GridMix also requires input data from which the synthetic jobs will read bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If you are running on a new cluster, an optional step generating input data may precede the run.

In order to emulate the load of production jobs from a given cluster on the same or another cluster, follow these steps (a command-line sketch follows the list):

  1. Locate the job history files on the production cluster. This location is specified by the mapreduce.jobtracker.jobhistory.completed.location configuration property of the cluster.
  2. Run Rumen to build a job trace in JSON format for all or selected jobs.
  3. (Optional) Use Rumen to fold this job trace to scale the load.
  4. Use GridMix with the job trace on the benchmark cluster.
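
As a rough command-line sketch of these four steps (all paths and the rumen.jar name below are hypothetical; the Rumen class and option names are those from the Rumen documentation and may differ across Hadoop versions):

# 1. Fetch the completed job history files from the production cluster;
#    the source path is whatever
#    mapreduce.jobtracker.jobhistory.completed.location points to there.
hadoop fs -get /mapred/history/done /tmp/history

# 2. Build a JSON job trace (plus a cluster topology file) with Rumen.
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json file:///tmp/topology.json file:///tmp/history

# 3. (Optional) Fold the trace down to a one-hour load, sampling in
#    20-minute input cycles.
hadoop jar rumen.jar org.apache.hadoop.tools.rumen.Folder \
  -input-cycle 20m -output-duration 1h \
  file:///tmp/job-trace.json file:///tmp/folded-trace.json

# 4. Run GridMix on the benchmark cluster with the (folded) trace.
hadoop jar contrib/gridmix/hadoop-$VERSION-gridmix.jar \
  org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 10g /gridmix/io file:///tmp/folded-trace.json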

Jobs submitted by GridMix have names of the form "GRIDMIXnnnnn", where "nnnnn" is a sequence number padded with leading zeroes.

Usage

Basic command-line usage without configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>

Basic command-line usage with configuration parameters:

org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
  [-generate <size>] [-users <users-list>] <iopath> <trace>

Note: Configuration parameters like -Dgridmix.client.submit.threads=10 and -Dgridmix.output.directory=foo, as given above, should be used before other GridMix parameters.

The -generate option is used to generate input data for the synthetic jobs. It accepts size suffixes, e.g. 100g will generate 100 * 2^30 bytes, i.e. 100 GiB.

The -users option is used to point to a users-list file (see Emulating Users and Queues).

The <iopath> parameter is the destination directory for generated output and/or the directory from which input data will be read. Note that this can be either on the local file-system or on HDFS, but it is highly recommended that it be the same as for the original job mix, so that GridMix puts the same load on the local file-system and HDFS respectively.

The <trace> parameter is a path to a job trace generated by Rumen. This trace can be compressed (it must be readable using one of the compression codecs supported by the cluster) or uncompressed. Use "-" as the value of this parameter if you want to pass an uncompressed trace via the standard input-stream of GridMix.

The class org.apache.hadoop.mapred.gridmix.Gridmix can be found in the JAR contrib/gridmix/hadoop-$VERSION-gridmix.jar inside your Hadoop installation, where $VERSION corresponds to the version of Hadoop installed. A simple way of ensuring that this class and all its dependencies are loaded correctly is to use the hadoop wrapper script that ships with Hadoop:

hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  [-generate <size>] [-users <users-list>] <iopath> <trace>
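
For instance, combining the pieces above, a gzipped trace can be decompressed on the fly and fed to GridMix on its standard input-stream by passing "-" as the <trace> argument. This is only a sketch; the JAR name and all paths are hypothetical:

# Generate ~10 GiB of input data under /gridmix/io, then run the
# synthetic jobs from an uncompressed trace streamed in on stdin.
gunzip -c /tmp/job-trace.json.gz | \
  hadoop jar contrib/gridmix/hadoop-$VERSION-gridmix.jar \
  org.apache.hadoop.mapred.gridmix.Gridmix \
  -generate 10g /gridmix/io -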

The supported configuration parameters are explained in the following sections.


General Configuration Parameters

gridmix.output.directory
    The directory into which output will be written. If specified, iopath will be relative to this parameter. The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run. The default is "gridmix".

gridmix.client.submit.threads
    The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace. Splits are pre-generated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so you should raise this cautiously. The default is 1 for the SERIAL job-submission policy (see Job Submission Policies) and one more than the number of processors on the client machine for the other policies.

gridmix.submit.multiplier
    The multiplier to accelerate or decelerate the submission of jobs. The time separating two jobs is multiplied by this factor. The default value is 1.0. This is a crude mechanism to size a job trace to a cluster.

gridmix.client.pending.queue.depth
    The depth of the queue of job descriptions awaiting split generation. The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this. The default is 5.

gridmix.gen.blocksize
    The block-size of generated data. The default value is 256 MiB.

gridmix.gen.bytes.per.file
    The maximum bytes written per file. The default value is 1 GiB.

gridmix.min.file.size
    The minimum size of the input files. The default limit is 128 MiB. Tweak this parameter if you see an error-message like "Found no satisfactory file" while testing GridMix with a relatively-small input data-set.

gridmix.max.total.scan
    The maximum size of the input files. The default limit is 100 TiB.
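
As a sketch of how the data-generation parameters combine with -generate (the values are illustrative, and it is assumed here, not confirmed by this document, that the two gridmix.gen.* parameters take raw byte counts):

# Generate 50 GiB of input as (at most) 512 MiB files with 128 MiB
# blocks, instead of the default 1 GiB files with 256 MiB blocks.
# The byte-count units for these two parameters are an assumption.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.gen.bytes.per.file=536870912 \
  -Dgridmix.gen.blocksize=134217728 \
  -generate 50g /gridmix/io file:///tmp/job-trace.json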

Job Types

GridMix takes as input a job trace, essentially a stream of JSON-encoded job descriptions. For each job description, the submission client obtains the original job submission time and, for each task in that job, the byte and record counts read and written. Given this data, it constructs a synthetic job with the same byte and record patterns as recorded in the trace. It constructs jobs of two types:

LOADJOB
    A synthetic job that emulates the workload mentioned in the Rumen trace. The current version emulates the I/O workload on the benchmark cluster. It does so by embedding the detailed I/O information for every map and reduce task, such as the number of bytes and records read and written, into each job's input splits. The map tasks further relay the I/O patterns of reduce tasks through the intermediate map output data.

SLEEPJOB
    A synthetic job where each task does nothing but sleep for a certain duration as observed in the production trace. The scalability of the JobTracker is often limited by how many heartbeats it can handle every second. (Heartbeats are periodic messages sent from TaskTrackers to update their status and grab new tasks from the JobTracker.) Since a benchmark cluster is typically a fraction of the size of a production cluster, the heartbeat traffic generated by the slave nodes is well below the level of the production cluster. One possible solution is to run multiple TaskTrackers on each slave node, but this leads to the obvious problem that the I/O workload generated by the synthetic jobs would thrash the slave nodes. Hence the need for a job that sleeps instead of doing I/O.

The following configuration parameters affect the job type:

gridmix.job.type
    The value for this key can be one of LOADJOB or SLEEPJOB. The default value is LOADJOB.

gridmix.key.fraction
    For a LOADJOB type of job, the fraction of a record used for the data for the key. The default value is 0.1.

gridmix.sleep.maptask-only
    For a SLEEPJOB type of job, whether to ignore the reduce tasks for the job. The default is false.

gridmix.sleep.fake-locations
    For a SLEEPJOB type of job, the number of fake locations for map tasks for the job. The default is 0.

gridmix.sleep.max-map-time
    For a SLEEPJOB type of job, the maximum runtime for map tasks for the job in milliseconds. The default is unlimited.

gridmix.sleep.max-reduce-time
    For a SLEEPJOB type of job, the maximum runtime for reduce tasks for the job in milliseconds. The default is unlimited.
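
Putting the sleep-job parameters together, a map-only sleep replay capped at one minute per map task might look like the following sketch (JAR and trace paths hypothetical):

# Replay the trace as sleep jobs, ignore reduce tasks, and cap each
# map task's sleep at 60 seconds (60000 ms).
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job.type=SLEEPJOB \
  -Dgridmix.sleep.maptask-only=true \
  -Dgridmix.sleep.max-map-time=60000 \
  /gridmix/io file:///tmp/job-trace.json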

Job Submission Policies

GridMix controls the rate of job submission. This control can be based on the trace information or on statistics GridMix gathers from the JobTracker. Based on the submission policy the user defines, GridMix uses the respective algorithm to control job submission. There are currently three types of policies:

STRESS
    Keep submitting jobs so that the cluster remains under stress. In this mode we control the rate of job submission by monitoring the real-time load of the cluster, so that we can maintain a stable stress level of workload on the cluster. Based on the statistics we gather, we decide whether a cluster is underloaded or overloaded. We consider a cluster underloaded if and only if all three of the following conditions are true:

      1. the number of pending and running jobs is under a threshold TJ
      2. the number of pending and running maps is under a threshold TM
      3. the number of pending and running reduces is under a threshold TR

    The thresholds TJ, TM and TR are proportional to the size of the cluster and to the map and reduce slot capacities, respectively. When the cluster is overloaded, we throttle job submission. In the actual calculation we also weigh each running task by its remaining work; for example, a 90%-complete task is counted as only 0.1 of a task. Finally, to avoid a very large job blocking other jobs, we limit the number of pending/waiting tasks each job can contribute.

REPLAY
    In this mode we replay the job traces faithfully. This mode exactly follows the time intervals given in the actual job trace.

SERIAL
    In this mode we submit the next job only once the job submitted earlier has completed.

The following configuration parameters affect the job submission policy:

gridmix.job-submission.policy
    The value for this key is one of STRESS, REPLAY or SERIAL. In most cases the value would be STRESS or REPLAY. The default value is STRESS.

gridmix.throttle.jobs-to-tracker-ratio
    In STRESS mode, the minimum ratio of running jobs to TaskTrackers in a cluster for the cluster to be considered overloaded. This is the threshold TJ referred to earlier. The default is 1.0.

gridmix.throttle.maps.task-to-slot-ratio
    In STRESS mode, the minimum ratio of pending and running map tasks (i.e. incomplete map tasks) to the number of map slots for a cluster for the cluster to be considered overloaded. This is the threshold TM referred to earlier. Running map tasks are counted partially; for example, a 40%-complete map task is counted as 0.6 of a map task. The default is 2.0.

gridmix.throttle.reduces.task-to-slot-ratio
    In STRESS mode, the minimum ratio of pending and running reduce tasks (i.e. incomplete reduce tasks) to the number of reduce slots for a cluster for the cluster to be considered overloaded. This is the threshold TR referred to earlier. Running reduce tasks are counted partially; for example, a 30%-complete reduce task is counted as 0.7 of a reduce task. The default is 2.5.

gridmix.throttle.maps.max-slot-share-per-job
    In STRESS mode, the maximum share of a cluster's map-slot capacity that can be counted toward a job's incomplete map tasks in the overload calculation. The default is 0.1.

gridmix.throttle.reducess.max-slot-share-per-job
    In STRESS mode, the maximum share of a cluster's reduce-slot capacity that can be counted toward a job's incomplete reduce tasks in the overload calculation. The default is 0.1.
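
For example, a faithful replay that also halves the original gaps between job submissions could be sketched as follows (JAR and trace paths hypothetical; gridmix.submit.multiplier is described under General Configuration Parameters):

# Follow the trace's submission intervals (REPLAY) but multiply the
# time between consecutive jobs by 0.5, i.e. submit twice as fast.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.job-submission.policy=REPLAY \
  -Dgridmix.submit.multiplier=0.5 \
  /gridmix/io file:///tmp/job-trace.json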

Emulating Users and Queues

Typical production clusters are often shared by different users, and the cluster capacity is divided among different departments through job queues. Ensuring fairness among jobs from all users, honoring queue capacity allocation policies, and preventing an ill-behaved job from taking over the cluster add significant complexity to the Hadoop software. To be able to sufficiently test and discover bugs in these areas, GridMix must emulate the contention among jobs from different users and/or submitted to different queues.

Emulating multiple queues is easy: we simply set up the benchmark cluster with the same queue configuration as the production cluster and configure the synthetic jobs so that they get submitted to the same queues as recorded in the trace. However, not all users shown in the trace have accounts on the benchmark cluster. Instead, we set up a number of testing user accounts and associate each unique user in the trace with one of the testing users in a round-robin fashion.

The following configuration parameters affect the emulation of users and queues:

gridmix.job-submission.use-queue-in-trace
    When set to true, GridMix uses exactly the same set of queues as those mentioned in the trace. The default value is false.

gridmix.job-submission.default-queue
    Specifies the default queue to which all the jobs will be submitted. If this parameter is not specified, GridMix uses the default queue defined for the submitting user on the cluster.

gridmix.user.resolve.class
    Specifies which UserResolver implementation to use. There are currently three implementations:

      1. org.apache.hadoop.mapred.gridmix.EchoUserResolver - submits a job as the user who submitted the original job. All the users of the production cluster identified in the job trace must also have accounts on the benchmark cluster in this case.
      2. org.apache.hadoop.mapred.gridmix.SubmitterUserResolver - submits all the jobs as the current GridMix user. In this case we simply map all the users in the trace to the current GridMix user and submit the jobs.
      3. org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver - maps trace users to test users in a round-robin fashion. In this case we set up a number of testing user accounts and associate each unique user in the trace with one of the testing users in round-robin fashion.

    The default is org.apache.hadoop.mapred.gridmix.SubmitterUserResolver.

If the parameter gridmix.user.resolve.class is set to org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver, we need to define a users-list file with a list of test users and groups. This is specified using the -users option to GridMix.

Note: Specifying a users-list file using the -users option is mandatory when using the round-robin user-resolver. Other user-resolvers ignore this option.

A users-list file has one user-group-information (UGI) entry per line, each UGI in the format:

<username>,<group>[,group]*

For example:

user1,group1
user2,group2,group3
user3,group3,group4

In the above example we have defined three users, user1, user2 and user3, with their respective groups. Each unique user in the trace is then associated with one of the above users in round-robin fashion. For example, if the trace's users are tuser1, tuser2, tuser3, tuser4 and tuser5, then the mappings would be:

tuser1 -> user1
tuser2 -> user2
tuser3 -> user3
tuser4 -> user1
tuser5 -> user2
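
A complete round-robin run could then be sketched as follows (the users file and the other paths are hypothetical):

# Map trace users onto the test users from users.txt in round-robin
# fashion, reusing the queues recorded in the trace.
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
  -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
  -Dgridmix.job-submission.use-queue-in-trace=true \
  -users file:///tmp/users.txt \
  /gridmix/io file:///tmp/job-trace.json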

Simplifying Assumptions

GridMix will be developed in stages, incorporating feedback and patches from the community. Currently, its intent is to evaluate MapReduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and sub-project space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in GridMix:

  • CPU Usage - We have no data for per-task CPU usage, so we cannot even attempt an approximation. GridMix tasks are never CPU-bound independent of I/O, though this surely happens in practice.

  • Filesystem Properties - No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily-used parts of the system (the compression libraries, text processing, streaming, etc.) cannot be meaningfully tested with the current implementation.

  • I/O Rates - The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.

  • Memory Profile - No data on tasks' memory usage over time is available, though the max heap-size is retained.

  • Skew - The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.

  • Job Failure - User code is assumed to be correct.

  • Job Independence - The output or outcome of one job does not affect when or whether a subsequent job will run.

Appendix

Issues tracking the original implementations of GridMix1, GridMix2, and GridMix3 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking the current development of GridMix can be found by searching the Apache Hadoop MapReduce JIRA.
