From: stevel@apache.org
To: mapreduce-commits@hadoop.apache.org
Subject: svn commit: r885145 [12/34] - in /hadoop/mapreduce/branches/MAPREDUCE-233: ./ .eclipse.templates/ .eclipse.templates/.launches/ conf/ ivy/ lib/ src/benchmarks/gridmix/ src/benchmarks/gridmix/pipesort/ src/benchmarks/gridmix2/ src/benchmarks/gridmix2/sr...
Date: Sat, 28 Nov 2009 20:26:22 -0000
Message-Id: <20091128202636.ED4282388A74@eris.apache.org>

Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml?rev=885145&r1=885144&r2=885145&view=diff
==============================================================================
--- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml (original)
+++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml Sat Nov 28 20:26:01 2009
@@ -21,7 +21,7 @@
- Map/Reduce Tutorial + MapReduce Tutorial
@@ -30,22 +30,21 @@ Purpose

This document comprehensively describes all user-facing facets of the - Hadoop Map/Reduce framework and serves as a tutorial. + Hadoop MapReduce framework and serves as a tutorial.

- Pre-requisites + Prerequisites -

Ensure that Hadoop is installed, configured and is running. More - details:

+

Make sure Hadoop is installed, configured and running. See these guides: +

@@ -53,12 +52,12 @@
Overview -

Hadoop Map/Reduce is a software framework for easily writing +

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

-

A Map/Reduce job usually splits the input data-set into +

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the @@ -67,13 +66,14 @@ tasks.

Typically the compute nodes and the storage nodes are the same, that is, - the Map/Reduce framework and the Hadoop Distributed File System (see HDFS Architecture ) + the MapReduce framework and the + Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

-

The Map/Reduce framework consists of a single master +

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The @@ -90,7 +90,7 @@ information to the job-client.

Although the Hadoop framework is implemented in JavaTM, - Map/Reduce applications need not be written in Java.

+ MapReduce applications need not be written in Java.

@@ -110,7 +110,7 @@
Inputs and Outputs -

The Map/Reduce framework operates exclusively on +

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of @@ -124,7 +124,7 @@ WritableComparable interface to facilitate sorting by the framework.
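For illustration only, a minimal sketch (the class name is hypothetical) of declaring these key and value classes on the job configuration, using the org.apache.hadoop.mapred API that the rest of this tutorial uses:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class TypeSetupSketch {
    // Declares the intermediate (<k2, v2>) and final (<k3, v3>) key/value classes.
    public static JobConf newConf() {
      JobConf conf = new JobConf(TypeSetupSketch.class);
      conf.setInputFormat(TextInputFormat.class);      // <k1, v1> is <LongWritable, Text>
      conf.setMapOutputKeyClass(Text.class);           // k2 - must implement WritableComparable
      conf.setMapOutputValueClass(IntWritable.class);  // v2 - must implement Writable
      conf.setOutputKeyClass(Text.class);              // k3
      conf.setOutputValueClass(IntWritable.class);     // v3
      return conf;
    }
  }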

-

Input and Output types of a Map/Reduce job:

+

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) @@ -145,14 +145,16 @@

Example: WordCount v1.0 -

Before we jump into the details, lets walk through an example Map/Reduce +

Before we jump into the details, lets walk through an example MapReduce application to get a flavour for how they work.

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

-

This works with a local-standalone, pseudo-distributed or fully-distributed - Hadoop installation(see Hadoop Quick Start).

+

This example works with a + pseudo-distributed (Single Node Setup) + or fully-distributed (Cluster Setup) + Hadoop installation.

Source Code @@ -605,17 +607,35 @@ would be present in the current working directory of the task using the option -files. The -libjars option allows applications to add jars to the classpaths of the maps - and reduces. The -archives allows them to pass archives - as arguments that are unzipped/unjarred and a link with name of the - jar/zip are created in the current working directory of tasks. More + and reduces. The option -archives allows them to pass + comma separated list of archives as arguments. These archives are + unarchived and a link with name of the archive is created in + the current working directory of tasks. More details about the command line options are available at - Hadoop Command Guide.

+ Hadoop Commands Guide.

Running wordcount example with - -libjars and -files:
+ -libjars, -files and -archives: +
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt - -libjars mylib.jar input output -

+ -libjars mylib.jar -archives myarchive.zip input output + Here, myarchive.zip will be placed and unzipped into a directory + by the name "myarchive.zip" +

+ +

Users can specify a different symbolic name for + files and archives passed through -files and -archives option, using #. +

+ +

For example, + hadoop jar hadoop-examples.jar wordcount + -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 + -archives mytar.tgz#tgzdir input output + Here, the files dir1/dict.txt and dir2/dict.txt can be accessed by + tasks using the symbolic names dict1 and dict2 respectively. + And the archive mytar.tgz will be placed and unarchived into a + directory by the name tgzdir +
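As an illustrative sketch (class and variable names are hypothetical), a task can open such a file simply by its symbolic name, because the symlink is created in the task's current working directory:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class DictReaderSketch {
    // Reads the dictionary shipped above as -files dir1/dict.txt#dict1.
    public static void readDict() throws IOException {
      BufferedReader dict1 = new BufferedReader(new FileReader("dict1"));
      String entry;
      while ((entry = dict1.readLine()) != null) {
        // use the dictionary entry here
      }
      dict1.close();
    }
  }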

@@ -697,10 +717,10 @@
- Map/Reduce - User Interfaces + MapReduce - User Interfaces

This section provides a reasonable amount of detail on every user-facing - aspect of the Map/Reduce framwork. This should help users implement, + aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial. @@ -739,7 +759,7 @@ to be of the same type as the input records. A given input pair may map to zero or many output pairs.

-

The Hadoop Map/Reduce framework spawns one map task for each +

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

@@ -898,7 +918,7 @@

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * - mapred.tasktracker.reduce.tasks.maximum).

+ mapreduce.tasktracker.reduce.tasks.maximum).

With 0.95 all of the reduces can launch immediately and start transfering map outputs as the maps finish. With @@ -950,7 +970,7 @@ Reporter

- Reporter is a facility for Map/Reduce applications to report + Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
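For illustration, a sketch of a hypothetical mapper (old org.apache.hadoop.mapred API) that uses the Reporter to report progress, set a status message and update a counter:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class ReportingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      reporter.setStatus("processing offset " + key);   // application-level status message
      reporter.incrCounter("MyApp", "RECORDS", 1);      // update an application counter
      reporter.progress();                              // tell the framework the task is alive
      output.collect(value, new LongWritable(1));
    }
  }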

@@ -960,7 +980,7 @@ significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to - set the configuration parameter mapred.task.timeout to a + set the configuration parameter mapreduce.task.timeout to a high-enough value (or even set it to zero for no time-outs).

@@ -973,12 +993,12 @@

OutputCollector is a generalization of the facility provided by - the Map/Reduce framework to collect data output by the + the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).

-

Hadoop Map/Reduce comes bundled with a +

Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

@@ -987,10 +1007,10 @@ Job Configuration

- JobConf represents a Map/Reduce job configuration.

+ JobConf represents a MapReduce job configuration.

JobConf is the primary interface for a user to describe - a Map/Reduce job to the Hadoop framework for execution. The framework + a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf, however:

    @@ -1058,7 +1078,7 @@ -Djava.library.path=<> etc. If the mapred.{map|reduce}.child.java.opts parameters contains the symbol @taskid@ it is interpolated with value of - taskid of the map/reduce task.

    + taskid of the MapReduce task.

    Here is an example with multiple arguments and substitutions, showing jvm GC logging, and start of a passwordless JVM JMX agent so that @@ -1070,7 +1090,7 @@

    <property>
    -   <name>mapred.map.child.java.opts</name>
    +   <name>mapreduce.map.java.opts</name>
      <value>
         -Xmx512M -Djava.library.path=/home/mycompany/lib @@ -1084,7 +1104,7 @@

    <property>
    -   <name>mapred.reduce.child.java.opts</name>
    +   <name>mapreduce.reduce.java.opts</name>
      <value>
         -Xmx1024M -Djava.library.path=/home/mycompany/lib @@ -1109,9 +1129,9 @@

    Note: mapred.{map|reduce}.child.java.opts are used only for configuring the launched child tasks from task tracker. Configuring - the memory options for daemons is documented in - - cluster_setup.html

    + the memory options for daemons is documented under + + Configuring the Environment of the Hadoop Daemons (Cluster Setup).

    The memory available to some parts of the framework is also configurable. In map and reduce tasks, performance may be influenced @@ -1157,26 +1177,28 @@ - + - + - +
    NameTypeDescription
    io.sort.mbint
    mapreduce.task.io.sort.mbint The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes.
    io.sort.record.percentfloat
    mapreduce.map.sort.record.percentfloat The ratio of serialization to accounting space can be adjusted. Each serialized record requires 16 bytes of accounting information in addition to its serialized size to effect the sort. This percentage of space allocated from - io.sort.mb affects the probability of a spill to + mapreduce.task.io.sort.mb affects the + probability of a spill to disk being caused by either exhaustion of the serialization buffer or the accounting space. Clearly, for a map outputting small records, a higher value than the default will likely decrease the number of spills to disk.
    io.sort.spill.percentfloat
    mapreduce.map.sort.spill.percentfloat This is the threshold for the accounting and serialization buffers. When this percentage of either buffer has filled, their contents will be spilled to disk in the background. Let - io.sort.record.percent be r, - io.sort.mb be x, and this value be + mapreduce.map.sort.record.percent be r, + mapreduce.task.io.sort.mb be x, + and this value be q. The maximum number of records collected before the collection thread will spill is r * x * q * 2^16. Note that a higher value may decrease the number of- or even @@ -1216,7 +1238,7 @@ - + - + - + - + - + @@ -3124,7 +3187,7 @@ Highlights

    The second version of WordCount improves upon the - previous one by using some features offered by the Map/Reduce framework: + previous one by using some features offered by the MapReduce framework:

    • Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml Sat Nov 28 20:26:01 2009 @@ -34,39 +34,23 @@ - - - - + + + - - - - - - - - - - + + + + + + - - - - - - - - - - - - - - - - + + + + + @@ -78,19 +62,20 @@ - - - - - + + + + + - - - - - - - + + + + + + + + Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml Sat Nov 28 20:26:01 2009 @@ -30,7 +30,7 @@ Hadoop Streaming

      -Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or +Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. For example:

      @@ -47,7 +47,7 @@ How Streaming Works

      In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. -The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes. +The utility will create a MapReduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

      When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. @@ -63,7 +63,7 @@ prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

      -This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer. +This is the basis for the communication protocol between the MapReduce framework and the streaming mapper/reducer.

      You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to: @@ -161,7 +161,7 @@

      Specifying Other Plugins for Jobs

      -Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job: +Just as with a normal MapReduce job, you can specify other plugins for a streaming job:

      -inputformat JavaClassName @@ -188,7 +188,7 @@
      Generic Command Options -

      Streaming supports streaming command options as well as generic command options. +

      Streaming supports generic command options as well as streaming command options. The general command line syntax is shown below.

      Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.

      @@ -201,7 +201,7 @@
    - +
    NameTypeDescription
    io.sort.factorint
    mapreduce.task.io.sort.factorint Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during the merge. If the number of files @@ -1224,7 +1246,7 @@ Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.
    mapred.inmem.merge.thresholdint
    mapreduce.reduce.merge.inmem.thresholdint The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this is not defining a unit of partition, but @@ -1233,7 +1255,7 @@ less expensive than merging from disk (see notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.
    mapred.job.shuffle.merge.percentfloat
    mapreduce.reduce.shuffle.merge.percentfloat The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of memory allocated to storing map outputs in memory. Since map @@ -1243,14 +1265,14 @@ reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.
    mapred.job.shuffle.input.buffer.percentfloat
    mapreduce.reduce.shuffle.input.buffer.percentfloat The percentage of memory- relative to the maximum heapsize - as typically specified in mapred.reduce.child.java.opts- + as typically specified in mapreduce.reduce.java.opts- that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.
    mapred.job.reduce.input.buffer.percentfloat
    mapreduce.reduce.input.buffer.percentfloat The percentage of memory relative to the maximum heapsize in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until @@ -1275,7 +1297,8 @@ than aggressively increasing buffer sizes.
  • When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are - segments to spill and at least io.sort.factor + segments to spill and at least + mapreduce.task.io.sort.factor segments already on disk, the in-memory map outputs will be part of the intermediate merge.
  • @@ -1285,7 +1308,7 @@
    Directory Structure

    The task tracker has local directory, - ${mapred.local.dir}/taskTracker/ to create localized + ${mapreduce.cluster.local.dir}/taskTracker/ to create localized cache and localized job. It can define multiple local directories (spanning multiple disks) and then each filename is assigned to a semi-random local directory. When the job starts, task tracker @@ -1293,24 +1316,24 @@ specified in the configuration. Thus the task tracker directory structure looks the following:

      -
    • ${mapred.local.dir}/taskTracker/archive/ : +
    • ${mapreduce.cluster.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache. Thus localized distributed cache is shared among all the tasks and jobs
    • -
    • ${mapred.local.dir}/taskTracker/jobcache/$jobid/ : +
    • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/ : The localized job directory
        -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/work/ : The job-specific shared directory. The tasks can use this space as scratch space and share files among them. This directory is exposed to the users through the configuration property - job.local.dir. The directory can accessed through + mapreduce.job.local.dir. The directory can accessed through api JobConf.getJobLocalDir(). It is available as System property also. So, users (streaming etc.) can call - System.getProperty("job.local.dir") to access the + System.getProperty("mapreduce.job.local.dir") to access the directory.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/jars/ +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/jars/ : The jars directory, which has the job jar file and expanded jar. The job.jar is the application's jar file that is automatically distributed to each machine. It is expanded in jars @@ -1319,29 +1342,29 @@ JobConf.getJar() . To access the unjarred directory, JobConf.getJar().getParent() can be called.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/job.xml +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/job.xml : The job.xml file, the generic job configuration, localized for the job.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid : The task directory for each task attempt. Each task directory again has the following structure :
          -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml : A job.xml file, task localized job configuration, Task localization means that properties have been set that are specific to this particular task within the job. The properties localized for each task are described below.
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/output +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/output : A directory for intermediate output files. This contains the temporary map reduce data generated by the framework such as map output files etc.
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work +
• ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/work : The current working directory of the task. With jvm reuse enabled for tasks, this directory will be the directory on which the jvm has started
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp : The temporary directory for the task. - (User can specify the property mapred.child.tmp to set + (User can specify the property mapreduce.task.tmp.dir to set the value of temporary directory for map and reduce tasks. This defaults to ./tmp. If the value is not an absolute path, it is prepended with task's working directory. Otherwise, it is @@ -1350,7 +1373,7 @@ -Djava.io.tmpdir='the absolute path of the tmp dir'. Anp pipes and streaming are set with environment variable, TMPDIR='the absolute path of the tmp dir'). This - directory is created, if mapred.child.tmp has the value + directory is created, if mapreduce.task.tmp.dir has the value ./tmp
      • @@ -1362,7 +1385,7 @@
        Task JVM Reuse

        Jobs can enable task JVMs to be reused by specifying the job - configuration mapred.job.reuse.jvm.num.tasks. If the + configuration mapreduce.job.jvm.numtasks. If the value is 1 (the default), then JVMs are not reused (i.e. 1 task per JVM). If it is -1, there is no limit to the number of tasks a JVM can run (of the same job). One can also specify some @@ -1377,26 +1400,26 @@ for each task's execution:

        - - + + - + - + - + - + - + - + - + - + - +
        NameTypeDescription
        mapred.job.idStringThe job id
        mapred.jarString
        mapreduce.job.idStringThe job id
        mapreduce.job.jarString job.jar location in job directory
        job.local.dir String
        mapreduce.job.local.dir String The job specific shared scratch space
        mapred.tip.id String
        mapreduce.task.id String The task id
        mapred.task.id String
        mapreduce.task.attempt.id String The task attempt id
        mapred.task.is.map boolean
        mapreduce.task.ismap boolean Is this a map task
        mapred.task.partition int
        mapreduce.task.partition int The id of the task within the job
        map.input.file String
        mapreduce.map.input.file String The filename that the map is reading from
        map.input.start long
        mapreduce.map.input.start long The offset of the start of the map input split
        map.input.length long
        mapreduce.map.input.length long The number of bytes in the map input split
        mapred.work.output.dir String
        mapreduce.task.output.dir String The task's temporary output directory
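As an illustration (a sketch only; conf is assumed to be the JobConf passed to the task's configure() method), the localized values listed in the table above can be read straight from the job configuration:

  // Inside Mapper/Reducer configure(JobConf conf):
  String jobId     = conf.get("mapreduce.job.id");
  String attemptId = conf.get("mapreduce.task.attempt.id");
  boolean isMap    = conf.getBoolean("mapreduce.task.ismap", true);
  int partition    = conf.getInt("mapreduce.task.partition", -1);
  String inputFile = conf.get("mapreduce.map.input.file");   // map tasks only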
@@ -1404,7 +1427,7 @@ Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). - For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. + For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.

        @@ -1428,9 +1451,9 @@ System.loadLibrary or System.load. More details on how to load shared libraries through - distributed cache are documented at - - native_libraries.html

        + distributed cache are documented under + + Building Native Hadoop Libraries.

    @@ -1442,7 +1465,7 @@ with the JobTracker.

    JobClient provides facilities to submit jobs, track their - progress, access component-tasks' reports and logs, get the Map/Reduce + progress, access component-tasks' reports and logs, get the MapReduce cluster's status information and so on.
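For illustration, a minimal submission sketch (assuming conf is a fully configured JobConf) showing the blocking and non-blocking styles behind the steps listed below:

  // JobClient and RunningJob are in org.apache.hadoop.mapred.
  // Blocking: submit the job and wait for it to finish, reporting progress.
  RunningJob completed = JobClient.runJob(conf);

  // Non-blocking: submit the job, then poll the returned handle.
  JobClient client = new JobClient(conf);
  RunningJob handle = client.submitJob(conf);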

    The job submission process involves:

    @@ -1454,7 +1477,7 @@ DistributedCache of the job, if necessary.
  • - Copying the job's jar and configuration to the Map/Reduce system + Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
  • @@ -1462,23 +1485,16 @@ monitoring it's status.
  • -

    Job history files are also logged to user specified directory - hadoop.job.history.user.location - which defaults to job output directory. The files are stored in - "_logs/history/" in the specified directory. Hence, by default they - will be in mapred.output.dir/_logs/history. User can stop - logging by giving the value none for - hadoop.job.history.user.location

    -

    User can view the history logs summary in specified directory +

    User can view the history log summary for a given history file using the following command
    - $ bin/hadoop job -history output-dir
    + $ bin/hadoop job -history history-file
    This command will print job details, failed and killed tip details.
    More details about the job such as successful tasks and task attempts made for each task can be viewed using the following command
    - $ bin/hadoop job -history all output-dir

    + $ bin/hadoop job -history all history-file

    User can use OutputLogFilter @@ -1491,8 +1507,8 @@

    Job Control -

    Users may need to chain Map/Reduce jobs to accomplish complex - tasks which cannot be done via a single Map/Reduce job. This is fairly +

    Users may need to chain MapReduce jobs to accomplish complex + tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.
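A sketch of this pattern (paths and the driver class are hypothetical): run the first job to completion, then point the second job's input at the first job's output:

  // Classes are from org.apache.hadoop.mapred and org.apache.hadoop.fs.
  JobConf first = new JobConf(ChainDriver.class);
  FileInputFormat.setInputPaths(first, new Path("in"));
  FileOutputFormat.setOutputPath(first, new Path("intermediate"));
  JobClient.runJob(first);                         // blocks until the first job completes

  JobConf second = new JobConf(ChainDriver.class);
  FileInputFormat.setInputPaths(second, new Path("intermediate"));
  FileOutputFormat.setOutputPath(second, new Path("out"));
  JobClient.runJob(second);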

    @@ -1526,10 +1542,10 @@ Job Input

    - InputFormat describes the input-specification for a Map/Reduce job. + InputFormat describes the input-specification for a MapReduce job.

    -

    The Map/Reduce framework relies on the InputFormat of +

    The MapReduce framework relies on the InputFormat of the job to:

    1. Validate the input-specification of the job.
    2. @@ -1552,7 +1568,7 @@ InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound - on the split size can be set via mapred.min.split.size.

      + on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

      Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, @@ -1584,7 +1600,7 @@

      FileSplit is the default InputSplit. It sets - map.input.file to the path of the input file for the + mapreduce.map.input.file to the path of the input file for the logical split.

    @@ -1608,10 +1624,10 @@ Job Output

    - OutputFormat describes the output-specification for a Map/Reduce + OutputFormat describes the output-specification for a MapReduce job.

    -

    The Map/Reduce framework relies on the OutputFormat of +

    The MapReduce framework relies on the OutputFormat of the job to:

    1. @@ -1652,9 +1668,9 @@

      OutputCommitter describes the commit of task output for a - Map/Reduce job.

      + MapReduce job.

      -

      The Map/Reduce framework relies on the OutputCommitter +

      The MapReduce framework relies on the OutputCommitter of the job to:

      1. @@ -1712,34 +1728,34 @@ (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

        -

        To avoid these issues the Map/Reduce framework, when the +

        To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special - ${mapred.output.dir}/_temporary/_${taskid} sub-directory - accessible via ${mapred.work.output.dir} + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory + accessible via ${mapreduce.task.output.dir} for each task-attempt on the FileSystem where the output of the task-attempt is stored. On successful completion of the task-attempt, the files in the - ${mapred.output.dir}/_temporary/_${taskid} (only) - are promoted to ${mapred.output.dir}. Of course, + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) + are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This process is completely transparent to the application.

        The application-writer can take advantage of this feature by - creating any side-files required in ${mapred.work.output.dir} + creating any side-files required in ${mapreduce.task.output.dir} during execution of a task via FileOutputFormat.getWorkOutputPath(), and the framework will promote them similarly for succesful task-attempts, thus eliminating the need to pick unique paths per task-attempt.

        -

        Note: The value of ${mapred.work.output.dir} during +

        Note: The value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually - ${mapred.output.dir}/_temporary/_{$taskid}, and this value is - set by the Map/Reduce framework. So, just create any side-files in the + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_{$taskid}, and this value is + set by the MapReduce framework. So, just create any side-files in the path returned by - FileOutputFormat.getWorkOutputPath() from Map/Reduce + FileOutputFormat.getWorkOutputPath() from MapReduce task to take advantage of this feature.

        The entire discussion holds true for maps of jobs with @@ -1778,7 +1794,7 @@ support multiple queues.

        A job defines the queue it needs to be submitted to through the - mapred.job.queue.name property, or through the + mapreduce.job.queuename property, or through the setQueueName(String) API. Setting the queue name is optional. If a job is submitted without an associated queue name, it is submitted to the 'default' @@ -1788,7 +1804,7 @@ Counters

        Counters represent global counters, defined either by - the Map/Reduce framework or applications. Each Counter can + the MapReduce framework or applications. Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group.
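An illustrative sketch (the enum is hypothetical) of an application-defined Counter of an Enum type, updated from a task through the Reporter:

  // Application-defined counter group; any Enum type works.
  enum WordStats { TOTAL_WORDS, SKIPPED_WORDS }

  // Inside map() or reduce(), with the Reporter passed in by the framework:
  reporter.incrCounter(WordStats.TOTAL_WORDS, 1);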

        @@ -1812,7 +1828,7 @@ files efficiently.

        DistributedCache is a facility provided by the - Map/Reduce framework to cache files (text, archives, jars and so on) + MapReduce framework to cache files (text, archives, jars and so on) needed by applications.

        Applications specify the files to be cached via urls (hdfs://) @@ -1858,7 +1874,7 @@ directory of the task via the DistributedCache.createSymlink(Configuration) api. Or by setting - the configuration property mapred.create.symlink + the configuration property mapreduce.job.cache.symlink.create as yes. The DistributedCache will use the fragment of the URI as the name of the symlink. For example, the URI @@ -1877,10 +1893,56 @@ can be used to cache files/jars and also add them to the classpath of child-jvm. The same can be done by setting the configuration properties - mapred.job.classpath.{files|archives}. Similarly the + mapreduce.job.classpath.{files|archives}. Similarly the cached files that are symlinked into the working directory of the task can be used to distribute native libraries and load them.

        +

        The DistributedCache tracks modification timestamps + of the cache files/archives. Clearly the cache files/archives should + not be modified by the application or externally + while the job is executing.

        + +

        Here is an illustrative example on how to use the + DistributedCache:
        + // Setting up the cache for the application + 1. Copy the requisite files to the FileSystem:
        + $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
        + $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
        + $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
        + $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
        + $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
        + $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
        + 2. Setup the job
        + Job job = new Job(conf);
        + job.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"));
        + job.addCacheArchive(new URI("/myapp/map.zip"));
        + job.addFileToClassPath(new Path("/myapp/mylib.jar"));
        + job.addCacheArchive(new URI("/myapp/mytar.tar"));
        + job.addCacheArchive(new URI("/myapp/mytgz.tgz"));
        + job.addCacheArchive(new URI("/myapp/mytargz.tar.gz"));
        + + 3. Use the cached files in the + {@link org.apache.hadoop.mapreduce.Mapper} + or {@link org.apache.hadoop.mapreduce.Reducer}:
        + + public static class MapClass extends Mapper<K, V, K, V> {
        + private Path[] localArchives;
        + private Path[] localFiles;
        + public void setup(Context context) {
        + // Get the cached archives/files
        + localArchives = context.getLocalCacheArchives();
        + localFiles = context.getLocalCacheFiles();
        + }
        + + public void map(K key, V value, + Context context) throws IOException {
        + // Use data from the cached archives/files here
        + // ...
        + // ...
        + context.write(k, v);
        + }
        + }

        +
        @@ -1890,7 +1952,7 @@ interface supports the handling of generic Hadoop command-line options.

        -

        Tool is the standard for any Map/Reduce tool or +

        Tool is the standard for any MapReduce tool or application. The application should delegate the handling of standard command-line options to @@ -1923,7 +1985,7 @@ IsolationRunner

        - IsolationRunner is a utility to help debug Map/Reduce programs.

        + IsolationRunner is a utility to help debug MapReduce programs.

        To use the IsolationRunner, first set keep.failed.tasks.files to true @@ -1950,7 +2012,7 @@

        User can specify whether the system should collect profiler information for some of the tasks in the job by setting the - configuration property mapred.task.profile. The + configuration property mapreduce.task.profile. The value can be set using the api JobConf.setProfileEnabled(boolean). If the value is set @@ -1960,15 +2022,15 @@

        Once user configures that profiling is needed, she/he can use the configuration property - mapred.task.profile.{maps|reduces} to set the ranges - of Map/Reduce tasks to profile. The value can be set using the api + mapreduce.task.profile.{maps|reduces} to set the ranges + of MapReduce tasks to profile. The value can be set using the api JobConf.setProfileTaskRange(boolean,String). By default, the specified range is 0-2.
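For example, a sketch (the range is illustrative) of enabling profiling on a JobConf for the first few task attempts:

  conf.setProfileEnabled(true);             // same effect as mapreduce.task.profile=true
  conf.setProfileTaskRange(true, "0-2");    // profile map tasks 0-2
  conf.setProfileTaskRange(false, "0-2");   // profile reduce tasks 0-2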

        User can also specify the profiler configuration arguments by setting the configuration property - mapred.task.profile.params. The value can be specified + mapreduce.task.profile.params. The value can be specified using the api JobConf.setProfileParams(String). If the string contains a @@ -1982,8 +2044,8 @@

        Debugging -

        The Map/Reduce framework provides a facility to run user-provided - scripts for debugging. When a Map/Reduce task fails, a user can run +

        The MapReduce framework provides a facility to run user-provided + scripts for debugging. When a MapReduce task fails, a user can run a debug script, to process task logs for example. The script is given access to the task's stdout and stderr outputs, syslog and jobconf. The output from the debug script's stdout and stderr is @@ -2003,8 +2065,8 @@

        How to submit the script:

        A quick way to submit the debug script is to set values for the - properties mapred.map.task.debug.script and - mapred.reduce.task.debug.script, for debugging map and + properties mapreduce.map.debug.script and + mapreduce.reduce.debug.script, for debugging map and reduce tasks respectively. These properties can also be set by using APIs JobConf.setMapDebugScript(String) and @@ -2016,7 +2078,7 @@

        The arguments to the script are the task's stdout, stderr, syslog and jobconf files. The debug command, run on the node where - the Map/Reduce task failed, is:
        + the MapReduce task failed, is:
        $script $stdout $stderr $syslog $jobconf

        Pipes programs have the c++ program name as a fifth argument @@ -2036,14 +2098,14 @@ JobControl

        - JobControl is a utility which encapsulates a set of Map/Reduce jobs + JobControl is a utility which encapsulates a set of MapReduce jobs and their dependencies.

        Data Compression -

        Hadoop Map/Reduce provides facilities for the application-writer to +

        Hadoop MapReduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the job-outputs i.e. output of the reduces. It also comes bundled with @@ -2052,10 +2114,11 @@ algorithm. The gzip file format is also supported.
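As an illustration, a minimal sketch (assuming conf is the job's JobConf) of enabling compression for both the intermediate map outputs and the final job output with the bundled gzip codec:

  // GzipCodec is org.apache.hadoop.io.compress.GzipCodec;
  // FileOutputFormat is org.apache.hadoop.mapred.FileOutputFormat.
  conf.setCompressMapOutput(true);                           // compress intermediate map outputs
  conf.setMapOutputCompressorClass(GzipCodec.class);
  FileOutputFormat.setCompressOutput(conf, true);            // compress the job's final output
  FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);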

        -

        Hadoop also provides native implementations of the above compression +

        Hadoop also provides native implementations of the above compression codecs for reasons of both performance (zlib) and non-availability of - Java libraries. More details on their usage and availability are - available here.

        + Java libraries. For more information see the + Native Libraries Guide.

        +
        Intermediate Outputs @@ -2172,13 +2235,13 @@ Example: WordCount v2.0

        Here is a more complete WordCount which uses many of the - features provided by the Map/Reduce framework we discussed so far.

        + features provided by the MapReduce framework we discussed so far.

        -

        This needs the HDFS to be up and running, especially for the +

        This example needs the HDFS to be up and running, especially for the DistributedCache-related features. Hence it only works with a - pseudo-distributed or - fully-distributed - Hadoop installation.

        + pseudo-distributed (Single Node Setup) + or fully-distributed (Cluster Setup) + Hadoop installation.

        Source Code @@ -2367,7 +2430,7 @@
    30.        - inputFile = job.get("map.input.file"); + inputFile = job.get("mapreduce.map.input.file");
    -D property=value Optional Use value for given property
    -fs host:port or local Optional Specify a namenode
    -jt host:port or local Optional Specify a job tracker
    -files Optional Specify comma-separated files to be copied to the Map/Reduce cluster
    -files Optional Specify comma-separated files to be copied to the MapReduce cluster
    -libjars Optional Specify comma-separated jar files to include in the classpath
    -archives Optional Specify comma-separated archives to be unarchived on the compute machines
    @@ -223,9 +223,9 @@ To specify additional local temp directories use:

    - -D mapred.local.dir=/tmp/local - -D mapred.system.dir=/tmp/system - -D mapred.temp.dir=/tmp/temp + -D mapreduce.cluster.local.dir=/tmp/local + -D mapreduce.jobtracker.system.dir=/tmp/system + -D mapreduce.cluster.temp.dir=/tmp/temp

    Note: For more details on jobconf parameters see: mapred-default.html

    @@ -234,14 +234,14 @@
    Specifying Map-Only Jobs

    -Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. -The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job. +Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. +The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

    - -D mapred.reduce.tasks=0 + -D mapreduce.job.reduces=0

    -To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0". +To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapreduce.job.reduces=0".

    @@ -252,7 +252,7 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.reduce.tasks=2 \ + -D mapreduce.job.reduces=2 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ @@ -263,7 +263,7 @@
    Customizing How Lines are Split into Key/Value Pairs

    -As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. +As noted earlier, when the MapReduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.

    @@ -290,7 +290,7 @@ Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.

    -

    Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for Map/Reduce +

    Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for MapReduce inputs. By default the separator is the tab character.

    @@ -306,8 +306,7 @@

    Note: The -files and -archives options are generic options. Be sure to place the generic options before the command options, otherwise the command will fail. -For an example, see The -archives Option. -Also see Other Supported Options. +For an example, see Making Archives Available to Tasks.

    @@ -323,6 +322,10 @@ -files hdfs://host:fs_port/user/testfile.txt +

    User can specify a different symlink name for -files using #.

    + +-files hdfs://host:fs_port/user/testfile.txt#testfile +

    Multiple entries can be specified like this:

    @@ -343,6 +346,10 @@ -archives hdfs://host:fs_port/user/testfile.jar +

    User can specify a different symlink name for -archives using #.

    + +-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir +

    In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. @@ -351,9 +358,9 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \ - -D mapred.map.tasks=1 \ - -D mapred.reduce.tasks=1 \ - -D mapred.job.name="Experiment" \ + -D mapreduce.job.maps=1 \ + -D mapreduce.job.reduces=1 \ + -D mapreduce.job.name="Experiment" \ -input "/user/me/samples/cachefile/input.txt" \ -output "/user/me/samples/cachefile/out" \ -mapper "xargs cat" \ @@ -401,7 +408,7 @@

    Hadoop has a library class, KeyFieldBasedPartitioner, -that is useful for many applications. This class allows the Map/Reduce +that is useful for many applications. This class allows the MapReduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:

    @@ -409,9 +416,9 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=4 \ - -D map.output.key.field.separator=. \ - -D mapred.text.key.partitioner.options=-k1,2 \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.map.output.key.field.separator=. \ + -D mapreduce.partition.keypartitioner.options=-k1,2 \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ @@ -421,11 +428,11 @@

    Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in previous example. The two variables are used by streaming to identify the key/value pair of mapper.

    -The map output keys of the above Map/Reduce job normally have four fields -separated by ".". However, the Map/Reduce framework will partition the map +The map output keys of the above MapReduce job normally have four fields +separated by ".". However, the MapReduce framework will partition the map outputs by the first two fields of the keys using the --D mapred.text.key.partitioner.options=-k1,2 option. -Here, -D map.output.key.field.separator=. specifies the separator +-D mapreduce.partition.keypartitioner.options=-k1,2 option. +Here, -D mapreduce.map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

    @@ -470,22 +477,22 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \ + -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \ -D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=4 \ - -D map.output.key.field.separator=. \ - -D mapred.text.key.comparator.options=-k2,2nr \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.map.output.key.field.separator=. \ + -D mapreduce.partition.keycomparator.options=-k2,2nr \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer org.apache.hadoop.mapred.lib.IdentityReducer

    -The map output keys of the above Map/Reduce job normally have four fields -separated by ".". However, the Map/Reduce framework will sort the +The map output keys of the above MapReduce job normally have four fields +separated by ".". However, the MapReduce framework will sort the outputs by the second field of the keys using the --D mapred.text.key.comparator.options=-k2,2nr option. +-D mapreduce.partition.keycomparator.options=-k2,2nr option. Here, -n specifies that the sorting is numerical sorting and -r specifies that the result should be reversed. A simple illustration is shown below: @@ -526,7 +533,7 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper myAggregatorForKeyCount.py \ @@ -571,11 +578,11 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D map.output.key.field.separa=. \ - -D mapred.text.key.partitioner.options=-k1,2 \ - -D mapred.data.field.separator=. \ - -D map.output.key.value.fields.spec=6,5,1-3:0- \ - -D reduce.output.key.value.fields.spec=0-2:5- \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.partition.keypartitioner.options=-k1,2 \ + -D mapreduce.fieldsel.data.field.separator=. \ + -D mapreduce.fieldsel.mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \ + -D mapreduce.fieldsel.mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \ @@ -584,13 +591,13 @@

    -The option "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. +The option "-D mapreduce.fieldsel.mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. Key selection spec and value selection spec are separated by ":". In this case, the map output key will consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all fields (0- means field 0 and all the subsequent fields).

    -The option "-D reduce.output.key.value.fields.spec=0-2:5-" specifies +The option "-D mapreduce.fieldsel.mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5-" specifies key/value selection for the reduce outputs. In this case, the reduce output key will consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1). The reduce output value will consist of all fields starting @@ -653,7 +660,7 @@

    How many reducers should I use?

    -See the Hadoop Wiki for details: Reducer +For details see Reducer.

    @@ -676,7 +683,7 @@ dan 75 $ c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.job.name='Experiment' + -D mapreduce.job.name='Experiment' -input /user/me/samples/student_marks -output /user/me/samples/student_out -mapper \"$c2\" -reducer 'cat' @@ -735,7 +742,7 @@
    How do I generate output files with gzip format?

    -Instead of plain text files, you can generate gzip files as your generated output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as option to your streaming job. +Instead of plain text files, you can generate gzip files as your generated output. Pass '-D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as option to your streaming job.

    @@ -790,9 +797,9 @@
    How do I get the JobConf variables in a streaming job's mapper/reducer?

-See Configured Parameters. +See the Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). -For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores. +For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. In your code, use the parameter names with the underscores.

    Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml Sat Nov 28 20:26:01 2009 @@ -30,8 +30,8 @@ directory (ends in '/'), in which case /index.html will be added --> - - - + + + Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml Sat Nov 28 20:26:01 2009 @@ -29,8 +29,8 @@
    Purpose -

    This document describes various user-facing facets of Hadoop Vaidya, a performance diagnostic tool for map/reduce jobs. It - describes how to execute a default set of rules against your map/reduce job counters and +

    This document describes various user-facing facets of Hadoop Vaidya, a performance diagnostic tool for MapReduce jobs. It + describes how to execute a default set of rules against your MapReduce job counters and how to write and execute new rules to detect specific performance problems.

    A few sample test rules are provided with the tool with the objective of growing the rules database over the time. @@ -41,7 +41,7 @@

    - Pre-requisites + Prerequisites

    Ensure that Hadoop is installed and configured. More details:

      @@ -59,11 +59,11 @@

      Hadoop Vaidya (Vaidya in Sanskrit language means "one who knows", or "a physician") is a rule based performance diagnostic tool for - Map/Reduce jobs. It performs a post execution analysis of map/reduce + MapReduce jobs. It performs a post execution analysis of MapReduce job by parsing and collecting execution statistics through job history and job configuration files. It runs a set of predefined tests/rules against job execution statistics to diagnose various performance problems. - Each test rule detects a specific performance problem with the Map/Reduce job and provides + Each test rule detects a specific performance problem with the MapReduce job and provides a targeted advice to the user. This tool generates an XML report based on the evaluation results of individual test rules.

      @@ -75,9 +75,9 @@

      This section describes main concepts and terminology involved with Hadoop Vaidya,

        -
      • PostExPerformanceDiagnoser: This class extends the base Diagnoser class and acts as a driver for post execution performance analysis of Map/Reduce Jobs. +
      • PostExPerformanceDiagnoser: This class extends the base Diagnoser class and acts as a driver for post execution performance analysis of MapReduce Jobs. It detects performance inefficiencies by executing a set of performance diagnosis rules against the job execution statistics.
      • -
      • Job Statistics: This includes the job configuration information (job.xml) and various counters logged by Map/Reduce job as a part of the job history log +
      • Job Statistics: This includes the job configuration information (job.xml) and various counters logged by MapReduce job as a part of the job history log file. The counters are parsed and collected into the Job Statistics data structures, which contains global job level aggregate counters and a set of counters for each Map and Reduce task.
      • Diagnostic Test/Rule: This is a program logic that detects the inefficiency of M/R job based on the job statistics. The @@ -139,10 +139,10 @@
    - How to Write and Execute your own Tests + How to Write and Execute Your Own Tests

    Writing and executing your own test rules is not very hard. You can take a look at Hadoop Vaidya source code for existing set of tests. - The source code is at this hadoop svn repository location - . The default set of tests are under "postexdiagnosis/tests/" folder.

    + The source code is at this hadoop svn repository location. + The default set of tests are under "postexdiagnosis/tests/" folder.

    • Writing a test class for your new test case should extend the org.apache.hadoop.vaidya.DiagnosticTest class and it should override following three methods from the base class, Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml Sat Nov 28 20:26:01 2009 @@ -68,7 +68,7 @@ Hadoop Scalable Computing Platform http://hadoop.apache.org/core/ - images/core-logo.gif + images/mapreduce-logo.jpg Hadoop @@ -146,13 +146,13 @@ #content h1 { margin-bottom: .5em; - font-size: 200%; color: black; + font-size: 185%; color: black; font-family: arial; } - h2, .h3 { font-size: 195%; color: black; font-family: arial; } - h3, .h4 { font-size: 140%; color: black; font-family: arial; margin-bottom: 0.5em; } + h2, .h3 { font-size: 175%; color: black; font-family: arial; } + h3, .h4 { font-size: 135%; color: black; font-family: arial; margin-bottom: 0.5em; } h4, .h5 { font-size: 125%; color: black; font-style: italic; font-weight: bold; font-family: arial; } - h5, h6 { font-size: 110%; color: #363636; font-weight: bold; } + h5, h6 { font-size: 110%; color: #363636; font-weight: bold; } pre.code { Propchange: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/ ------------------------------------------------------------------------------ --- svn:mergeinfo (original) +++ svn:mergeinfo Sat Nov 28 20:26:01 2009 @@ -1,3 +1,3 @@ /hadoop/core/branches/branch-0.19/mapred/src/examples:713112 /hadoop/core/trunk/src/examples:776175-784663 -/hadoop/mapreduce/trunk/src/examples:804974-807678 +/hadoop/mapreduce/trunk/src/examples:804974-884916 Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java Sat Nov 28 20:26:01 2009 @@ -68,7 +68,8 @@ public static final String DESCRIPTION = "A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi."; - private static final String NAME = BaileyBorweinPlouffe.class.getSimpleName(); + private static final String NAME = "mapreduce." 
+ + BaileyBorweinPlouffe.class.getSimpleName(); //custom job properties private static final String WORKING_DIR_PROPERTY = NAME + ".dir"; @@ -327,11 +328,11 @@ job.setInputFormatClass(BbpInputFormat.class); // disable task timeout - jobconf.setLong("mapred.task.timeout", 0); + jobconf.setLong(JobContext.TASK_TIMEOUT, 0); // do not use speculative execution - jobconf.setBoolean("mapred.map.tasks.speculative.execution", false); - jobconf.setBoolean("mapred.reduce.tasks.speculative.execution", false); + jobconf.setBoolean(JobContext.MAP_SPECULATIVE, false); + jobconf.setBoolean(JobContext.REDUCE_SPECULATIVE, false); return job; } Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java Sat Nov 28 20:26:01 2009 @@ -59,14 +59,12 @@ pgd.addClass("secondarysort", SecondarySort.class, "An example defining a secondary sort to the reduce."); pgd.addClass("sudoku", Sudoku.class, "A sudoku solver."); - pgd.addClass("sleep", SleepJob.class, "A job that sleeps at each map and reduce task."); pgd.addClass("join", Join.class, "A job that effects a join over sorted, equally partitioned datasets"); pgd.addClass("multifilewc", MultiFileWordCount.class, "A job that counts words from several files."); pgd.addClass("dbcount", DBCountPageView.class, "An example job that count the pageview counts from a database."); pgd.addClass("teragen", TeraGen.class, "Generate data for the terasort"); pgd.addClass("terasort", TeraSort.class, "Run the terasort"); pgd.addClass("teravalidate", TeraValidate.class, "Checking results of terasort"); - pgd.addClass("fail", FailJob.class, "a job that always fails"); exitCode = pgd.driver(argv); } catch(Throwable e){ Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java Sat Nov 28 20:26:01 2009 @@ -52,9 +52,9 @@ Integer.toString(new Random().nextInt(Integer.MAX_VALUE))); Configuration conf = getConf(); - conf.set("mapred.mapper.regex", args[2]); + conf.set(RegexMapper.PATTERN, args[2]); if (args.length == 4) - conf.set("mapred.mapper.regex.group", args[3]); + conf.set(RegexMapper.GROUP, args[3]); Job grepJob = new Job(conf); Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java (original) +++ 
hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java Sat Nov 28 20:26:01 2009 @@ -52,7 +52,7 @@ * [in-dir]* in-dir out-dir */ public class Join extends Configured implements Tool { - + public static String REDUCES_PER_HOST = "mapreduce.join.reduces_per_host"; static int printUsage() { System.out.println("join [-r ] " + "[-inFormat ] " + @@ -77,7 +77,7 @@ JobClient client = new JobClient(conf); ClusterStatus cluster = client.getClusterStatus(); int num_reduces = (int) (cluster.getMaxReduceTasks() * 0.9); - String join_reduces = conf.get("mapreduce.join.reduces_per_host"); + String join_reduces = conf.get(REDUCES_PER_HOST); if (join_reduces != null) { num_reduces = cluster.getTaskTrackers() * Integer.parseInt(join_reduces);