From: stevel@apache.org
To: mapreduce-commits@hadoop.apache.org
Subject: svn commit: r885145 [12/34] - in /hadoop/mapreduce/branches/MAPREDUCE-233: ./ .eclipse.templates/ .eclipse.templates/.launches/ conf/ ivy/ lib/ src/benchmarks/gridmix/ src/benchmarks/gridmix/pipesort/ src/benchmarks/gridmix2/ src/benchmarks/gridmix2/sr...
Date: Sat, 28 Nov 2009 20:26:22 -0000
Message-Id: <20091128202636.ED4282388A74@eris.apache.org>

Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml?rev=885145&r1=885144&r2=885145&view=diff
==============================================================================
--- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml (original)
+++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml Sat Nov 28 20:26:01 2009
@@ -21,7 +21,7 @@
- Map/Reduce Tutorial + MapReduce Tutorial
@@ -30,22 +30,21 @@ Purpose

This document comprehensively describes all user-facing facets of the - Hadoop Map/Reduce framework and serves as a tutorial. + Hadoop MapReduce framework and serves as a tutorial.

- Pre-requisites + Prerequisites -

Ensure that Hadoop is installed, configured and is running. More - details:

+

Make sure Hadoop is installed, configured and running. See these guides: +

@@ -53,12 +52,12 @@
Overview -

Hadoop Map/Reduce is a software framework for easily writing +

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

-

A Map/Reduce job usually splits the input data-set into +

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the @@ -67,13 +66,14 @@ tasks.

Typically the compute nodes and the storage nodes are the same, that is, - the Map/Reduce framework and the Hadoop Distributed File System (see HDFS Architecture ) + the MapReduce framework and the + Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

-

The Map/Reduce framework consists of a single master +

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The @@ -90,7 +90,7 @@ information to the job-client.

Although the Hadoop framework is implemented in JavaTM, - Map/Reduce applications need not be written in Java.

+ MapReduce applications need not be written in Java.

@@ -110,7 +110,7 @@
Inputs and Outputs -

The Map/Reduce framework operates exclusively on +

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of @@ -124,7 +124,7 @@ WritableComparable interface to facilitate sorting by the framework.
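For illustration only, a minimal sketch (the class name is hypothetical) of declaring these key and value classes on the job configuration, using the org.apache.hadoop.mapred API that the rest of this tutorial uses:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;

  public class TypeSetupSketch {
    // Declares the intermediate (<k2, v2>) and final (<k3, v3>) key/value classes.
    public static JobConf newConf() {
      JobConf conf = new JobConf(TypeSetupSketch.class);
      conf.setInputFormat(TextInputFormat.class);      // <k1, v1> is <LongWritable, Text>
      conf.setMapOutputKeyClass(Text.class);           // k2 - must implement WritableComparable
      conf.setMapOutputValueClass(IntWritable.class);  // v2 - must implement Writable
      conf.setOutputKeyClass(Text.class);              // k3
      conf.setOutputValueClass(IntWritable.class);     // v3
      return conf;
    }
  }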

-

Input and Output types of a Map/Reduce job:

+

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output) @@ -145,14 +145,16 @@

Example: WordCount v1.0 -

Before we jump into the details, lets walk through an example Map/Reduce +

Before we jump into the details, lets walk through an example MapReduce application to get a flavour for how they work.

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

-

This works with a local-standalone, pseudo-distributed or fully-distributed - Hadoop installation(see Hadoop Quick Start).

+

This example works with a + pseudo-distributed (Single Node Setup) + or fully-distributed (Cluster Setup) + Hadoop installation.

Source Code @@ -605,17 +607,35 @@ would be present in the current working directory of the task using the option -files. The -libjars option allows applications to add jars to the classpaths of the maps - and reduces. The -archives allows them to pass archives - as arguments that are unzipped/unjarred and a link with name of the - jar/zip are created in the current working directory of tasks. More + and reduces. The option -archives allows them to pass + comma separated list of archives as arguments. These archives are + unarchived and a link with name of the archive is created in + the current working directory of tasks. More details about the command line options are available at - Hadoop Command Guide.

+ Hadoop Commands Guide.

Running wordcount example with - -libjars and -files:
+ -libjars, -files and -archives: +
hadoop jar hadoop-examples.jar wordcount -files cachefile.txt - -libjars mylib.jar input output -

+ -libjars mylib.jar -archives myarchive.zip input output + Here, myarchive.zip will be placed and unzipped into a directory + by the name "myarchive.zip" +

+ +

Users can specify a different symbolic name for + files and archives passed through -files and -archives option, using #. +

+ +

For example, + hadoop jar hadoop-examples.jar wordcount + -files dir1/dict.txt#dict1,dir2/dict.txt#dict2 + -archives mytar.tgz#tgzdir input output + Here, the files dir1/dict.txt and dir2/dict.txt can be accessed by + tasks using the symbolic names dict1 and dict2 respectively. + And the archive mytar.tgz will be placed and unarchived into a + directory by the name tgzdir +
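As an illustrative sketch (class and variable names are hypothetical), a task can open such a file simply by its symbolic name, because the symlink is created in the task's current working directory:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class DictReaderSketch {
    // Reads the dictionary shipped above as -files dir1/dict.txt#dict1.
    public static void readDict() throws IOException {
      BufferedReader dict1 = new BufferedReader(new FileReader("dict1"));
      String entry;
      while ((entry = dict1.readLine()) != null) {
        // use the dictionary entry here
      }
      dict1.close();
    }
  }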

@@ -697,10 +717,10 @@
- Map/Reduce - User Interfaces + MapReduce - User Interfaces

This section provides a reasonable amount of detail on every user-facing - aspect of the Map/Reduce framwork. This should help users implement, + aspect of the MapReduce framework. This should help users implement, configure and tune their jobs in a fine-grained manner. However, please note that the javadoc for each class/interface remains the most comprehensive documentation available; this is only meant to be a tutorial. @@ -739,7 +759,7 @@ to be of the same type as the input records. A given input pair may map to zero or many output pairs.

-

The Hadoop Map/Reduce framework spawns one map task for each +

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

@@ -898,7 +918,7 @@

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * - mapred.tasktracker.reduce.tasks.maximum).

+ mapreduce.tasktracker.reduce.tasks.maximum).

With 0.95 all of the reduces can launch immediately and start transfering map outputs as the maps finish. With @@ -950,7 +970,7 @@ Reporter

- Reporter is a facility for Map/Reduce applications to report + Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
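For illustration, a sketch of a hypothetical mapper (old org.apache.hadoop.mapred API) that uses the Reporter to report progress, set a status message and update a counter:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class ReportingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      reporter.setStatus("processing offset " + key);   // application-level status message
      reporter.incrCounter("MyApp", "RECORDS", 1);      // update an application counter
      reporter.progress();                              // tell the framework the task is alive
      output.collect(value, new LongWritable(1));
    }
  }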

@@ -960,7 +980,7 @@ significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to - set the configuration parameter mapred.task.timeout to a + set the configuration parameter mapreduce.task.timeout to a high-enough value (or even set it to zero for no time-outs).

@@ -973,12 +993,12 @@

OutputCollector is a generalization of the facility provided by - the Map/Reduce framework to collect data output by the + the MapReduce framework to collect data output by the Mapper or the Reducer (either the intermediate outputs or the output of the job).

-

Hadoop Map/Reduce comes bundled with a +

Hadoop MapReduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

@@ -987,10 +1007,10 @@ Job Configuration

- JobConf represents a Map/Reduce job configuration.

+ JobConf represents a MapReduce job configuration.

JobConf is the primary interface for a user to describe - a Map/Reduce job to the Hadoop framework for execution. The framework + a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by JobConf, however:

    @@ -1058,7 +1078,7 @@ -Djava.library.path=<> etc. If the mapred.{map|reduce}.child.java.opts parameters contains the symbol @taskid@ it is interpolated with value of - taskid of the map/reduce task.

    + taskid of the MapReduce task.

    Here is an example with multiple arguments and substitutions, showing jvm GC logging, and start of a passwordless JVM JMX agent so that @@ -1070,7 +1090,7 @@

    <property>
    -   <name>mapred.map.child.java.opts</name>
    +   <name>mapreduce.map.java.opts</name>
      <value>
         -Xmx512M -Djava.library.path=/home/mycompany/lib @@ -1084,7 +1104,7 @@

    <property>
    -   <name>mapred.reduce.child.java.opts</name>
    +   <name>mapreduce.reduce.java.opts</name>
      <value>
         -Xmx1024M -Djava.library.path=/home/mycompany/lib @@ -1109,9 +1129,9 @@

    Note: mapred.{map|reduce}.child.java.opts are used only for configuring the launched child tasks from task tracker. Configuring - the memory options for daemons is documented in - - cluster_setup.html

    + the memory options for daemons is documented under + + Configuring the Environment of the Hadoop Daemons (Cluster Setup).

    The memory available to some parts of the framework is also configurable. In map and reduce tasks, performance may be influenced @@ -1157,26 +1177,28 @@ - + - + - +
    NameTypeDescription
    io.sort.mbint
    mapreduce.task.io.sort.mbint The cumulative size of the serialization and accounting buffers storing records emitted from the map, in megabytes.
    io.sort.record.percentfloat
    mapreduce.map.sort.record.percentfloat The ratio of serialization to accounting space can be adjusted. Each serialized record requires 16 bytes of accounting information in addition to its serialized size to effect the sort. This percentage of space allocated from - io.sort.mb affects the probability of a spill to + mapreduce.task.io.sort.mb affects the + probability of a spill to disk being caused by either exhaustion of the serialization buffer or the accounting space. Clearly, for a map outputting small records, a higher value than the default will likely decrease the number of spills to disk.
    io.sort.spill.percentfloat
    mapreduce.map.sort.spill.percentfloat This is the threshold for the accounting and serialization buffers. When this percentage of either buffer has filled, their contents will be spilled to disk in the background. Let - io.sort.record.percent be r, - io.sort.mb be x, and this value be + mapreduce.map.sort.record.percent be r, + mapreduce.task.io.sort.mb be x, + and this value be q. The maximum number of records collected before the collection thread will spill is r * x * q * 2^16. Note that a higher value may decrease the number of- or even @@ -1216,7 +1238,7 @@ - + - + - + - + - + @@ -3124,7 +3187,7 @@ Highlights

    The second version of WordCount improves upon the - previous one by using some features offered by the Map/Reduce framework: + previous one by using some features offered by the MapReduce framework:

    • Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/site.xml Sat Nov 28 20:26:01 2009 @@ -34,39 +34,23 @@ - - - - + + + - - - - - - - - - - + + + + + + - - - - - - - - - - - - - - - - + + + + + @@ -78,19 +62,20 @@ - - - - - + + + + + - - - - - - - + + + + + + + + Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/streaming.xml Sat Nov 28 20:26:01 2009 @@ -30,7 +30,7 @@ Hadoop Streaming

      -Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or +Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer. For example:

      @@ -47,7 +47,7 @@ How Streaming Works

      In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. -The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes. +The utility will create a MapReduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.

      When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. @@ -63,7 +63,7 @@ prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized, as discussed later.

      -This is the basis for the communication protocol between the Map/Reduce framework and the streaming mapper/reducer. +This is the basis for the communication protocol between the MapReduce framework and the streaming mapper/reducer.

      You can supply a Java class as the mapper and/or the reducer. The above example is equivalent to: @@ -161,7 +161,7 @@

      Specifying Other Plugins for Jobs

      -Just as with a normal Map/Reduce job, you can specify other plugins for a streaming job: +Just as with a normal MapReduce job, you can specify other plugins for a streaming job:

      -inputformat JavaClassName @@ -188,7 +188,7 @@
      Generic Command Options -

      Streaming supports streaming command options as well as generic command options. +

      Streaming supports generic command options as well as streaming command options. The general command line syntax is shown below.

      Note: Be sure to place the generic options before the streaming options, otherwise the command will fail. For an example, see Making Archives Available to Tasks.

      @@ -201,7 +201,7 @@
    - +
    NameTypeDescription
    io.sort.factorint
    mapreduce.task.io.sort.factorint Specifies the number of segments on disk to be merged at the same time. It limits the number of open files and compression codecs during the merge. If the number of files @@ -1224,7 +1246,7 @@ Though this limit also applies to the map, most jobs should be configured so that hitting this limit is unlikely there.
    mapred.inmem.merge.thresholdint
    mapreduce.reduce.merge.inmem.thresholdint The number of sorted map outputs fetched into memory before being merged to disk. Like the spill thresholds in the preceding note, this is not defining a unit of partition, but @@ -1233,7 +1255,7 @@ less expensive than merging from disk (see notes following this table). This threshold influences only the frequency of in-memory merges during the shuffle.
    mapred.job.shuffle.merge.percentfloat
    mapreduce.reduce.shuffle.merge.percentfloat The memory threshold for fetched map outputs before an in-memory merge is started, expressed as a percentage of memory allocated to storing map outputs in memory. Since map @@ -1243,14 +1265,14 @@ reduces whose input can fit entirely in memory. This parameter influences only the frequency of in-memory merges during the shuffle.
    mapred.job.shuffle.input.buffer.percentfloat
    mapreduce.reduce.shuffle.input.buffer.percentfloat The percentage of memory- relative to the maximum heapsize - as typically specified in mapred.reduce.child.java.opts- + as typically specified in mapreduce.reduce.java.opts- that can be allocated to storing map outputs during the shuffle. Though some memory should be set aside for the framework, in general it is advantageous to set this high enough to store large and numerous map outputs.
    mapred.job.reduce.input.buffer.percentfloat
    mapreduce.reduce.input.buffer.percentfloat The percentage of memory relative to the maximum heapsize in which map outputs may be retained during the reduce. When the reduce begins, map outputs will be merged to disk until @@ -1275,7 +1297,8 @@ than aggressively increasing buffer sizes.
  • When merging in-memory map outputs to disk to begin the reduce, if an intermediate merge is necessary because there are - segments to spill and at least io.sort.factor + segments to spill and at least + mapreduce.task.io.sort.factor segments already on disk, the in-memory map outputs will be part of the intermediate merge.
  • @@ -1285,7 +1308,7 @@
    Directory Structure

    The task tracker has local directory, - ${mapred.local.dir}/taskTracker/ to create localized + ${mapreduce.cluster.local.dir}/taskTracker/ to create localized cache and localized job. It can define multiple local directories (spanning multiple disks) and then each filename is assigned to a semi-random local directory. When the job starts, task tracker @@ -1293,24 +1316,24 @@ specified in the configuration. Thus the task tracker directory structure looks the following:

      -
    • ${mapred.local.dir}/taskTracker/archive/ : +
    • ${mapreduce.cluster.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache. Thus localized distributed cache is shared among all the tasks and jobs
    • -
    • ${mapred.local.dir}/taskTracker/jobcache/$jobid/ : +
    • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/ : The localized job directory
        -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/work/ +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/work/ : The job-specific shared directory. The tasks can use this space as scratch space and share files among them. This directory is exposed to the users through the configuration property - job.local.dir. The directory can accessed through + mapreduce.job.local.dir. The directory can accessed through api JobConf.getJobLocalDir(). It is available as System property also. So, users (streaming etc.) can call - System.getProperty("job.local.dir") to access the + System.getProperty("mapreduce.job.local.dir") to access the directory.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/jars/ +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/jars/ : The jars directory, which has the job jar file and expanded jar. The job.jar is the application's jar file that is automatically distributed to each machine. It is expanded in jars @@ -1319,29 +1342,29 @@ JobConf.getJar() . To access the unjarred directory, JobConf.getJar().getParent() can be called.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/job.xml +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/job.xml : The job.xml file, the generic job configuration, localized for the job.
      • -
      • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid +
      • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid : The task directory for each task attempt. Each task directory again has the following structure :
          -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/job.xml : A job.xml file, task localized job configuration, Task localization means that properties have been set that are specific to this particular task within the job. The properties localized for each task are described below.
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/output +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/output : A directory for intermediate output files. This contains the temporary map reduce data generated by the framework such as map output files etc.
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work +
• ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/work : The current working directory of the task. With jvm reuse enabled for tasks, this directory will be the directory on which the jvm has started
        • -
        • ${mapred.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp +
        • ${mapreduce.cluster.local.dir}/taskTracker/jobcache/$jobid/$taskid/work/tmp : The temporary directory for the task. - (User can specify the property mapred.child.tmp to set + (User can specify the property mapreduce.task.tmp.dir to set the value of temporary directory for map and reduce tasks. This defaults to ./tmp. If the value is not an absolute path, it is prepended with task's working directory. Otherwise, it is @@ -1350,7 +1373,7 @@ -Djava.io.tmpdir='the absolute path of the tmp dir'. Anp pipes and streaming are set with environment variable, TMPDIR='the absolute path of the tmp dir'). This - directory is created, if mapred.child.tmp has the value + directory is created, if mapreduce.task.tmp.dir has the value ./tmp
      • @@ -1362,7 +1385,7 @@
        Task JVM Reuse

        Jobs can enable task JVMs to be reused by specifying the job - configuration mapred.job.reuse.jvm.num.tasks. If the + configuration mapreduce.job.jvm.numtasks. If the value is 1 (the default), then JVMs are not reused (i.e. 1 task per JVM). If it is -1, there is no limit to the number of tasks a JVM can run (of the same job). One can also specify some @@ -1377,26 +1400,26 @@ for each task's execution:

        - - + + - + - + - + - + - + - + - + - + - +
        NameTypeDescription
        mapred.job.idStringThe job id
        mapred.jarString
        mapreduce.job.idStringThe job id
        mapreduce.job.jarString job.jar location in job directory
        job.local.dir String
        mapreduce.job.local.dir String The job specific shared scratch space
        mapred.tip.id String
        mapreduce.task.id String The task id
        mapred.task.id String
        mapreduce.task.attempt.id String The task attempt id
        mapred.task.is.map boolean
        mapreduce.task.ismap boolean Is this a map task
        mapred.task.partition int
        mapreduce.task.partition int The id of the task within the job
        map.input.file String
        mapreduce.map.input.file String The filename that the map is reading from
        map.input.start long
        mapreduce.map.input.start long The offset of the start of the map input split
        map.input.length long
        mapreduce.map.input.length long The number of bytes in the map input split
        mapred.work.output.dir String
        mapreduce.task.output.dir String The task's temporary output directory
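As an illustration (a sketch only; conf is assumed to be the JobConf passed to the task's configure() method), the localized values listed in the table above can be read straight from the job configuration:

  // Inside Mapper/Reducer configure(JobConf conf):
  String jobId     = conf.get("mapreduce.job.id");
  String attemptId = conf.get("mapreduce.task.attempt.id");
  boolean isMap    = conf.getBoolean("mapreduce.task.ismap", true);
  int partition    = conf.getInt("mapreduce.task.partition", -1);
  String inputFile = conf.get("mapreduce.map.input.file");   // map tasks only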
@@ -1404,7 +1427,7 @@ Note: During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). - For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. + For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. To get the values in a streaming job's mapper/reducer use the parameter names with the underscores.

        @@ -1428,9 +1451,9 @@ System.loadLibrary or System.load. More details on how to load shared libraries through - distributed cache are documented at - - native_libraries.html

        + distributed cache are documented under + + Building Native Hadoop Libraries.

    @@ -1442,7 +1465,7 @@ with the JobTracker.

    JobClient provides facilities to submit jobs, track their - progress, access component-tasks' reports and logs, get the Map/Reduce + progress, access component-tasks' reports and logs, get the MapReduce cluster's status information and so on.
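For illustration, a minimal submission sketch (assuming conf is a fully configured JobConf) showing the blocking and non-blocking styles behind the steps listed below:

  // JobClient and RunningJob are in org.apache.hadoop.mapred.
  // Blocking: submit the job and wait for it to finish, reporting progress.
  RunningJob completed = JobClient.runJob(conf);

  // Non-blocking: submit the job, then poll the returned handle.
  JobClient client = new JobClient(conf);
  RunningJob handle = client.submitJob(conf);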

    The job submission process involves:

    @@ -1454,7 +1477,7 @@ DistributedCache of the job, if necessary.
  • - Copying the job's jar and configuration to the Map/Reduce system + Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
  • @@ -1462,23 +1485,16 @@ monitoring it's status.
  • -

    Job history files are also logged to user specified directory - hadoop.job.history.user.location - which defaults to job output directory. The files are stored in - "_logs/history/" in the specified directory. Hence, by default they - will be in mapred.output.dir/_logs/history. User can stop - logging by giving the value none for - hadoop.job.history.user.location

    -

    User can view the history logs summary in specified directory +

    User can view the history log summary for a given history file using the following command
    - $ bin/hadoop job -history output-dir
    + $ bin/hadoop job -history history-file
    This command will print job details, failed and killed tip details.
    More details about the job such as successful tasks and task attempts made for each task can be viewed using the following command
    - $ bin/hadoop job -history all output-dir

    + $ bin/hadoop job -history all history-file

    User can use OutputLogFilter @@ -1491,8 +1507,8 @@

    Job Control -

    Users may need to chain Map/Reduce jobs to accomplish complex - tasks which cannot be done via a single Map/Reduce job. This is fairly +

    Users may need to chain MapReduce jobs to accomplish complex + tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.
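A sketch of this pattern (paths and the driver class are hypothetical): run the first job to completion, then point the second job's input at the first job's output:

  // Classes are from org.apache.hadoop.mapred and org.apache.hadoop.fs.
  JobConf first = new JobConf(ChainDriver.class);
  FileInputFormat.setInputPaths(first, new Path("in"));
  FileOutputFormat.setOutputPath(first, new Path("intermediate"));
  JobClient.runJob(first);                         // blocks until the first job completes

  JobConf second = new JobConf(ChainDriver.class);
  FileInputFormat.setInputPaths(second, new Path("intermediate"));
  FileOutputFormat.setOutputPath(second, new Path("out"));
  JobClient.runJob(second);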

    @@ -1526,10 +1542,10 @@ Job Input

    - InputFormat describes the input-specification for a Map/Reduce job. + InputFormat describes the input-specification for a MapReduce job.

    -

    The Map/Reduce framework relies on the InputFormat of +

    The MapReduce framework relies on the InputFormat of the job to:

    1. Validate the input-specification of the job.
    2. @@ -1552,7 +1568,7 @@ InputSplit instances based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound - on the split size can be set via mapred.min.split.size.

      + on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

      Clearly, logical splits based on input-size is insufficient for many applications since record boundaries must be respected. In such cases, @@ -1584,7 +1600,7 @@

      FileSplit is the default InputSplit. It sets - map.input.file to the path of the input file for the + mapreduce.map.input.file to the path of the input file for the logical split.

    @@ -1608,10 +1624,10 @@ Job Output

    - OutputFormat describes the output-specification for a Map/Reduce + OutputFormat describes the output-specification for a MapReduce job.

    -

    The Map/Reduce framework relies on the OutputFormat of +

    The MapReduce framework relies on the OutputFormat of the job to:

    1. @@ -1652,9 +1668,9 @@

      OutputCommitter describes the commit of task output for a - Map/Reduce job.

      + MapReduce job.

      -

      The Map/Reduce framework relies on the OutputCommitter +

      The MapReduce framework relies on the OutputCommitter of the job to:

      1. @@ -1712,34 +1728,34 @@ (using the attemptid, say attempt_200709221812_0001_m_000000_0), not just per task.

        -

        To avoid these issues the Map/Reduce framework, when the +

        To avoid these issues the MapReduce framework, when the OutputCommitter is FileOutputCommitter, maintains a special - ${mapred.output.dir}/_temporary/_${taskid} sub-directory - accessible via ${mapred.work.output.dir} + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} sub-directory + accessible via ${mapreduce.task.output.dir} for each task-attempt on the FileSystem where the output of the task-attempt is stored. On successful completion of the task-attempt, the files in the - ${mapred.output.dir}/_temporary/_${taskid} (only) - are promoted to ${mapred.output.dir}. Of course, + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_${taskid} (only) + are promoted to ${mapreduce.output.fileoutputformat.outputdir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This process is completely transparent to the application.

        The application-writer can take advantage of this feature by - creating any side-files required in ${mapred.work.output.dir} + creating any side-files required in ${mapreduce.task.output.dir} during execution of a task via FileOutputFormat.getWorkOutputPath(), and the framework will promote them similarly for succesful task-attempts, thus eliminating the need to pick unique paths per task-attempt.

        -

        Note: The value of ${mapred.work.output.dir} during +

        Note: The value of ${mapreduce.task.output.dir} during execution of a particular task-attempt is actually - ${mapred.output.dir}/_temporary/_{$taskid}, and this value is - set by the Map/Reduce framework. So, just create any side-files in the + ${mapreduce.output.fileoutputformat.outputdir}/_temporary/_{$taskid}, and this value is + set by the MapReduce framework. So, just create any side-files in the path returned by - FileOutputFormat.getWorkOutputPath() from Map/Reduce + FileOutputFormat.getWorkOutputPath() from MapReduce task to take advantage of this feature.

        The entire discussion holds true for maps of jobs with @@ -1778,7 +1794,7 @@ support multiple queues.

        A job defines the queue it needs to be submitted to through the - mapred.job.queue.name property, or through the + mapreduce.job.queuename property, or through the setQueueName(String) API. Setting the queue name is optional. If a job is submitted without an associated queue name, it is submitted to the 'default' @@ -1788,7 +1804,7 @@ Counters

        Counters represent global counters, defined either by - the Map/Reduce framework or applications. Each Counter can + the MapReduce framework or applications. Each Counter can be of any Enum type. Counters of a particular Enum are bunched into groups of type Counters.Group.
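An illustrative sketch (the enum is hypothetical) of an application-defined Counter of an Enum type, updated from a task through the Reporter:

  // Application-defined counter group; any Enum type works.
  enum WordStats { TOTAL_WORDS, SKIPPED_WORDS }

  // Inside map() or reduce(), with the Reporter passed in by the framework:
  reporter.incrCounter(WordStats.TOTAL_WORDS, 1);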

        @@ -1812,7 +1828,7 @@ files efficiently.

        DistributedCache is a facility provided by the - Map/Reduce framework to cache files (text, archives, jars and so on) + MapReduce framework to cache files (text, archives, jars and so on) needed by applications.

        Applications specify the files to be cached via urls (hdfs://) @@ -1858,7 +1874,7 @@ directory of the task via the DistributedCache.createSymlink(Configuration) api. Or by setting - the configuration property mapred.create.symlink + the configuration property mapreduce.job.cache.symlink.create as yes. The DistributedCache will use the fragment of the URI as the name of the symlink. For example, the URI @@ -1877,10 +1893,56 @@ can be used to cache files/jars and also add them to the classpath of child-jvm. The same can be done by setting the configuration properties - mapred.job.classpath.{files|archives}. Similarly the + mapreduce.job.classpath.{files|archives}. Similarly the cached files that are symlinked into the working directory of the task can be used to distribute native libraries and load them.

        +

        The DistributedCache tracks modification timestamps + of the cache files/archives. Clearly the cache files/archives should + not be modified by the application or externally + while the job is executing.

        + +

        Here is an illustrative example on how to use the + DistributedCache:
        + // Setting up the cache for the application + 1. Copy the requisite files to the FileSystem:
        + $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
        + $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
        + $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
        + $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
        + $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
        + $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
        + 2. Setup the job
        + Job job = new Job(conf);
        + job.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"));
        + job.addCacheArchive(new URI("/myapp/map.zip"));
        + job.addFileToClassPath(new Path("/myapp/mylib.jar"));
        + job.addCacheArchive(new URI("/myapp/mytar.tar"));
        + job.addCacheArchive(new URI("/myapp/mytgz.tgz"));
        + job.addCacheArchive(new URI("/myapp/mytargz.tar.gz"));
        + + 3. Use the cached files in the + {@link org.apache.hadoop.mapreduce.Mapper} + or {@link org.apache.hadoop.mapreduce.Reducer}:
        + + public static class MapClass extends Mapper<K, V, K, V> {
        + private Path[] localArchives;
        + private Path[] localFiles;
        + public void setup(Context context) {
        + // Get the cached archives/files
        + localArchives = context.getLocalCacheArchives();
        + localFiles = context.getLocalCacheFiles();
        + }
        + + public void map(K key, V value, + Context context) throws IOException {
        + // Use data from the cached archives/files here
        + // ...
        + // ...
        + context.write(k, v);
        + }
        + }

        +
        @@ -1890,7 +1952,7 @@ interface supports the handling of generic Hadoop command-line options.

        -

        Tool is the standard for any Map/Reduce tool or +

        Tool is the standard for any MapReduce tool or application. The application should delegate the handling of standard command-line options to @@ -1923,7 +1985,7 @@ IsolationRunner

        - IsolationRunner is a utility to help debug Map/Reduce programs.

        + IsolationRunner is a utility to help debug MapReduce programs.

        To use the IsolationRunner, first set keep.failed.tasks.files to true @@ -1950,7 +2012,7 @@

        User can specify whether the system should collect profiler information for some of the tasks in the job by setting the - configuration property mapred.task.profile. The + configuration property mapreduce.task.profile. The value can be set using the api JobConf.setProfileEnabled(boolean). If the value is set @@ -1960,15 +2022,15 @@

        Once user configures that profiling is needed, she/he can use the configuration property - mapred.task.profile.{maps|reduces} to set the ranges - of Map/Reduce tasks to profile. The value can be set using the api + mapreduce.task.profile.{maps|reduces} to set the ranges + of MapReduce tasks to profile. The value can be set using the api JobConf.setProfileTaskRange(boolean,String). By default, the specified range is 0-2.
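For example, a sketch (the range is illustrative) of enabling profiling on a JobConf for the first few task attempts:

  conf.setProfileEnabled(true);             // same effect as mapreduce.task.profile=true
  conf.setProfileTaskRange(true, "0-2");    // profile map tasks 0-2
  conf.setProfileTaskRange(false, "0-2");   // profile reduce tasks 0-2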

        User can also specify the profiler configuration arguments by setting the configuration property - mapred.task.profile.params. The value can be specified + mapreduce.task.profile.params. The value can be specified using the api JobConf.setProfileParams(String). If the string contains a @@ -1982,8 +2044,8 @@

        Debugging -

        The Map/Reduce framework provides a facility to run user-provided - scripts for debugging. When a Map/Reduce task fails, a user can run +

        The MapReduce framework provides a facility to run user-provided + scripts for debugging. When a MapReduce task fails, a user can run a debug script, to process task logs for example. The script is given access to the task's stdout and stderr outputs, syslog and jobconf. The output from the debug script's stdout and stderr is @@ -2003,8 +2065,8 @@

        How to submit the script:

        A quick way to submit the debug script is to set values for the - properties mapred.map.task.debug.script and - mapred.reduce.task.debug.script, for debugging map and + properties mapreduce.map.debug.script and + mapreduce.reduce.debug.script, for debugging map and reduce tasks respectively. These properties can also be set by using APIs JobConf.setMapDebugScript(String) and @@ -2016,7 +2078,7 @@

        The arguments to the script are the task's stdout, stderr, syslog and jobconf files. The debug command, run on the node where - the Map/Reduce task failed, is:
        + the MapReduce task failed, is:
        $script $stdout $stderr $syslog $jobconf

        Pipes programs have the c++ program name as a fifth argument @@ -2036,14 +2098,14 @@ JobControl

        - JobControl is a utility which encapsulates a set of Map/Reduce jobs + JobControl is a utility which encapsulates a set of MapReduce jobs and their dependencies.

        Data Compression -

        Hadoop Map/Reduce provides facilities for the application-writer to +

        Hadoop MapReduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the job-outputs i.e. output of the reduces. It also comes bundled with @@ -2052,10 +2114,11 @@ algorithm. The gzip file format is also supported.
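As an illustration, a minimal sketch (assuming conf is the job's JobConf) of enabling compression for both the intermediate map outputs and the final job output with the bundled gzip codec:

  // GzipCodec is org.apache.hadoop.io.compress.GzipCodec;
  // FileOutputFormat is org.apache.hadoop.mapred.FileOutputFormat.
  conf.setCompressMapOutput(true);                           // compress intermediate map outputs
  conf.setMapOutputCompressorClass(GzipCodec.class);
  FileOutputFormat.setCompressOutput(conf, true);            // compress the job's final output
  FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);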

        -

        Hadoop also provides native implementations of the above compression +

        Hadoop also provides native implementations of the above compression codecs for reasons of both performance (zlib) and non-availability of - Java libraries. More details on their usage and availability are - available here.

        + Java libraries. For more information see the + Native Libraries Guide.

        +
        Intermediate Outputs @@ -2172,13 +2235,13 @@ Example: WordCount v2.0

        Here is a more complete WordCount which uses many of the - features provided by the Map/Reduce framework we discussed so far.

        + features provided by the MapReduce framework we discussed so far.

        -

        This needs the HDFS to be up and running, especially for the +

        This example needs the HDFS to be up and running, especially for the DistributedCache-related features. Hence it only works with a - pseudo-distributed or - fully-distributed - Hadoop installation.

        + pseudo-distributed (Single Node Setup) + or fully-distributed (Cluster Setup) + Hadoop installation.

        Source Code @@ -2367,7 +2430,7 @@
    30.        - inputFile = job.get("map.input.file"); + inputFile = job.get("mapreduce.map.input.file");
    -D property=value Optional Use value for given property
    -fs host:port or local Optional Specify a namenode
    -jt host:port or local Optional Specify a job tracker
    -files Optional Specify comma-separated files to be copied to the Map/Reduce cluster
    -files Optional Specify comma-separated files to be copied to the MapReduce cluster
    -libjars Optional Specify comma-separated jar files to include in the classpath
    -archives Optional Specify comma-separated archives to be unarchived on the compute machines
    @@ -223,9 +223,9 @@ To specify additional local temp directories use:

    - -D mapred.local.dir=/tmp/local - -D mapred.system.dir=/tmp/system - -D mapred.temp.dir=/tmp/temp + -D mapreduce.cluster.local.dir=/tmp/local + -D mapreduce.jobtracker.system.dir=/tmp/system + -D mapreduce.cluster.temp.dir=/tmp/temp

    Note: For more details on jobconf parameters see: mapred-default.html

    @@ -234,14 +234,14 @@
    Specifying Map-Only Jobs

    -Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. -The Map/Reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job. +Often, you may want to process input data using a map function only. To do this, simply set mapreduce.job.reduces to zero. +The MapReduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job.

    - -D mapred.reduce.tasks=0 + -D mapreduce.job.reduces=0

    -To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapred.reduce.tasks=0". +To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-D mapreduce.job.reduces=0".

    @@ -252,7 +252,7 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.reduce.tasks=2 \ + -D mapreduce.job.reduces=2 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ @@ -263,7 +263,7 @@
    Customizing How Lines are Split into Key/Value Pairs

    -As noted earlier, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. +As noted earlier, when the MapReduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.

    @@ -290,7 +290,7 @@ Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.

    -

    Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for Map/Reduce +

    Similarly, you can specify "stream.map.input.field.separator" and "stream.reduce.input.field.separator" as the input separator for MapReduce inputs. By default the separator is the tab character.

    @@ -306,8 +306,7 @@

    Note: The -files and -archives options are generic options. Be sure to place the generic options before the command options, otherwise the command will fail. -For an example, see The -archives Option. -Also see Other Supported Options. +For an example, see Making Archives Available to Tasks.

    @@ -323,6 +322,10 @@ -files hdfs://host:fs_port/user/testfile.txt +

    User can specify a different symlink name for -files using #.

    + +-files hdfs://host:fs_port/user/testfile.txt#testfile +

    Multiple entries can be specified like this:

    @@ -343,6 +346,10 @@ -archives hdfs://host:fs_port/user/testfile.jar +

    User can specify a different symlink name for -archives using #.

    + +-archives hdfs://host:fs_port/user/testfile.tgz#tgzdir +

    In this example, the input.txt file has two lines specifying the names of the two files: cachedir.jar/cache.txt and cachedir.jar/cache2.txt. @@ -351,9 +358,9 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -archives 'hdfs://hadoop-nn1.example.com/user/me/samples/cachefile/cachedir.jar' \ - -D mapred.map.tasks=1 \ - -D mapred.reduce.tasks=1 \ - -D mapred.job.name="Experiment" \ + -D mapreduce.job.maps=1 \ + -D mapreduce.job.reduces=1 \ + -D mapreduce.job.name="Experiment" \ -input "/user/me/samples/cachefile/input.txt" \ -output "/user/me/samples/cachefile/out" \ -mapper "xargs cat" \ @@ -401,7 +408,7 @@

    Hadoop has a library class, KeyFieldBasedPartitioner, -that is useful for many applications. This class allows the Map/Reduce +that is useful for many applications. This class allows the MapReduce framework to partition the map outputs based on certain key fields, not the whole keys. For example:

    @@ -409,9 +416,9 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=4 \ - -D map.output.key.field.separator=. \ - -D mapred.text.key.partitioner.options=-k1,2 \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.map.output.key.field.separator=. \ + -D mapreduce.partition.keypartitioner.options=-k1,2 \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ @@ -421,11 +428,11 @@

    Here, -D stream.map.output.field.separator=. and -D stream.num.map.output.key.fields=4 are as explained in previous example. The two variables are used by streaming to identify the key/value pair of mapper.

    -The map output keys of the above Map/Reduce job normally have four fields -separated by ".". However, the Map/Reduce framework will partition the map +The map output keys of the above MapReduce job normally have four fields +separated by ".". However, the MapReduce framework will partition the map outputs by the first two fields of the keys using the --D mapred.text.key.partitioner.options=-k1,2 option. -Here, -D map.output.key.field.separator=. specifies the separator +-D mapreduce.partition.keypartitioner.options=-k1,2 option. +Here, -D mapreduce.map.output.key.field.separator=. specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.

    @@ -470,22 +477,22 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \ + -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \ -D stream.map.output.field.separator=. \ -D stream.num.map.output.key.fields=4 \ - -D map.output.key.field.separator=. \ - -D mapred.text.key.comparator.options=-k2,2nr \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.map.output.key.field.separator=. \ + -D mapreduce.partition.keycomparator.options=-k2,2nr \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer org.apache.hadoop.mapred.lib.IdentityReducer

    -The map output keys of the above Map/Reduce job normally have four fields -separated by ".". However, the Map/Reduce framework will sort the +The map output keys of the above MapReduce job normally have four fields +separated by ".". However, the MapReduce framework will sort the outputs by the second field of the keys using the --D mapred.text.key.comparator.options=-k2,2nr option. +-D mapreduce.partition.keycomparator.options=-k2,2nr option. Here, -n specifies that the sorting is numerical sorting and -r specifies that the result should be reversed. A simple illustration is shown below: @@ -526,7 +533,7 @@

    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper myAggregatorForKeyCount.py \ @@ -571,11 +578,11 @@ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D map.output.key.field.separa=. \ - -D mapred.text.key.partitioner.options=-k1,2 \ - -D mapred.data.field.separator=. \ - -D map.output.key.value.fields.spec=6,5,1-3:0- \ - -D reduce.output.key.value.fields.spec=0-2:5- \ - -D mapred.reduce.tasks=12 \ + -D mapreduce.partition.keypartitioner.options=-k1,2 \ + -D mapreduce.fieldsel.data.field.separator=. \ + -D mapreduce.fieldsel.mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0- \ + -D mapreduce.fieldsel.mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5- \ + -D mapreduce.job.reduces=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.FieldSelectionMapReduce \ @@ -584,13 +591,13 @@

    -The option "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. +The option "-D mapreduce.fieldsel.mapreduce.fieldsel.map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. Key selection spec and value selection spec are separated by ":". In this case, the map output key will consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all fields (0- means field 0 and all the subsequent fields).

    -The option "-D reduce.output.key.value.fields.spec=0-2:5-" specifies +The option "-D mapreduce.fieldsel.mapreduce.fieldsel.reduce.output.key.value.fields.spec=0-2:5-" specifies key/value selection for the reduce outputs. In this case, the reduce output key will consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1). The reduce output value will consist of all fields starting @@ -653,7 +660,7 @@

    How many reducers should I use?

    -See the Hadoop Wiki for details: Reducer +For details see Reducer.

    @@ -676,7 +683,7 @@ dan 75 $ c2='cut -f2'; $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ - -D mapred.job.name='Experiment' + -D mapreduce.job.name='Experiment' -input /user/me/samples/student_marks -output /user/me/samples/student_out -mapper \"$c2\" -reducer 'cat' @@ -735,7 +742,7 @@
    How do I generate output files with gzip format?

    -Instead of plain text files, you can generate gzip files as your generated output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as option to your streaming job. +Instead of plain text files, you can generate gzip files as your generated output. Pass '-D mapreduce.output.fileoutputformat.compress=true -D mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as option to your streaming job.

    @@ -790,9 +797,9 @@
    How do I get the JobConf variables in a streaming job's mapper/reducer?

-See Configured Parameters. +See the Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). -For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores. +For example, mapreduce.job.id becomes mapreduce_job_id and mapreduce.job.jar becomes mapreduce_job_jar. In your code, use the parameter names with the underscores.

    Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/tabs.xml Sat Nov 28 20:26:01 2009 @@ -30,8 +30,8 @@ directory (ends in '/'), in which case /index.html will be added --> - - - + + + Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/content/xdocs/vaidya.xml Sat Nov 28 20:26:01 2009 @@ -29,8 +29,8 @@
    Purpose -

    This document describes various user-facing facets of Hadoop Vaidya, a performance diagnostic tool for map/reduce jobs. It - describes how to execute a default set of rules against your map/reduce job counters and +

    This document describes various user-facing facets of Hadoop Vaidya, a performance diagnostic tool for MapReduce jobs. It + describes how to execute a default set of rules against your MapReduce job counters and how to write and execute new rules to detect specific performance problems.

    A few sample test rules are provided with the tool with the objective of growing the rules database over the time. @@ -41,7 +41,7 @@

    - Pre-requisites + Prerequisites

    Ensure that Hadoop is installed and configured. More details:

      @@ -59,11 +59,11 @@

      Hadoop Vaidya (Vaidya in Sanskrit language means "one who knows", or "a physician") is a rule based performance diagnostic tool for - Map/Reduce jobs. It performs a post execution analysis of map/reduce + MapReduce jobs. It performs a post execution analysis of MapReduce job by parsing and collecting execution statistics through job history and job configuration files. It runs a set of predefined tests/rules against job execution statistics to diagnose various performance problems. - Each test rule detects a specific performance problem with the Map/Reduce job and provides + Each test rule detects a specific performance problem with the MapReduce job and provides a targeted advice to the user. This tool generates an XML report based on the evaluation results of individual test rules.

      @@ -75,9 +75,9 @@

      This section describes main concepts and terminology involved with Hadoop Vaidya,

        -
      • PostExPerformanceDiagnoser: This class extends the base Diagnoser class and acts as a driver for post execution performance analysis of Map/Reduce Jobs. +
      • PostExPerformanceDiagnoser: This class extends the base Diagnoser class and acts as a driver for post execution performance analysis of MapReduce Jobs. It detects performance inefficiencies by executing a set of performance diagnosis rules against the job execution statistics.
      • -
      • Job Statistics: This includes the job configuration information (job.xml) and various counters logged by Map/Reduce job as a part of the job history log +
      • Job Statistics: This includes the job configuration information (job.xml) and various counters logged by MapReduce job as a part of the job history log file. The counters are parsed and collected into the Job Statistics data structures, which contains global job level aggregate counters and a set of counters for each Map and Reduce task.
      • Diagnostic Test/Rule: This is a program logic that detects the inefficiency of M/R job based on the job statistics. The @@ -139,10 +139,10 @@
    - How to Write and Execute your own Tests + How to Write and Execute Your Own Tests

    Writing and executing your own test rules is not very hard. You can take a look at Hadoop Vaidya source code for existing set of tests. - The source code is at this hadoop svn repository location - . The default set of tests are under "postexdiagnosis/tests/" folder.

    + The source code is at this hadoop svn repository location. + The default set of tests are under "postexdiagnosis/tests/" folder.

    • Writing a test class for your new test case should extend the org.apache.hadoop.vaidya.DiagnosticTest class and it should override following three methods from the base class, Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/docs/src/documentation/skinconf.xml Sat Nov 28 20:26:01 2009 @@ -68,7 +68,7 @@ Hadoop Scalable Computing Platform http://hadoop.apache.org/core/ - images/core-logo.gif + images/mapreduce-logo.jpg Hadoop @@ -146,13 +146,13 @@ #content h1 { margin-bottom: .5em; - font-size: 200%; color: black; + font-size: 185%; color: black; font-family: arial; } - h2, .h3 { font-size: 195%; color: black; font-family: arial; } - h3, .h4 { font-size: 140%; color: black; font-family: arial; margin-bottom: 0.5em; } + h2, .h3 { font-size: 175%; color: black; font-family: arial; } + h3, .h4 { font-size: 135%; color: black; font-family: arial; margin-bottom: 0.5em; } h4, .h5 { font-size: 125%; color: black; font-style: italic; font-weight: bold; font-family: arial; } - h5, h6 { font-size: 110%; color: #363636; font-weight: bold; } + h5, h6 { font-size: 110%; color: #363636; font-weight: bold; } pre.code { Propchange: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/ ------------------------------------------------------------------------------ --- svn:mergeinfo (original) +++ svn:mergeinfo Sat Nov 28 20:26:01 2009 @@ -1,3 +1,3 @@ /hadoop/core/branches/branch-0.19/mapred/src/examples:713112 /hadoop/core/trunk/src/examples:776175-784663 -/hadoop/mapreduce/trunk/src/examples:804974-807678 +/hadoop/mapreduce/trunk/src/examples:804974-884916 Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/BaileyBorweinPlouffe.java Sat Nov 28 20:26:01 2009 @@ -68,7 +68,8 @@ public static final String DESCRIPTION = "A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi."; - private static final String NAME = BaileyBorweinPlouffe.class.getSimpleName(); + private static final String NAME = "mapreduce." 
+ + BaileyBorweinPlouffe.class.getSimpleName(); //custom job properties private static final String WORKING_DIR_PROPERTY = NAME + ".dir"; @@ -327,11 +328,11 @@ job.setInputFormatClass(BbpInputFormat.class); // disable task timeout - jobconf.setLong("mapred.task.timeout", 0); + jobconf.setLong(JobContext.TASK_TIMEOUT, 0); // do not use speculative execution - jobconf.setBoolean("mapred.map.tasks.speculative.execution", false); - jobconf.setBoolean("mapred.reduce.tasks.speculative.execution", false); + jobconf.setBoolean(JobContext.MAP_SPECULATIVE, false); + jobconf.setBoolean(JobContext.REDUCE_SPECULATIVE, false); return job; } Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/ExampleDriver.java Sat Nov 28 20:26:01 2009 @@ -59,14 +59,12 @@ pgd.addClass("secondarysort", SecondarySort.class, "An example defining a secondary sort to the reduce."); pgd.addClass("sudoku", Sudoku.class, "A sudoku solver."); - pgd.addClass("sleep", SleepJob.class, "A job that sleeps at each map and reduce task."); pgd.addClass("join", Join.class, "A job that effects a join over sorted, equally partitioned datasets"); pgd.addClass("multifilewc", MultiFileWordCount.class, "A job that counts words from several files."); pgd.addClass("dbcount", DBCountPageView.class, "An example job that count the pageview counts from a database."); pgd.addClass("teragen", TeraGen.class, "Generate data for the terasort"); pgd.addClass("terasort", TeraSort.class, "Run the terasort"); pgd.addClass("teravalidate", TeraValidate.class, "Checking results of terasort"); - pgd.addClass("fail", FailJob.class, "a job that always fails"); exitCode = pgd.driver(argv); } catch(Throwable e){ Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java (original) +++ hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Grep.java Sat Nov 28 20:26:01 2009 @@ -52,9 +52,9 @@ Integer.toString(new Random().nextInt(Integer.MAX_VALUE))); Configuration conf = getConf(); - conf.set("mapred.mapper.regex", args[2]); + conf.set(RegexMapper.PATTERN, args[2]); if (args.length == 4) - conf.set("mapred.mapper.regex.group", args[3]); + conf.set(RegexMapper.GROUP, args[3]); Job grepJob = new Job(conf); Modified: hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java URL: http://svn.apache.org/viewvc/hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java?rev=885145&r1=885144&r2=885145&view=diff ============================================================================== --- hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java (original) +++ 
hadoop/mapreduce/branches/MAPREDUCE-233/src/examples/org/apache/hadoop/examples/Join.java Sat Nov 28 20:26:01 2009 @@ -52,7 +52,7 @@ * [in-dir]* in-dir out-dir */ public class Join extends Configured implements Tool { - + public static String REDUCES_PER_HOST = "mapreduce.join.reduces_per_host"; static int printUsage() { System.out.println("join [-r ] " + "[-inFormat ] " + @@ -77,7 +77,7 @@ JobClient client = new JobClient(conf); ClusterStatus cluster = client.getClusterStatus(); int num_reduces = (int) (cluster.getMaxReduceTasks() * 0.9); - String join_reduces = conf.get("mapreduce.join.reduces_per_host"); + String join_reduces = conf.get(REDUCES_PER_HOST); if (join_reduces != null) { num_reduces = cluster.getTaskTrackers() * Integer.parseInt(join_reduces);