hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "TaskExecutionEnvironment" by AmareshwariSriRamadasu
Date Wed, 11 Jun 2008 12:02:47 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by AmareshwariSriRamadasu:
http://wiki.apache.org/hadoop/TaskExecutionEnvironment

------------------------------------------------------------------------------
  = NOTE: =
- The most up-to-date information is usually available at: [http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Execution+%26+Environment
Map-Reduce Tutorial], the information below might not be the most accurate/updated.
+ The most up-to-date information is usually available at: [http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Task+Execution+%26+Environment
Map-Reduce Tutorial].
  
- = Hadoop Map/Reduce Task Execution Environment =
- 
- Hadoop Map/Reduce tasks (the generic term for maps or reduces) run distributed across a cluster. Most tasks don't care about their environment, because they use only the ''standard'' inputs and outputs from the API; some tasks do care, however, and this page documents the details.
- 
- == Directories ==
- 
- All of the directories are relative to the ''<local>'' directory set in the !TaskTracker's
configuration. The !TaskTracker can define multiple local directories and each filename is
assigned to a semi-random local directory.
- 
- There are two directories. The first is ''<local>''/taskTracker/archive. This directory holds the localized distributed cache, so the localized distributed cache is shared among all tasks and jobs. The second is the job-specific directory ''<local>''/taskTracker/jobcache/''<jobId>''. The job directory has the following structure.
- 
-   * ''<local>''/taskTracker/jobcache/''<jobId>''/ -- the job directory
-     * work/ -- the job-wide scratch space
-     * jars/ -- the expanded job.jar
-     * job.xml -- the generic job conf
-     * ''<taskid>''/ -- the task directory
-       * job.xml -- the task-localized job conf
-       * output/ -- intermediate map output files
-       * work/ -- the cwd of the task
- 
- The job directory contains the ''job.xml'', ''jars'', ''work'' and ''<taskid>'' directories. The ''job.xml'' is the serialization of the job's !JobConf after it has been ''localized'' for that task. The job.jar is contained in ''<local>''/taskTracker/jobcache/''<jobId>''/jars/. The job.jar is the application's jar file that is automatically distributed to each machine. It is expanded into the ''jars'' directory before the tasks for the job start. The job.jar location is accessible to the application through the api [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getJar() JobConf.getJar()]. Since {{{getJar()}}} returns a String, the unjarred directory can be obtained as {{{new File(conf.getJar()).getParent()}}} (where {{{conf}}} is the task's !JobConf).
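- 
- For instance, a task that ships auxiliary files inside the job.jar could locate the expanded directory as sketched below ({{{JarLocator}}} is a made-up helper; {{{conf}}} is assumed to be the !JobConf handed to {{{configure}}}):
- 
- {{{
- import java.io.File;
- import org.apache.hadoop.mapred.JobConf;
- 
- public class JarLocator {
-    // conf.getJar() returns the path of the localized job.jar, which
-    // lives in <local>/taskTracker/jobcache/<jobId>/jars/, so its
-    // parent is the directory the jar was expanded into.
-    public static File unjarredDir(JobConf conf) {
-       return new File(conf.getJar()).getParentFile();
-    }
- }
- }}}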
- 
- The ''work'' directory in the job directory is the job-specific shared directory. The tasks can use this space as scratch space and share files among themselves. This directory is exposed to users through ''job.local.dir''. The directory can be accessed through the api [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getJobLocalDir() JobConf.getJobLocalDir()]. It is also available as a system property, so users can call ''System.getProperty("job.local.dir")''.
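- 
- As a small sketch, all three of the following should yield the same path inside a running task ({{{SharedScratch}}} is a hypothetical helper; the system property is only set in the child JVM launched by the framework):
- 
- {{{
- import org.apache.hadoop.mapred.JobConf;
- 
- public class SharedScratch {
-    // Via the dedicated api
-    public static String fromApi(JobConf conf) {
-       return conf.getJobLocalDir();
-    }
-    // Via the localized property in the JobConf
-    public static String fromConf(JobConf conf) {
-       return conf.get("job.local.dir");
-    }
-    // Via the system property set for the child JVM
-    public static String fromSystemProperty() {
-       return System.getProperty("job.local.dir");
-    }
- }
- }}}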
- 
- The task directory in the job directory contains the ''job.xml'', ''output'' and ''work'' directories. The ''job.xml'' is the !JobConf localized for the task. Task localization means that properties have been set that are specific to this particular task within the job. The ''output'' directory contains the temporary map/reduce data generated by the framework, such as map output files. The ''work'' directory in the task directory is the working directory of the child process. If ''mapred.child.tmp'' has the value ''./tmp'', the work directory contains a ''tmp'' subdirectory in which to create temporary files.
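- 
- For illustration only (this assumes ''mapred.child.tmp'' is left at its default of ''./tmp'', in which case the child JVM's ''java.io.tmpdir'' is pointed at that directory):
- 
- {{{
- import java.io.File;
- import java.io.IOException;
- 
- public class ScratchFile {
-    // Creates a temporary file under <taskid>/work/tmp, because
-    // java.io.tmpdir points there when mapred.child.tmp is ./tmp.
-    public static File create() throws IOException {
-       return File.createTempFile("scratch-", ".tmp");
-    }
- }
- }}}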
- 
- == Processes ==
- 
- Each task runs in its own Java virtual machine, which is forked from the !TaskTracker. The !TaskTracker waits for the child process to finish and logs the event if a non-zero exit code is returned.
- 
- The task's class path is set to the server's class path, followed by all of the jars in the ''lib'' directory of the expanded job.jar, followed by the expanded job.jar itself.
- 
- == Outputs ==
- 
- === Output Streams ===
- 
- The standard output (stdout) and error (stderr) streams are read by the !TaskTracker and
logged to its log at the INFO level under the org.apache.hadoop.mapred.!TaskRunner logger.
- 
- === Filenames ===
- 
- Map tasks put their outputs into ''<local>''/''<taskId>''/part-''<reduce>''.out.
- 
- Reduce tasks read their inputs from ''<local>''/''<taskId>''/map_''<map>''.out.
- 
- == Localized Properties in the JobConf ==
- 
- The following properties are localized for each task's !JobConf (a sketch of reading them in code follows the table):
- 
- || '''Name''' || '''Type''' || '''Description''' ||
- || mapred.job.id || String || The job id ||
- || mapred.jar || String || job.jar location in job directory ||
- || job.local.dir || String  || The job specific shared scratch space ||
- || mapred.task.id || String || The task id ||
- || mapred.task.is.map || boolean || Is this a map task ||
- || mapred.task.partition || int || The id of the task within the job ||
- || map.input.file || String || The filename that the map is reading from ||
- || map.input.start || long || The offset of the start of the map input split ||
- || map.input.length || long || The number of bytes in the map input split ||
- || mapred.work.output.dir || String || The task's temporary output directory ||
- 
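- A hedged sketch of reading a few of these from {{{configure}}} ({{{TaskInfo}}} is a made-up class; property names are as in the table above):
- 
- {{{
- import org.apache.hadoop.mapred.JobConf;
- import org.apache.hadoop.mapred.MapReduceBase;
- 
- public class TaskInfo extends MapReduceBase {
-    public void configure(JobConf job) {
-       String taskId = job.get("mapred.task.id");
-       boolean isMap = job.getBoolean("mapred.task.is.map", false);
-       int partition = job.getInt("mapred.task.partition", -1);
-       // map.input.file is only meaningful for map tasks
-       String inputFile = job.get("map.input.file");
-       System.out.println("task " + taskId + (isMap ? " (map " : " (reduce ")
-             + partition + ") input: " + inputFile);
-    }
- }
- }}}
- 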
- == Accessing JobConf in MapReduce Programs ==
- 
- The following is an example of how to access the current !JobConf from inside your Map or Reduce functions while they are executing. One key thing to note is that the Map class must not be static (as it is in the examples) and that it needs to override {{{configure}}} in order to get access to the !JobConf. You will need to set the property before you call {{{JobClient.runJob()}}}. The same approach works for Reducer and Combiner implementations.
- 
- {{{
- import java.io.IOException;
- import org.apache.hadoop.io.*;
- import org.apache.hadoop.mapred.*;
- 
- public class Map extends MapReduceBase implements Mapper<Text, Text, Text, IntWritable> {
-    protected Integer MIN_VALUE = null;
-    public static final String MIN_VALUE_KEY = "test.minvalue";
- 
-    // Set the MIN_VALUE property
-    public void configure(JobConf job) {
-       super.configure(job);
-       // Get the min value from the current JobConf object
-       // If it was not set, then the resulting value will be null
-       String property = job.get(Map.MIN_VALUE_KEY);
-       if (property == null) {
-          System.err.println("ERROR: The property '" + Map.MIN_VALUE_KEY + "' was not set");
-          System.exit(1);
-       }
-       this.MIN_VALUE = Integer.parseInt(property);
-    }
- 
-    // Check whether the value is greater than our MIN_VALUE
-    public void map(Text key,
-                    Text value,
-                    OutputCollector<Text, IntWritable> output,
-                    Reporter reporter) throws IOException {
-       Integer temp = Integer.valueOf(value.toString());
-       if (temp > this.MIN_VALUE) {
-          output.collect(key, new IntWritable(temp));
-       }
-    }
- } // END CLASS
- }}}
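- 
- On the driver side, the property has to be set before {{{JobClient.runJob()}}} so that it is serialized into the job.xml that each task localizes. A hedged sketch of such a driver ({{{MinValueDriver}}} and the input/output arguments are made up; !KeyValueTextInputFormat is chosen only because the Map class above expects Text keys):
- 
- {{{
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapred.FileInputFormat;
- import org.apache.hadoop.mapred.FileOutputFormat;
- import org.apache.hadoop.mapred.JobClient;
- import org.apache.hadoop.mapred.JobConf;
- import org.apache.hadoop.mapred.KeyValueTextInputFormat;
- 
- public class MinValueDriver {
-    public static void main(String[] args) throws Exception {
-       JobConf conf = new JobConf(MinValueDriver.class);
-       conf.setJobName("min-value-filter");
-       // Must be set before runJob(), so the value reaches every task
-       conf.set(Map.MIN_VALUE_KEY, "10");
-       conf.setMapperClass(Map.class);
-       // The Map class reads Text keys, so use a key/value input format
-       conf.setInputFormat(KeyValueTextInputFormat.class);
-       conf.setOutputKeyClass(Text.class);
-       conf.setOutputValueClass(IntWritable.class);
-       FileInputFormat.setInputPaths(conf, new Path(args[0]));
-       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
-       JobClient.runJob(conf);
-    }
- }
- }}}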
- 
