From: cdouglas@apache.org
To: core-commits@hadoop.apache.org
Subject: svn commit: r721790 [1/3] - in /hadoop/core/branches/branch-0.19: CHANGES.txt docs/mapred_tutorial.html docs/mapred_tutorial.pdf src/docs/src/documentation/content/xdocs/mapred_tutorial.xml
Date: Sun, 30 Nov 2008 09:37:46 -0000
Message-Id: <20081130093747.8B27923889B7@eris.apache.org>

Author: cdouglas
Date: Sun Nov 30 01:37:46 2008
New Revision: 721790

URL: http://svn.apache.org/viewvc?rev=721790&view=rev
Log:
HADOOP-4739. Fix spelling and grammar, improve phrasing of some sections in
mapred tutorial. Contributed by Vivek Ratan.

Modified:
    hadoop/core/branches/branch-0.19/CHANGES.txt
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
    hadoop/core/branches/branch-0.19/docs/mapred_tutorial.pdf
    hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/mapred_tutorial.xml

Modified: hadoop/core/branches/branch-0.19/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/CHANGES.txt?rev=721790&r1=721789&r2=721790&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/CHANGES.txt (original)
+++ hadoop/core/branches/branch-0.19/CHANGES.txt Sun Nov 30 01:37:46 2008
@@ -2,6 +2,11 @@
 Release 0.19.1 - Unreleased
 
+  IMPROVEMENTS
+
+    HADOOP-4739. Fix spelling and grammar, improve phrasing of some sections in
+    mapred tutorial. (Vivek Ratan via cdouglas)
+
   BUG FIXES
 
     HADOOP-4697. Fix getBlockLocations in KosmosFileSystem to handle multiple

Modified: hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html?rev=721790&r1=721789&r2=721790&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html (original)
+++ hadoop/core/branches/branch-0.19/docs/mapred_tutorial.html Sun Nov 30 01:37:46 2008
@@ -314,7 +314,7 @@
 Other Useful Features
  • -Submitting Jobs to a Queue +Submitting Jobs to Queues
  • Counters @@ -351,7 +351,7 @@ Example: WordCount v2.0
    • -Source Code +Source Code
    • Sample Runs @@ -2283,23 +2283,26 @@ FileSystem.

      Other Useful Features

      - -

      Submitting Jobs to a Queue

      -

      Some job schedulers supported in Hadoop, like the - Capacity - Scheduler, support multiple queues. If such a scheduler is - being used, users can submit jobs to one of the queues - administrators would have defined in the - mapred.queue.names property of the Hadoop site - configuration. The queue name can be specified through the - mapred.job.queue.name property, or through the - setQueueName(String) - API. Note that administrators may choose to define ACLs - that control which queues a job can be submitted to by a - given user. In that case, if the job is not submitted - to one of the queues where the user has access, - the job would be rejected.

      - + +

      Submitting Jobs to Queues

      +

Users submit jobs to Queues. Queues, as collections of jobs, + allow the system to provide specific functionality. For example, + queues use ACLs to control which users + can submit jobs to them. Queues are expected to be primarily + used by Hadoop Schedulers.

      +

      Hadoop comes configured with a single mandatory queue, called + 'default'. Queue names are defined in the + mapred.queue.names property of the Hadoop site + configuration. Some job schedulers, such as the + Capacity Scheduler, + support multiple queues.

      +

      A job defines the queue it needs to be submitted to through the + mapred.job.queue.name property, or through the + setQueueName(String) + API. Setting the queue name is optional. If a job is submitted + without an associated queue name, it is submitted to the 'default' + queue.
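For illustration, here is a minimal sketch of the above in the old mapred API. The class name QueueExample and the queue name 'research' are hypothetical; a real queue name must appear in mapred.queue.names, or the job is rejected.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class QueueExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(QueueExample.class);
        // Equivalent to setting the mapred.job.queue.name property;
        // omitting this submits the job to the 'default' queue.
        conf.setQueueName("research");
        // ... set mapper, reducer, input/output paths, then submit:
        JobClient.runJob(conf);
      }
    }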

      +

      Counters

      Counters represent global counters, defined either by @@ -2316,7 +2319,7 @@ in the map and/or reduce methods. These counters are then globally aggregated by the framework.
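To make this concrete, here is a minimal sketch of an application-defined counter updated from a map method; the class and counter names are hypothetical.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class CountingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      // Application-defined counters, reported per task and
      // aggregated globally by the framework.
      public static enum MyCounters { MALFORMED_RECORDS }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        if (value.toString().length() == 0) {
          reporter.incrCounter(MyCounters.MALFORMED_RECORDS, 1);
          return;
        }
        output.collect(value, new LongWritable(1));
      }
    }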

      - +

      DistributedCache

      @@ -2387,7 +2390,7 @@ mapred.job.classpath.{files|archives}. Similarly the cached files that are symlinked into the working directory of the task can be used to distribute native libraries and load them.
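As an illustration, a minimal sketch of distributing and symlinking a file during job setup; the HDFS path /myapp/lookup.dat and the symlink name are hypothetical.

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    // Inside the job driver:
    JobConf conf = new JobConf();
    // The URI fragment after '#' names the symlink created in each
    // task's working directory.
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), conf);
    DistributedCache.createSymlink(conf);
    // Tasks can now open the file as ./lookup.dat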

      - +

      Tool

      The Tool interface supports the handling of generic Hadoop command-line options. @@ -2427,7 +2430,7 @@
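A minimal sketch of a driver implementing Tool, so that ToolRunner parses generic options such as -conf, -D, -fs and -jt before the job-specific arguments; the class name MyTool is hypothetical.

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyTool extends Configured implements Tool {
      public int run(String[] args) throws Exception {
        // Generic options have already been applied to getConf().
        JobConf job = new JobConf(getConf(), MyTool.class);
        // ... configure the job from the remaining args, then:
        JobClient.runJob(job);
        return 0;
      }
      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyTool(), args));
      }
    }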

      - +

      IsolationRunner

      @@ -2451,7 +2454,7 @@

      IsolationRunner will run the failed task in a single jvm, which can be in the debugger, over precisely the same input.
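To use it, the job is first re-run with keep.failed.task.files set to true, so that the files of the failing task are retained; a minimal sketch:

    // Inside the job driver:
    JobConf conf = new JobConf();
    // Keep the failed task's files (equivalent to setting the
    // keep.failed.task.files property to true).
    conf.setKeepFailedTaskFiles(true);

The failed task can then be re-run from its working directory on the failing node, for example with hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml.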

      - +

      Profiling

Profiling is a utility to get a representative (2 or 3) sample of built-in java profiler output for a sample of maps and reduces.

      @@ -2484,39 +2487,40 @@ -agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s
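These properties can also be set from the job driver; a minimal sketch (the task ranges shown are arbitrary examples):

    // Inside the job driver:
    JobConf conf = new JobConf();
    // Enable profiling (the mapred.task.profile property).
    conf.setProfileEnabled(true);
    // Profile only the first two map and reduce tasks.
    conf.setProfileTaskRange(true, "0-1");   // maps
    conf.setProfileTaskRange(false, "0-1");  // reduces
    // hprof parameters; %s is replaced by the profile output file.
    conf.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites," +
        "force=n,thread=y,verbose=n,file=%s");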

      - +

      Debugging

      -

      Map/Reduce framework provides a facility to run user-provided - scripts for debugging. When map/reduce task fails, user can run - script for doing post-processing on task logs i.e task's stdout, - stderr, syslog and jobconf. The stdout and stderr of the - user-provided debug script are printed on the diagnostics. - These outputs are also displayed on job UI on demand.

      -

      In the following sections we discuss how to submit debug script - along with the job. For submitting debug script, first it has to - distributed. Then the script has to supplied in Configuration.

      - -
      How to distribute script file:
      +

      The Map/Reduce framework provides a facility to run user-provided + scripts for debugging. When a map/reduce task fails, a user can run + a debug script, to process task logs for example. The script is + given access to the task's stdout and stderr outputs, syslog and + jobconf. The output from the debug script's stdout and stderr is + displayed on the console diagnostics and also as part of the + job UI.

      +

      In the following sections we discuss how to submit a debug script + with a job. The script file needs to be distributed and submitted to + the framework.

      + +
      How to distribute the script file:

      - The user has to use + The user needs to use DistributedCache - mechanism to distribute and symlink the - debug script file.

      - -
      How to submit script:
      -

      A quick way to submit debug script is to set values for the - properties "mapred.map.task.debug.script" and - "mapred.reduce.task.debug.script" for debugging map task and reduce - task respectively. These properties can also be set by using APIs + to distribute and symlink the script file.

      + +
      How to submit the script:
      +

      A quick way to submit the debug script is to set values for the + properties mapred.map.task.debug.script and + mapred.reduce.task.debug.script, for debugging map and + reduce tasks respectively. These properties can also be set by using APIs JobConf.setMapDebugScript(String) and - JobConf.setReduceDebugScript(String) . For streaming, debug - script can be submitted with command-line options -mapdebug, - -reducedebug for debugging mapper and reducer respectively.
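Putting both steps together in the job driver, a minimal sketch; the symlink name debug_script is hypothetical and assumes the script was distributed via DistributedCache with that symlink:

    // Inside the job driver:
    JobConf conf = new JobConf();
    // Equivalent to setting mapred.map.task.debug.script and
    // mapred.reduce.task.debug.script.
    conf.setMapDebugScript("./debug_script");
    conf.setReduceDebugScript("./debug_script");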

      -

      The arguments of the script are task's stdout, stderr, + JobConf.setReduceDebugScript(String) . In streaming mode, a debug + script can be submitted with the command-line options + -mapdebug and -reducedebug, for debugging + map and reduce tasks respectively.

      +

      The arguments to the script are the task's stdout, stderr, syslog and jobconf files. The debug command, run on the node where - the map/reduce failed, is:
      + the map/reduce task failed, is:
      $script $stdout $stderr $syslog $jobconf

      @@ -2526,17 +2530,17 @@ $script $stdout $stderr $syslog $jobconf $program

      - +
      Default Behavior:

For pipes, a default script is run which processes core dumps under gdb, prints the stack trace and gives info about running threads.

      - +

      JobControl

      JobControl is a utility which encapsulates a set of Map/Reduce jobs and their dependencies.
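For illustration, a minimal sketch chaining two dependent jobs; conf1 and conf2 stand for hypothetical, fully-configured JobConfs where the second job consumes the first job's output.

    import java.util.ArrayList;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    // Inside a driver method that declares 'throws Exception':
    Job job1 = new Job(conf1, new ArrayList<Job>());
    ArrayList<Job> deps = new ArrayList<Job>();
    deps.add(job1);
    Job job2 = new Job(conf2, deps);      // runs only after job1 succeeds

    JobControl control = new JobControl("wordcount-chain");
    control.addJob(job1);
    control.addJob(job2);
    new Thread(control).start();          // JobControl is a Runnable
    while (!control.allFinished()) {
      Thread.sleep(5000);                 // poll until both jobs complete
    }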

      - +

      Data Compression

      Hadoop Map/Reduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the @@ -2550,7 +2554,7 @@ codecs for reasons of both performance (zlib) and non-availability of Java libraries (lzo). More details on their usage and availability are available here.

      - +
      Intermediate Outputs

      Applications can control compression of intermediate map-outputs via the @@ -2559,7 +2563,7 @@ CompressionCodec to be used via the JobConf.setMapOutputCompressorClass(Class) api.
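For illustration, a minimal sketch (GzipCodec is an arbitrary choice here):

    // Inside the job driver:
    JobConf conf = new JobConf();
    // Compress map outputs before they are shipped to the reducers.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(
        org.apache.hadoop.io.compress.GzipCodec.class);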

      - +
      Job Outputs

      Applications can control compression of job-outputs via the @@ -2576,64 +2580,60 @@ SequenceFileOutputFormat.setOutputCompressionType(JobConf, SequenceFile.CompressionType) api.
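A minimal sketch, assuming SequenceFile job outputs and, again arbitrarily, the gzip codec:

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    // Inside the job driver:
    JobConf conf = new JobConf();
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    // For SequenceFile outputs, also pick the compression granularity.
    SequenceFileOutputFormat.setOutputCompressionType(conf,
        SequenceFile.CompressionType.BLOCK);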

      - +

      Skipping Bad Records

      -

      Hadoop provides an optional mode of execution in which the bad - records are detected and skipped in further attempts. - Applications can control various settings via +

      Hadoop provides an option where a certain set of bad input + records can be skipped when processing map inputs. Applications + can control this feature through the - SkipBadRecords.

      -

      This feature can be used when map/reduce tasks crashes - deterministically on certain input. This happens due to bugs in the - map/reduce function. The usual course would be to fix these bugs. - But sometimes this is not possible; perhaps the bug is in third party - libraries for which the source code is not available. Due to this, - the task never reaches to completion even with multiple attempts and - complete data for that task is lost.

      -

      With this feature, only a small portion of data is lost surrounding - the bad record. This may be acceptable for some user applications; - for example applications which are doing statistical analysis on - very large data. By default this feature is disabled. For turning it - on refer + SkipBadRecords class.

      +

This feature can be used when map tasks crash deterministically + on certain input. This usually happens due to bugs in the + map function. Usually, the user would have to fix these bugs. + Sometimes, however, this is not possible. The bug may be in third + party libraries, for example, for which the source code is not + available. In such cases, the task never completes successfully even + after multiple attempts, and the job fails. With this feature, only + a small portion of data surrounding the + bad records is lost, which may be acceptable for some applications + (those performing statistical analysis on very large data, for + example).

      +

      By default this feature is disabled. For enabling it, + refer to SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) and SkipBadRecords.setReducerMaxSkipGroups(Configuration, long).
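For illustration, a minimal sketch of enabling the feature from the job driver; the thresholds are arbitrary examples.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    // Inside the job driver:
    JobConf conf = new JobConf();
    // Enter 'skipping mode' after two failed attempts of a task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    // Accept losing at most 100 records around each bad map record
    // (0 disables the feature).
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100);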

      -

      The skipping mode gets kicked off after certain no of failures +

      With this feature enabled, the framework gets into 'skipping + mode' after a certain number of map failures. For more details, see - SkipBadRecords.setAttemptsToStartSkipping(Configuration, int). -

      -

      In the skipping mode, the map/reduce task maintains the record - range which is getting processed at all times. For maintaining this - range, the framework relies on the processed record - counter. see + SkipBadRecords.setAttemptsToStartSkipping(Configuration, int). + In 'skipping mode', map tasks maintain the range of records being + processed. To do this, the framework relies on the processed record + counter. See SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS and SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS. - Based on this counter, the framework knows that how - many records have been processed successfully by mapper/reducer. - Before giving the - input to the map/reduce function, it sends this record range to the - Task tracker. If task crashes, the Task tracker knows which one was - the last reported range. On further attempts that range get skipped. -

      -

      The number of records skipped for a single bad record depends on - how frequent, the processed counters are incremented by the application. - It is recommended to increment the counter after processing every - single record. However in some applications this might be difficult as - they may be batching up their processing. In that case, the framework - might skip more records surrounding the bad record. If users want to - reduce the number of records skipped, then they can specify the - acceptable value using + This counter enables the framework to know how many records have + been processed successfully, and hence, what record range caused + a task to crash. On further attempts, this range of records is + skipped.

      +

      The number of records skipped depends on how frequently the + processed record counter is incremented by the application. + It is recommended that this counter be incremented after every + record is processed. This may not be possible in some applications + that typically batch their processing. In such cases, the framework + may skip additional records surrounding the bad record. Users can + control the number of skipped records through SkipBadRecords.setMapperMaxSkipRecords(Configuration, long) and SkipBadRecords.setReducerMaxSkipGroups(Configuration, long). - The framework tries to narrow down the skipped range by employing the - binary search kind of algorithm during task re-executions. The skipped - range is divided into two halves and only one half get executed. - Based on the subsequent failure, it figures out which half contains - the bad record. This task re-execution will keep happening till + The framework tries to narrow the range of skipped records using a + binary search-like approach. The skipped range is divided into two + halves and only one half gets executed. On subsequent + failures, the framework figures out which half contains + bad records. A task will be re-executed till the acceptable skipped value is met or all task attempts are exhausted. To increase the number of task attempts, use @@ -2641,16 +2641,15 @@ JobConf.setMaxReduceAttempts(int).
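For illustration, a sketch of a map method that increments the processed record counter after each record, keeping the skipped range small; processRecord is a hypothetical application routine.

    // Inside a Mapper implementation:
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      processRecord(value);   // hypothetical per-record application logic
      // Report progress so the framework can pin-point bad records.
      reporter.incrCounter(SkipBadRecords.COUNTER_GROUP,
          SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1);
    }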

      -

      The skipped records are written to the hdfs in the sequence file - format, which could be used for later analysis. The location of - skipped records output path can be changed by +

      Skipped records are written to HDFS in the sequence file + format, for later analysis. The location can be changed through SkipBadRecords.setSkipOutputPath(JobConf, Path).

      - +

      Example: WordCount v2.0

      Here is a more complete WordCount which uses many of the @@ -2660,7 +2659,7 @@ pseudo-distributed or fully-distributed Hadoop installation.

      - +

      Source Code

      @@ -3870,7 +3869,7 @@
      - +

      Sample Runs

      Sample text-files as input:

      @@ -4038,7 +4037,7 @@

      - +

      Highlights

      The second version of WordCount improves upon the previous one by using some features offered by the Map/Reduce framework: