hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "HadoopMapReduce" by TeppoKurki
Date Wed, 19 Apr 2006 05:02:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by TeppoKurki:
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce

------------------------------------------------------------------------------
  == Map ==
  
  As the Map operation is parallelized, the input file set is first
- split to several pieces called FileSplits. If an individual file
+ split into several pieces called !FileSplits. If an individual file
  is so large that it would affect seek time, it is split into
  several !FileSplits. The splitting does not know anything about the
  input file's internal logical structure; for example,
  line-oriented text files are split on arbitrary byte boundaries.
  Then a new !MapTask is created per !FileSplit.
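  As a rough sketch, the input side of a job might be configured
  as below. The code uses the classic org.apache.hadoop.mapred
  Java API from later Hadoop releases, so the exact class and
  method names are an assumption for this page's vintage:
  
  {{{
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  
  public class InputSetup {
    public static JobConf configure() {
      JobConf conf = new JobConf(InputSetup.class);
      // The framework splits the files under this path into
      // FileSplits; one MapTask is created per split.
      FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
      conf.setInputFormat(TextInputFormat.class);
      // Only a hint: the actual number of map tasks is driven
      // by the number of splits the framework produces.
      conf.setNumMapTasks(10);
      return conf;
    }
  }
  }}}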
  
- When an individual MapTask task starts it will open a new output
+ When an individual !MapTask starts, it will open a new output
  writer per configured Reduce task. It will then proceed to read
- its FileSplit using the RecordReader it gets from the specified
+ its !FileSplit using the !RecordReader it gets from the specified
- InputFormat. InputFormat parses the input and generates
+ !InputFormat. The !InputFormat parses the input and generates
- key-value pairs. It is not necessary for the InputFormat to
+ key-value pairs. It is not necessary for the !InputFormat to
- generate both meaningful keys and values. For example the
+ generate both "meaningful" keys and values. For example, the
- default TextInputFormat's output consists of input lines as
+ default !TextInputFormat's output consists of input lines as
  values and, somewhat meaninglessly, line-start file offsets as
  keys - most applications only use the lines and ignore the
  offsets.
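  To make the shape of the pairs concrete, for a hypothetical
  two-line input file !TextInputFormat would generate the pairs
  below (each key is the byte offset of its line's start):
  
  {{{
  file contents:        generated (key, value) pairs:
    hello world\n         (0,  "hello world")
    goodbye\n             (12, "goodbye")
  }}}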
  
- As key-value pairs are read from the RecordReader they are
+ As key-value pairs are read from the !RecordReader they are
  passed to the configured Mapper. The user-supplied Mapper does
  whatever it wants with the input pair and calls
  [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable) OutputCollector.collect]
  with key-value pairs of its own choosing. The output it
  generates must use one key class and one value class, because
@@ -39, +39 @@
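  As an illustration of the collect call, a word-count !Mapper
  might look like the sketch below (written against the classic
  org.apache.hadoop.mapred API of later Hadoop releases; the
  names are assumptions for this page's vintage):
  
  {{{
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  
  public class WordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
  
    // The offset key from TextInputFormat is simply ignored;
    // every emitted pair uses one key class (Text) and one
    // value class (IntWritable).
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);  // pairs of the Mapper's own choosing
      }
    }
  }
  }}}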

  
  When Mapper output is collected, it is partitioned, which means
  that it will be written to the output specified by the
- Partitioner. The default HashPartitioner uses the key value's
+ Partitioner. The default !HashPartitioner uses the key's
  hashCode (which means that the key class's hashCode method must
  distribute keys evenly for the workload on the Reduce tasks to
  be balanced).
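  For illustration, a partitioner with the same behaviour as the
  default !HashPartitioner can be sketched as follows (classic
  org.apache.hadoop.mapred API, an assumption for this page's
  vintage):
  
  {{{
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;
  
  public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
    // Mask off the sign bit, then take the remainder modulo the
    // number of Reduce tasks. A skewed hashCode here directly
    // skews the Reduce-side workload.
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  
    public void configure(JobConf job) {}
  }
  }}}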
  
@@ -62, +62 @@

  the combine operation as they were created by the original map
  operation.
  
- For example a word count MapReduce application whose Map
+ For example, a word-count !MapReduce application whose Map
  operation outputs (word, 1) pairs as words are encountered in
  the input can use a combiner to speed up processing. A combine
  operation will start gathering the output in in-memory (instead
@@ -71, +71 @@

  unique word with the list available as an iterator. The combiner
  then emits (word, count-in-this-part-of-the-input) pairs. From
  the viewpoint of the Reduce operation, this contains the same
- information as the original Map output, but there will be a lot
+ information as the original Map output, but there might be far
  fewer bits to write to disk and read from disk.
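  A combiner for the word-count case is just a Reducer that sums
  the per-word counts; a sketch against the classic
  org.apache.hadoop.mapred API (an assumption for this page's
  vintage):
  
  {{{
  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  
  public class WordCountCombiner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Receives one unique word plus an iterator over its counts
    // and emits a single (word, count-in-this-part-of-the-input)
    // pair.
    public void reduce(Text word, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      output.collect(word, new IntWritable(sum));
    }
  }
  }}}
  
  It would be registered with
  conf.setCombinerClass(WordCountCombiner.class), after which the
  framework can run it over each map task's gathered output
  before that output is written to disk.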
  
  == Reduce ==
@@ -89, +89 @@

  
  In the end, the output will consist of one output file per Reduce
  task run. The format of the files can be specified with
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If SequentialOutputFormat is used the output Key and Value
+ [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequentialOutputFormat is used, the output Key and Value
  classes must also be specified.
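  A sketch of the output-side configuration (classic
  org.apache.hadoop.mapred API from later releases, where the
  binary output format is named SequenceFileOutputFormat - an
  assumption relative to this page's snapshot):
  
  {{{
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  
  public class OutputSetup {
    public static void configure(JobConf conf) {
      // The key and value classes must be declared because the
      // format records a single key type and a single value type
      // per output file.
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));
    }
  }
  }}}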
  
