hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "FAQ" by DougCutting
Date Wed, 13 Feb 2008 17:20:54 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by DougCutting:
http://wiki.apache.org/hadoop/FAQ

The comment on the change is:
update for TLP move

------------------------------------------------------------------------------
  [[Anchor(1)]]
  '''1. [#1 What is Hadoop?]'''
  
- [http://lucene.apache.org/hadoop/ Hadoop] is a distributed computing platform written in
Java.  It incorporates features similar to those of the [http://en.wikipedia.org/wiki/Google_File_System
Google File System] and of [http://en.wikipedia.org/wiki/MapReduce MapReduce].  For some details,
see HadoopMapReduce.
+ [http://hadoop.apache.org/core/ Hadoop] is a distributed computing platform written in Java.
 It incorporates features similar to those of the [http://en.wikipedia.org/wiki/Google_File_System
Google File System] and of [http://en.wikipedia.org/wiki/MapReduce MapReduce].  For some details,
see HadoopMapReduce.
  
  [[BR]]
  [[Anchor(2)]]
@@ -67, +67 @@

  
  No.  There are several ways to incorporate non-Java code.  
    * HadoopStreaming permits any shell command to be used as a map or reduce function. 
-   * [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs libhdfs], a JNI-based
C API for talking to hdfs (only).
+   * [http://svn.apache.org/viewvc/hadoop/core/trunk/src/c%2B%2B/libhdfs libhdfs], a JNI-based
C API for talking to hdfs (only).
-   * [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/pipes/package-summary.html
Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible  C++ API (non-JNI) to write map-reduce
jobs.
+   * [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible  C++ API (non-JNI) to write map-reduce
jobs.
  
  
  [[BR]]
  [[Anchor(5)]]
  '''5. [#5 How can I help to make Hadoop better?]'''
  
- If you have trouble figuring out how to use Hadoop, then, once you've figured something out
(perhaps with the help of the [http://lucene.apache.org/hadoop/mailing_lists.html mailing
lists]), pass that knowledge on to others by adding something to this wiki.
+ If you have trouble figuring out how to use Hadoop, then, once you've figured something out
(perhaps with the help of the [http://hadoop.apache.org/core/mailing_lists.html mailing lists]),
pass that knowledge on to others by adding something to this wiki.
  
  If you find something that you wish were done better, and know how to fix it, read HowToContribute,
and contribute a patch.
  
@@ -124, +124 @@

  [[Anchor(9)]]
  '''9. [#9 MR. Can I create/write-to hdfs files directly from my map/reduce tasks?]'''
  
- Yes. (Clearly, you want this since you need to create/write-to files other than the output-file
written out by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html
OutputCollector].)
+ Yes. (Clearly, you want this since you need to create/write-to files other than the output-file
written out by [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html
OutputCollector].)
  
  Caveats:
  
  <glossary>
  
- ${mapred.output.dir} is the eventual output directory for the job ([http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath] / [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()
JobConf.getOutputPath]).
+ ${mapred.output.dir} is the eventual output directory for the job ([http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath] / [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()
JobConf.getOutputPath]).
  
  ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0);
a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
  
@@ -142, +142 @@

  
  The application-writer can take advantage of this by creating any side-files required in
${mapred.output.dir} during execution of his reduce-task, and the framework will move them
out similarly - thus you don't have to pick unique paths per task-attempt.
  
- Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt
is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir}
from your reduce task to take advantage of this feature.''
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt
is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)
JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir}
from your reduce task to take advantage of this feature.''
  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since
output of the map, in that case, goes directly to hdfs.
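  For illustration, here is a minimal sketch of this side-file pattern, assuming the 0.16-era
mapred API; the class and side-file names are hypothetical, not part of the framework:
  {{{
  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Hedged sketch: a reducer that writes a side-file beside its regular
  // output.  Inside a running task, "mapred.output.dir" already points at
  // the per-attempt directory, so on success the framework promotes the
  // side-file to the real output directory (see the fine-print above).
  public class SideFileReducer extends MapReduceBase implements Reducer {
    private JobConf conf;
    public void configure(JobConf job) { this.conf = job; }
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      // Hypothetical side-file name; any path under mapred.output.dir works.
      Path side = new Path(conf.get("mapred.output.dir"), "side-" + key);
      FSDataOutputStream out = fs.create(side);
      out.writeBytes(key + "\n");  // whatever side-data the job needs
      out.close();
    }
  }
  }}}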
  
@@ -151, +151 @@

  [[Anchor(10)]]
  '''10. [#10 MR. How do I get each of my maps to work on one complete input-file and not
allow the framework to split up my files?]'''
  
- Essentially a job's input is represented by the [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/InputFormat.html
InputFormat](interface)/[http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html
FileInputFormat](base class).
+ Essentially a job's input is represented by the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
InputFormat](interface)/[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html
FileInputFormat](base class).
  
- For this purpose one would need a 'non-splittable' [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html
FileInputFormat], i.e. an input-format that tells the map-reduce framework that the input
files cannot be split up and processed in pieces. To do this you need your particular input-format
to return '''false''' for the [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path)
isSplitable] call.
+ For this purpose one would need a 'non-splittable' [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html
FileInputFormat], i.e. an input-format that tells the map-reduce framework that the input
files cannot be split up and processed in pieces. To do this you need your particular input-format
to return '''false''' for the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path)
isSplitable] call.
   
- E.g. '''org.apache.hadoop.mapred.Sort``Validator.Record``Stats``Checker.Non``Splitable``Sequence``File``Input``Format'''
in [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java
src/test/org/apache/hadoop/mapred/SortValidator.java]
+ E.g. '''org.apache.hadoop.mapred.Sort``Validator.Record``Stats``Checker.Non``Splitable``Sequence``File``Input``Format'''
in [http://svn.apache.org/viewvc/hadoop/core/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java
src/test/org/apache/hadoop/mapred/SortValidator.java]
  
  In addition to implementing the InputFormat interface and having isSplitable(...) return
false, it is also necessary to implement the RecordReader interface to return the whole
content of the input file. (The default is LineRecordReader, which splits the file into
separate lines.)
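  For illustration, a sketch of such an input format, assuming the 0.16-era mapred API (the
class name is hypothetical):
  {{{
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.TextInputFormat;

  // Never split input files: each file then becomes exactly one split and
  // is processed by a single map.  A whole-file RecordReader would still
  // be needed to get one record per file, as noted above.
  public class NonSplittableTextInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;
    }
  }
  }}}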
  
- The other, quick-fix option is to set [http://lucene.apache.org/hadoop/hadoop-default.html#mapred.min.split.size
mapred.min.split.size] to a large enough value.
+ The other, quick-fix option is to set [http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size
mapred.min.split.size] to a large enough value.
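  From Java the same quick fix can be applied per-job; a hedged sketch, assuming JobConf
(the job class shown is hypothetical):
  {{{
  // Setting the minimum split size very high makes each split cover a
  // whole file, so no file is divided between maps.
  JobConf conf = new JobConf(MyJob.class);  // MyJob is hypothetical
  conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
  }}}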
  
  
  [[BR]]
@@ -179, +179 @@

  
  Depending on how safe mode parameters are configured, the name-node will stay in safe mode

  until a specific percentage of blocks of the system is ''minimally'' replicated 
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.replication.min dfs.replication.min].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.replication.min dfs.replication.min].
  If the safe mode threshold 
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.safemode.threshold.pct dfs.safemode.threshold.pct]

+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.safemode.threshold.pct
dfs.safemode.threshold.pct] 
  is set to 1, then all blocks of all
  files should be minimally replicated.
  
@@ -189, +189 @@

  order to replicate them the name-node needs to leave safe mode.
  
  Learn more about safe mode
- [http://lucene.apache.org/hadoop/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction)
   here].
+ [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction)
   here].
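  For instance, a client can query safe mode programmatically. A hedged sketch against the
0.16-era org.apache.hadoop.dfs API, roughly the programmatic equivalent of the
'bin/hadoop dfsadmin -safemode get' shell command:
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.dfs.DistributedFileSystem;
  import org.apache.hadoop.dfs.FSConstants;
  import org.apache.hadoop.fs.FileSystem;

  // Ask the name-node whether it is currently in safe mode (SAFEMODE_GET
  // only queries; SAFEMODE_ENTER/SAFEMODE_LEAVE would toggle it).
  public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
      boolean safe = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
      System.out.println("Safe mode is " + (safe ? "ON" : "OFF"));
    }
  }
  }}}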
  
  
  [[BR]]
  [[Anchor(13)]]
  '''13. [#13 MR. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker;
how do I increase that?]'''
  
- Use the configuration knob: [http://lucene.apache.org/hadoop/hadoop-default.html#mapred.tasktracker.tasks.maximum
mapred.tasktracker.tasks.maximum] to control the number of maps/reduces spawned simultaneously
on a !TaskTracker. By default, it is set to ''2'', hence one sees a maximum of 2 maps and
2 reduces at any given time on a !TaskTracker.
+ Use the configuration knob: [http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.tasks.maximum
mapred.tasktracker.tasks.maximum] to control the number of maps/reduces spawned simultaneously
on a !TaskTracker. By default, it is set to ''2'', hence one sees a maximum of 2 maps and
2 reduces at any given time on a !TaskTracker.
  
  Caveats:
     * ''mapred.tasktracker.tasks.maximum'' is a cluster-wide limit, i.e. controlled at
the !JobTracker end. [http://issues.apache.org/jira/browse/HADOOP-1245 HADOOP-1245] should
fix that.
@@ -232, +232 @@

  ''Data-nodes'' can store blocks in multiple directories, typically allocated on different
local disk drives.
  In order to set up multiple directories one needs to specify a comma-separated list of pathnames
as the value of
  the configuration parameter 
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.data.dir dfs.data.dir].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.data.dir dfs.data.dir].
  Data-nodes will attempt to place equal amounts of data in each of the directories.
  
  The ''name-node'' also supports multiple directories, which in this case store the name space
image and the edits log.
  The directories are specified via the 
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.name.dir dfs.name.dir]
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.name.dir dfs.name.dir]
  configuration parameter.
  The name-node directories are used for the name space data replication, so that the image
and the log can be restored from the remaining volumes if one of them fails.
@@ -264, +264 @@

  Hadoop offers the ''decommission'' feature to retire a set of existing data-nodes.
  The nodes to be retired should be included in the ''exclude file'', and the exclude file
name should be specified as the configuration parameter
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.hosts.exclude dfs.hosts.exclude].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.hosts.exclude dfs.hosts.exclude].
  Then the shell command
  {{{
  bin/hadoop dfsadmin -refreshNodes
