hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "TestFaqPage" by SomeOtherAccount
Date Wed, 06 Oct 2010 22:05:19 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "TestFaqPage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/TestFaqPage?action=diff&rev1=2&rev2=3

--------------------------------------------------

   * [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/c++/libhdfs|libhdfs]], a JNI-based
C API for talking to hdfs (only).
   * [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop
Pipes]], a [[http://www.swig.org/|SWIG]]-compatible C++ API (non-JNI) to write map-reduce
jobs.
  
- <<BR>> <<Anchor(2.2)>> '''2. [[#A2.2|What is the Distributed Cache
used for?]]'''
+ == What is the Distributed Cache used for? ==
  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce
jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs:
or http:) onto the slave node before any tasks for the job are executed on that node. The
files are only copied once per job and so should not be modified by the application.
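  
  As a minimal sketch (not part of the original FAQ; it assumes the old org.apache.hadoop.mapred / org.apache.hadoop.filecache API, and both the class name and the file hdfs://namenode/lookup/terms.txt are hypothetical), a job might register a cache file, and a task might read back its local copy, along these lines:
  
  {{{
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheExample.class);
    // Register a read-only HDFS file; the framework copies it to each slave
    // node once per job before any of the job's tasks run on that node.
    DistributedCache.addCacheFile(new URI("hdfs://namenode/lookup/terms.txt"), conf);
  }

  // Inside a Mapper's or Reducer's configure(JobConf job), read back the local copies:
  static Path[] localCopies(JobConf job) throws java.io.IOException {
    return DistributedCache.getLocalCacheFiles(job);  // paths on the slave's local disk
  }
}
  }}}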
  
- <<BR>> <<Anchor(2.3)>> '''3. [[#A2.3|Can I write create/write-to
hdfs files directly from my map/reduce tasks?]]'''
+ == Can I create/write-to hdfs files directly from my map/reduce tasks? ==
  
  Yes. (Clearly, you want this since you need to create/write-to files other than the output-file
written out by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
  
  Caveats:
  
- <glossary>
- 
  ${mapred.output.dir} is the eventual output directory for the job ([[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]]
/ [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath()|JobConf.getOutputPath]]).
  
  ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0),
while a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
  
- </glossary>
- 
  With ''speculative-execution'' '''on''', one could face issues with 2 instances of the same
TIP (running simultaneously) trying to open/write-to the same file (path) on hdfs. Hence the
app-writer will have to pick unique names (e.g. using the complete taskid i.e. task_200709221812_0001_m_000000_0)
per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't
create/write-to files directly via reduce tasks.)
  
  To get around this, the framework helps the application-writer by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir on hdfs for each task-attempt, where the output
of the reduce task-attempt goes. On successful completion of the task-attempt, the files in
${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}.
Of course, the framework discards the sub-directories of unsuccessful task-attempts. This is
completely transparent to the application.
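  
  As an illustrative sketch (not part of the original FAQ; it assumes a Hadoop release whose old org.apache.hadoop.mapred API provides FileOutputFormat.getWorkOutputPath, and the helper class name here is made up), a task-attempt can create side-files inside its private sub-dir and let the framework promote them on success:
  
  {{{
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFileHelper {
  // Creates a side-file under ${mapred.output.dir}/_${taskid}; on successful
  // completion of the attempt the framework moves it to ${mapred.output.dir}.
  public static FSDataOutputStream createSideFile(JobConf job, String name) throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);  // the attempt's private sub-dir
    FileSystem fs = workDir.getFileSystem(job);
    return fs.create(new Path(workDir, name));
  }
}
  }}}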
@@ -125, +121 @@

  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since
the output of the map, in that case, goes directly to hdfs.
  
- <<BR>> <<Anchor(2.4)>> '''4. [[#A2.4|How do I get each of my maps
to work on one complete input-file and not allow the framework to split-up my files?]]'''
+ == How do I get each of my maps to work on one complete input-file and not allow the framework
to split up my files? ==
  
  Essentially a job's input is represented by the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]](interface)/[[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]](base
class).
  
@@ -137, +133 @@

  
  The other, quick-fix option is to set [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size|mapred.min.split.size]]
to a large enough value.
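  
  For example (a sketch of my own, not from the original page; the class name is hypothetical), one could set it per job so that no input file is ever split:
  
  {{{
import org.apache.hadoop.mapred.JobConf;

public class OneSplitPerFile {
  public static void main(String[] args) {
    JobConf conf = new JobConf(OneSplitPerFile.class);
    // No split will be smaller than this value, so with a value larger than any
    // input file, each file becomes a single split (and hence a single map).
    conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
  }
}
  }}}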
  
- <<BR>> <<Anchor(2.5)>> '''5. [[#A2.5|Why I do see broken images
in jobdetails.jsp page?]]'''
+ == Why do I see broken images in the jobdetails.jsp page? ==
  
  In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced
as SVG (Scalable Vector Graphics) images, which are basically XML files embedded in the HTML content.
The graphics were tested successfully in Firefox 2 on Ubuntu and Mac OS. However, for other
browsers, one should install an additional browser plugin to see the SVG images. Adobe's
SVG Viewer can be found at http://www.adobe.com/svg/viewer/install/.
  
- <<BR>> <<Anchor(2.6)>> '''6. [[#A2.6|I see a maximum of 2 maps/reduces
spawned concurrently on each TaskTracker, how do I increase that?]]'''
+ == I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker, how do I
increase that? ==
  
  Use the configuration knobs [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum|mapred.tasktracker.map.tasks.maximum]]
and [[http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.reduce.tasks.maximum|mapred.tasktracker.reduce.tasks.maximum]]
to control the number of maps/reduces spawned simultaneously on a !TaskTracker. By default,
both are set to ''2'', hence one sees a maximum of 2 maps and 2 reduces at a given instant on
a !TaskTracker.
  
  You can set those on a per-tasktracker basis to accurately reflect your hardware (i.e. set
them to higher numbers on a beefier tasktracker, etc.).
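  
  For instance (an illustrative snippet, not from the original page; the values 8 and 4 are made-up examples), a beefier tasktracker's hadoop-site.xml might contain:
  
  {{{
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
  }}}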
  
- <<BR>> <<Anchor(2.7)>> '''7. [[#A2.7|Submitting map/reduce jobs
as a different user doesn't work.]]'''
+ == Submitting map/reduce jobs as a different user doesn't work. ==
  
  The problem is that you haven't configured your map/reduce system directory to a fixed
value. The default works for single-node systems, but not for "real" clusters. I like to
use:
  
@@ -159, +155 @@

     </description>
  </property>
  }}}
- Note that this directory is in your default file system and must be   accessible from both
the client and server machines and is typically   in HDFS.
+ Note that this directory is in your default file system and must be accessible from both
the client and server machines and is typically in HDFS.
  
- <<BR>> <<Anchor(2.8)>> '''8. [[#A2.8|How do Map/Reduce InputSplit's
handle record boundaries correctly?]]'''
+ == How do Map/Reduce InputSplits handle record boundaries correctly? ==
  
  It is the responsibility of the InputSplit's RecordReader to start and end at a record boundary.
For SequenceFiles there is a 20-byte '''sync''' mark between records approximately every 2k bytes. These
sync marks allow the RecordReader to seek to the start of the InputSplit (which is described by a
file, an offset and a length) and then find the first sync mark after that start. The RecordReader
continues processing records until it reaches the first sync mark after the end of the split.
The first split of each file naturally starts immediately and not after the first sync mark.
In this way, it is guaranteed that each record will be processed by exactly one mapper.
  
  Text files are handled similarly, using newlines instead of sync marks.
  
- <<BR>> <<Anchor(2.9)>> '''9. [[#A2.9|How do I change final output
file name with the desired name rather than in partitions like part-00000, part-00001 ?]]'''
+ == How do I give the final output files desired names rather than partition names like
part-00000, part-00001? ==
  
  You can subclass the [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/OutputFormat.java?view=markup|OutputFormat.java]]
class and write your own. You can look at the code of [[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java?view=markup|TextOutputFormat]],
[[http://svn.apache.org/viewvc/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MultipleOutputFormat.java?view=markup|MultipleOutputFormat.java]],
etc. for reference. It might be the case that you only need to make minor changes to one of
the existing OutputFormat classes; to do that, you can just subclass it and override
the methods you need to change.
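  
  As a sketch of the "minor changes" case (my own example, not from the original page; it assumes the old mapred API's MultipleTextOutputFormat and its generateFileNameForKeyValue hook, and the class name is hypothetical), output files can be named after the key instead of part-NNNNN:
  
  {{{
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Names each output file after its key rather than the default part-NNNNN.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default leaf name (e.g. part-00000); return the desired name instead.
    return key.toString();
  }
}
  }}}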
  
- <<BR>> <<Anchor(2.10)>> ''10. [[#A2.10|When writing a New InputFormat,
what is the format for the array of string returned by InputSplit\#getLocations()?]]''
+ == When writing a new InputFormat, what is the format for the array of strings returned by
InputSplit\#getLocations()? ==
  
  It appears that DatanodeID.getHost() is the standard place to retrieve this name, and the
machineName variable, populated in DataNode.java\#startDataNode, is where the name is first
set. The first method attempted is to get "slave.host.name" from the configuration; if that
is not available, DNS.getDefaultHost is used instead.
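  
  In other words (my reading of the above, sketched as an assumption rather than quoted from the page; the class and host names are made up), getLocations() should return plain host names, in the same form that DatanodeID.getHost() reports:
  
  {{{
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;

// A toy split whose data (hypothetically) lives on two slaves.
public class ToySplit implements InputSplit {
  public long getLength() { return 0L; }

  public String[] getLocations() throws IOException {
    // Bare host names, not host:port pairs or URIs.
    return new String[] { "slave17.example.com", "slave42.example.com" };
  }

  public void write(DataOutput out) throws IOException { }
  public void readFields(DataInput in) throws IOException { }
}
  }}}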
  
- <<BR>> <<Anchor(2.11)>> '''11. [[#A2.11|How do you gracefully stop
a running job?]]'''
+ == How do you gracefully stop a running job? ==
  
+ {{{
  hadoop job -kill JOBID
+ }}}
  
+ = HDFS =
- 
- <<BR>> <<Anchor(3)>> [[#A3|HDFS]]
  
  <<BR>> <<Anchor(3.1)>> '''1. [[#A3.1|If I add new data-nodes to
the cluster will HDFS move the blocks to the newly added nodes in order to balance disk space
utilization between the nodes?]]'''
  
