hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "FAQ" by ChristophSchmitz
Date Wed, 20 Apr 2011 09:39:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "FAQ" page has been changed by ChristophSchmitz.
http://wiki.apache.org/hadoop/FAQ?action=diff&rev1=95&rev2=96

--------------------------------------------------

  $ bin/hadoop-daemon.sh start datanode
  $ bin/hadoop-daemon.sh start tasktracker
  }}}
- 
  If you are using the dfs.include/mapred.include functionality, you will also need to add
the node to the dfs.include/mapred.include file, then issue {{{hadoop dfsadmin -refreshNodes}}}
and {{{hadoop mradmin -refreshNodes}}} so that the NameNode and JobTracker are aware of the
new node.
  
  == Is there an easy way to see the status and health of a cluster? ==
@@ -92, +91 @@

   * [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html|Hadoop
Pipes]], a [[http://www.swig.org/|SWIG]]-compatible  C++ API (non-JNI) to write map-reduce
jobs.
  
  == How do I submit extra content (jars, static files, etc) for my job to use during runtime?
==
- 
  The [[http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html|distributed
cache]] feature is used to distribute large read-only files that are needed by map/reduce
jobs to the cluster. The framework copies the necessary files from a URL (either hdfs: or http:)
onto the slave node before any tasks for the job are executed on that node. The files are
copied only once per job and so should not be modified by the application.
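
  As a rough sketch using the old ''org.apache.hadoop.mapred'' API (the HDFS path and class
names below are purely illustrative), a job can register a cache file on the client side and
then look up the local copy from within a task:

  {{{
  import java.io.IOException;
  import java.net.URI;

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  public class CacheExample {

    // Client side: register a read-only file that already lives on HDFS.
    public static void addLookupFile(JobConf conf) {
      DistributedCache.addCacheFile(URI.create("/data/lookup.dat"), conf);
    }

    // Task side: locate the local copy the framework placed on the slave node.
    // (A real task would also implement the Mapper or Reducer interface.)
    public static class CacheAwareTask extends MapReduceBase {
      @Override
      public void configure(JobConf job) {
        try {
          Path[] localCopies = DistributedCache.getLocalCacheFiles(job);
          // open localCopies[0] with java.io and read the lookup data
        } catch (IOException e) {
          throw new RuntimeException("Could not locate cached file", e);
        }
      }
    }
  }
  }}}

  The paths returned by getLocalCacheFiles point at local copies on the tasktracker node,
so they can be opened with ordinary Java file I/O.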
  
  For streaming, see the HadoopStreaming wiki for more information.
@@ -101, +99 @@

  
  == How do I get my MapReduce Java Program to read the Cluster's set configuration and not
just defaults? ==
  The configuration property files ({core|mapred|hdfs}-site.xml) available in the various
'''conf/''' directories of your Hadoop installation need to be on the '''CLASSPATH''' of your
Java application for them to be found and applied. Another way of ensuring that the cluster's
settings are not overridden by any job is to mark those properties as final; for example:
+ 
  {{{
  <name>mapreduce.task.io.sort.mb</name>
  <value>400</value>
  <final>true</final>
  }}}
- 
  Marking configuration properties as final is common practice for administrators, as noted
in the [[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/conf/Configuration.html|Configuration]]
API docs.
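
  If placing the cluster's '''conf/''' directory on the CLASSPATH is inconvenient, a rough
equivalent is to add the site files to the Configuration explicitly. The sketch below assumes
the cluster configuration lives under /etc/hadoop/conf; adjust the paths for your installation:

  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;

  public class ClusterConfExample {
    public static void main(String[] args) {
      Configuration conf = new Configuration();

      // Without the site files on the CLASSPATH only the bundled *-default.xml
      // values are visible; layer the cluster's settings on top of the defaults.
      conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
      conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
      conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));

      System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    }
  }
  }}}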
  
  A better alternative would be to have a service serve up the cluster's configuration to
you upon request, in code. [[https://issues.apache.org/jira/browse/HADOOP-5670|HADOOP-5670]]
may be of interest in this regard.
@@ -122, +120 @@

  
  With ''speculative-execution'' '''on''', one could face issues with two instances of the
same TIP (running simultaneously) trying to open or write to the same file (path) on hdfs.
Hence the application-writer will have to pick unique names per task-attempt (e.g. using the
complete taskid, i.e. task_200709221812_0001_m_000000_0), not just per TIP. (Clearly, this
needs to be done even if the user doesn't create or write to files directly via reduce tasks.)
  
- To get around this the framework helps the application-writer out by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir for each task-attempt on hdfs where the output
of the reduce task-attempt goes. On successful completion of the task-attempt the files in
the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}.
Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is
completely transparent to the application.
+ To get around this the framework helps the application-writer out by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir for each reduce task-attempt on hdfs where the
output of the reduce task-attempt goes. On successful completion of the task-attempt the files
in the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}.
Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is
completely transparent to the application.
  
  The application-writer can take advantage of this by creating any side-files required in
${mapred.output.dir} during execution of the reduce task, and the framework will move them
out similarly - thus you don't have to pick unique paths per task-attempt.
  
- Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt
is actually ${mapred.output.dir}/_{$taskid}, not the value set by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]].
''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to
take advantage of this feature.''
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular ''reduce''
task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path)|JobConf.setOutputPath]].
''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to
take advantage of this feature.''
+ 
+ For ''map'' task attempts, the automatic substitution of ${mapred.output.dir}/_${taskid}
for ${mapred.output.dir} does not take place. You can still access the map task attempt
directory, though, by using FileOutputFormat.getWorkOutputPath(TaskInputOutputContext). Files
created there will be dealt with as described above.
  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces) since
output of the map, in that case, goes directly to hdfs.
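
  To make the above concrete, here is a rough sketch using the old ''org.apache.hadoop.mapred''
API (the side-file name and helper method are made up for illustration). A task obtains its
private work output directory from the framework and writes side-files there; with the new
API, FileOutputFormat.getWorkOutputPath(TaskInputOutputContext) plays the same role:

  {{{
  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.MapReduceBase;

  // A real task would also implement the Mapper or Reducer interface.
  public class SideFileExample extends MapReduceBase {
    private JobConf job;

    @Override
    public void configure(JobConf job) {
      this.job = job;
    }

    // Call from map() or reduce(): getWorkOutputPath resolves to the
    // task-attempt's private output directory described above, so speculative
    // attempts cannot clobber each other's files.
    private void writeSideFile() throws IOException {
      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      Path sideFile = new Path(workDir, "side-data.txt");
      FileSystem fs = sideFile.getFileSystem(job);
      FSDataOutputStream out = fs.create(sideFile);
      out.writeBytes("some side output\n");
      out.close();
    }
  }
  }}}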
  
@@ -281, +281 @@

  = Platform Specific =
  == Mac OS X ==
  === Building on Mac OS X 10.6 ===
- 
  Be aware that Apache Hadoop 0.22 and earlier require Apache Forrest to build the documentation.
As of Snow Leopard, Apple no longer ships Java 1.5, which Apache Forrest requires, so you will
need to install it yourself. This can be accomplished by either copying /System/Library/Frameworks/JavaVM.Framework/Versions/1.5
and 1.5.0 from a 10.5 machine or by using a utility like Pacifist to install from an official
Apple package. http://chxor.chxo.com/post/183013153/installing-java-1-5-on-snow-leopard provides
some step-by-step directions.
- 
  
  == Solaris ==
  === Why do files and directories show up as DrWho and/or user names are missing/weird? ===
