hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "FAQ" by SomeOtherAccount
Date Fri, 22 Oct 2010 15:42:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "FAQ" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/FAQ?action=diff&rev1=81&rev2=82

--------------------------------------------------

  $ bin/hadoop-daemon.sh start tasktracker
  }}}
  
- == Is there an easy way to see the status and health of my cluster? ==
+ == Is there an easy way to see the status and health of a cluster? ==
  
  There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
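  
  For a quick check from the command line, the HDFS admin tool prints a capacity and live/dead datanode summary (a minimal sketch; run it from the Hadoop install directory):
  
  {{{
  $ bin/hadoop dfsadmin -report
  }}}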
  
@@ -87, +87 @@

  
  If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.
  
- == I am seeing connection refused in my logs.  How do I troubleshoot this? ==
+ == I am seeing connection refused in the logs.  How do I troubleshoot this? ==
  
  See ConnectionRefused.
  
  = MapReduce =
  
- == Do I have to write my application in Java? ==
+ == Do I have to write my job in Java? ==
  
  No.  There are several ways to incorporate non-Java code.
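  
  For example, Hadoop Streaming lets any executable act as the mapper or reducer. A minimal sketch, assuming a contrib/ layout; the exact streaming jar name varies by release:
  
  {{{
  $ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
      -input /user/me/input \
      -output /user/me/output \
      -mapper /bin/cat \
      -reducer /usr/bin/wc
  }}}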
  
@@ -105, +105 @@

  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs across the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
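  
  From the command line, any job that goes through ToolRunner can populate the cache with the generic -files option. A sketch, where MyJob.jar, org.example.MyJob, and lookup.dat are hypothetical names:
  
  {{{
  $ bin/hadoop jar MyJob.jar org.example.MyJob \
      -files hdfs://name.node.addr/data/lookup.dat \
      /user/me/input /user/me/output
  }}}
  
  Each task can then read lookup.dat from its current working directory, where the framework links the cached copy.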
  
- == Can I create/write-to hdfs files directly from my map/reduce tasks? ==
+ == Can I create/write-to hdfs files directly from map/reduce tasks? ==
  
  Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html|OutputCollector]].)
  
@@ -125, +125 @@

  
  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to hdfs.
  
- == How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files? ==
+ == How do I get each of a job's maps to work on one complete input-file and not allow the framework to split up the files? ==
  
  Essentially a job's input is represented by the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html|InputFormat]] (interface) / [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html|FileInputFormat]] (base class).
  
@@ -181, +181 @@

  hadoop job -kill JOBID
  }}}
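  
  If you do not know the JOBID, the same tool can list running jobs first (output format varies by version):
  
  {{{
  hadoop job -list
  }}}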
  
- == How do I limit the total number of concurrent tasks my job may have running at a time? ==
+ == How do I limit the total number of concurrent tasks a job may have running at a time? ==
  
+ See LimitingTaskSlotUsage.
- Typically when this question is asked, it is because a job is referencing something external to Hadoop that has some sort of limit on it, such as reading or writing from a database.  In Hadoop terms, we call this a 'side-effect'.
- 
- One of the general assumptions of the framework is that there are not any side-effects. All tasks are expected to be restartable and a side-effect typically goes against the grain of this rule.
- 
- If a task absolutely must break the rules, there are a few things one can do:
- 
- * Deploy ZooKeeper and use it as a persistent lock to keep track of how many tasks are running concurrently
- * Use a scheduler with a maximum task-per-queue feature and submit the job to that queue
- 
- == How do I limit the number of concurrent tasks my job may have running on a given node at a time? ==
- 
- The CapacityScheduler in 0.21 has a feature whereby one may use RAM-per-task to limit how many slots a given task takes.  By careful use of this feature, one may limit how many concurrent tasks on a given node a job may take.
  
  = HDFS =
  
