hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "FAQ" by Arun C Murthy
Date Sat, 22 Sep 2007 10:17:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/lucene-hadoop/FAQ

The comment on the change is:
Made each FAQ a link to facilitate easy sharing via emails etc.

------------------------------------------------------------------------------
  = Hadoop FAQ =
  
- == 1. What is Hadoop? ==
+ [[BR]]
+ [[Anchor(1)]]
+ '''1. [#1 What is Hadoop?]'''
  
  [http://lucene.apache.org/hadoop/ Hadoop] is a distributed computing platform written in
Java.  It incorporates features similar to those of the [http://en.wikipedia.org/wiki/Google_File_System
Google File System] and of [http://en.wikipedia.org/wiki/MapReduce MapReduce].  For some details,
see HadoopMapReduce.
  
+ [[BR]]
+ [[Anchor(2)]]
- == 2. What platform does Hadoop run on? ==
+ '''2. [#2 What platform does Hadoop run on?]'''
  
    1. Java 1.5.x or higher, preferably from Sun
    2. Linux and Windows are the supported operating systems, but BSD and Mac OS/X are known
to work. (Windows requires the installation of [http://www.cygwin.com/ Cygwin]).
  
+ 
+ [[Anchor(2.1)]]
- === 2.1 Building / Testing Hadoop on Windows ===
+ ''2.1 [#2.1 Building / Testing Hadoop on Windows]''
  
  The Hadoop build on Windows can be run from inside a Windows (not cygwin) command prompt
window.
  
@@ -31, +37 @@

  
  Other targets work similarly. This is documented here because I spent some time figuring out why the ant build would not run from a cygwin command prompt window; if you are building/testing on Windows and haven't figured it out yet, this should get you started.
  
+ 
+ [[BR]]
+ [[Anchor(3)]]
- == 3. How well does Hadoop scale? ==
+ '''3. [#3 How well does Hadoop scale?]'''
  
  Hadoop has been demonstrated on clusters of up to 2000 nodes.  Sort performance on 900 nodes
is good (sorting 9TB of data on 900 nodes takes around 2.25 hours) and [attachment:sort900-20070815.png
improving] using these non-default configuration values:
  
@@ -50, +59 @@

    * `tasktracker.http.threads = 50`
    * `mapred.child.java.opts = -Xmx1024m`
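
  For example, a per-job setting such as `mapred.child.java.opts` can also be overridden from Java when submitting a job. The sketch below is illustrative only (the class name is made up); cluster-wide settings such as `tasktracker.http.threads` are normally set in conf/hadoop-site.xml on the cluster nodes instead.

{{{
// Illustrative sketch (not part of the FAQ): overriding a per-job property
// from Java.  Cluster-level properties such as tasktracker.http.threads are
// normally set in conf/hadoop-site.xml rather than from job code.
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SortJobExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SortJobExample.class);
    conf.setJobName("sort-example");
    // Give each map/reduce child JVM a 1GB heap, as in the sort runs above.
    conf.set("mapred.child.java.opts", "-Xmx1024m");
    // ... set input/output paths, mapper/reducer classes, etc. ...
    JobClient.runJob(conf);
  }
}
}}}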
  
+ 
+ [[BR]]
+ [[Anchor(4)]]
- == 4. Do I have to write my application in Java? ==
+ '''4. [#4 Do I have to write my application in Java?]'''
  
  No.  There are several ways to incorporate non-Java code.  
    * HadoopStreaming permits any shell command to be used as a map or reduce function. 
    * [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs libhdfs], a JNI-based
C API for talking to hdfs (only).
    * [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/pipes/package-summary.html
Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible  C++ API (non-JNI) to write map-reduce
jobs.
  
+ 
+ [[BR]]
+ [[Anchor(5)]]
- == 5. How can I help to make Hadoop better? ==
+ '''5. [#5 How can I help to make Hadoop better?]'''
  
  If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the [http://lucene.apache.org/hadoop/mailing_lists.html mailing lists]), pass that knowledge on to others by adding something to this wiki.
  
  If you find something that you wish were done better, and know how to fix it, read HowToContribute,
and contribute a patch.
  
+ 
+ [[BR]]
+ [[Anchor(6)]]
- == 6. If I add new data-nodes to the cluster will HDFS move the blocks to the newly added
nodes in order to balance disk space utilization between the nodes? ==
+ '''6. [#6 If I add new data-nodes to the cluster will HDFS move the blocks to the newly
added nodes in order to balance disk space utilization between the nodes?]'''
  
  No, HDFS will not move blocks to new nodes automatically. However, newly created files will
likely have their blocks placed on the new nodes.
  
@@ -72, +90 @@

   2. A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down (see the sketch after this list).
   3. Yet another way to re-balance blocks is to turn off the data-node that is full, wait until its blocks are replicated elsewhere, and then bring it back. The over-replicated blocks will then be removed at random from different nodes, so the blocks really are rebalanced rather than just removed from the current node.
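
  A minimal sketch of the second approach using the Java FileSystem API (the path and replication factors below are made-up examples; the same effect can also be had from the `bin/hadoop dfs` command line):

{{{
// Illustrative sketch (not part of the FAQ): temporarily raise the replication
// factor of a file so its blocks get copied onto the new data-nodes, then
// lower it again once the transfers have stabilized.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rereplicate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/example/data/part-00000");  // made-up path
    fs.setReplication(file, (short) 6);  // raise replication, wait for transfers
    // ... later, after the blocks have spread out:
    fs.setReplication(file, (short) 3);  // turn replication back down
  }
}
}}}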
  
+ 
+ [[BR]]
+ [[Anchor(7)]]
- == 7. What is the purpose of the secondary name-node? ==
+ '''7. [#7 What is the purpose of the secondary name-node?]'''
  
  The term "secondary name-node" is somewhat misleading.
  It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node,
@@ -91, +112 @@

  You will also need to restart the whole cluster in this case.
  
  
+ [[BR]]
+ [[Anchor(8)]]
- == 8. What is the Distributed Cache used for? ==
+ '''8. [#8 What is the Distributed Cache used for?]'''
  
  The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job, and so should not be modified by the application.
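
  A rough sketch of how a job might use it (the class and file names below are made-up examples, not from the FAQ):

{{{
// Illustrative sketch: distribute a read-only lookup file to every node and
// open the local copy from a task.  The file and class names are made up.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class CacheExample extends MapReduceBase {
  private Path lookupFile;

  // At job-submission time: register a file that already lives in HDFS.
  public static void addLookupFile(JobConf job) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/example/lookup.dat"), job);
  }

  // Inside the task: the framework has already copied the file to local disk.
  public void configure(JobConf job) {
    try {
      Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
      lookupFile = localFiles[0];  // read this with ordinary java.io
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
}}}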
  
+ 
+ [[BR]]
+ [[Anchor(9)]]
- == 9. Can I create/write-to hdfs files directly from my map/reduce tasks? ==
+ '''9. [#9 Can I create/write-to hdfs files directly from my map/reduce tasks?]'''
  
  Yes. (Clearly, you want this since you need to create/write-to files other than the output-file
written out by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html
OutputCollector].)
  
@@ -113, +139 @@

  
  To get around this the framework helps the application-writer out by maintaining a special
'''${mapred.output.dir}/_${taskid}''' sub-dir for each task-attempt on hdfs where the output
of the reduce task-attempt goes. On successful completion of the task-attempt the files in
the ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}.
Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is
completely transparent to the application.
  
- The app-writer can take advantage of this by creating any side-files required in ${mapred.output.dir}
during execution of his reduce-task, and the framework will move them out similarly - thus
you don't have to pick unique paths per task-attempt.
+ The application-writer can take advantage of this by creating any side-files required in
${mapred.output.dir} during execution of his reduce-task, and the framework will move them
out similarly - thus you don't have to pick unique paths per task-attempt.
  
  Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
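
  A rough sketch of creating such a side-file from a reduce task (the class and file names below are made-up examples): because ${mapred.output.dir} resolves to the per-attempt sub-directory during execution, the side-file is promoted to the real output directory only if the attempt succeeds.

{{{
// Illustrative sketch: create a side-file in ${mapred.output.dir} from a
// reduce task.  getOutputPath() here resolves to the per-task-attempt
// sub-directory described above, so the framework promotes or discards the
// file along with the attempt.  Names are made-up examples.
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class SideFileReducer extends MapReduceBase {
  private FSDataOutputStream sideFile;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      Path outDir = job.getOutputPath();  // ${mapred.output.dir} for this attempt
      sideFile = fs.create(new Path(outDir, "side-file"));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void close() throws IOException {
    sideFile.close();
  }
  // ... write to sideFile from reduce() as needed.
}
}}}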
  
