hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "TestFaqPage" by SomeOtherAccount
Date Wed, 06 Oct 2010 22:01:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "TestFaqPage" page has been changed by SomeOtherAccount.
http://wiki.apache.org/hadoop/TestFaqPage?action=diff&rev1=1&rev2=2

--------------------------------------------------

- = Hadoop FAQ =
+ #pragma section-numbers on
  
+ '''Hadoop FAQ'''
- [[#A1|General]]
- [[#A2|MapReduce]]
- [[#A3|HDFS]]
- [[#A4|Platform Specific]]
  
- <<BR>> <<Anchor(1)>> [[#A1|General]]
+ <<TableOfContents(3)>>
  
- <<BR>> <<Anchor(1.1)>> '''1. [[#A1.1|What is Hadoop?]]'''
+ = General =
+ 
+ == What is Hadoop? ==
  
  [[http://hadoop.apache.org/core/|Hadoop]] is a distributed computing platform written in
Java.  It incorporates features similar to those of the [[http://en.wikipedia.org/wiki/Google_File_System|Google
File System]] and of [[http://en.wikipedia.org/wiki/MapReduce|MapReduce]].  For some details,
see HadoopMapReduce.
  
- <<BR>> <<Anchor(1.2)>> '''2. [[#A1.2|What platform does Hadoop run
on?]]'''
+ == What platform does Hadoop run on? ==
  
   1. Java 1.6.x or higher, preferably from Sun. See HadoopJavaVersions.
   1. Linux and Windows are the supported operating systems, but BSD, Mac OS X, and OpenSolaris are known to work. (Windows requires the installation of [[http://www.cygwin.com/|Cygwin]].)
  
- <<BR>> <<Anchor(1.3)>> '''3. [[#A1.3|How well does Hadoop scale?]]'''
+ == How well does Hadoop scale? ==
  
  Hadoop has been demonstrated on clusters of up to 4000 nodes.  Sort performance on 900 nodes is good (sorting 9TB of data on 900 nodes takes around 1.8 hours) and is [[attachment:sort900-20080115.png|improving]] with these non-default configuration values:
  
@@ -38, +37 @@

   * `tasktracker.http.threads = 50`
   * `mapred.child.java.opts = -Xmx1024m`
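  
  As a sketch of where such values go, an entry in conf/hadoop-site.xml (or conf/mapred-site.xml on later releases) takes the standard Hadoop property form; the value below simply restates the setting above:
  
  {{{
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  }}}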
  
- <<BR>> <<Anchor(1.4)>> '''4. [[#A1.4|What kind of hardware scales
best for Hadoop?]]'''
+ == What kind of hardware scales best for Hadoop? ==
  
  The short answer is dual-processor/dual-core machines with 4-8GB of RAM using ECC memory. Machines should be moderately high-end commodity machines to be most cost-effective; they typically cost 1/2 to 2/3 the price of normal production application servers but are not desktop-class machines. This tends to work out to $2-5K per machine. For a more detailed discussion, see the MachineScaling page.
  
- <<BR>> <<Anchor(1.5)>> '''5. [[#A1.5|How does GridGain compare to
Hadoop?]]'''
+ == How does GridGain compare to Hadoop? ==
  
  !GridGain does not support data-intensive jobs. For more details, see HadoopVsGridGain.
  
- <<BR>> <<Anchor(1.6)>> '''6. [[#A1.6|I have a new node I want to
add to a running Hadoop cluster; how do I start services on just one node?]]'''
+ == I have a new node I want to add to a running Hadoop cluster; how do I start services
on just one node? ==
  
  This also applies to the case where a machine has crashed and rebooted, etc., and you need it to rejoin the cluster. You do not need to shut down and/or restart the entire cluster in this case.
  
@@ -59, +58 @@

  $ bin/hadoop-daemon.sh start datanode
  $ bin/hadoop-daemon.sh start tasktracker
  }}}
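  
  Note that hadoop-daemon.sh starts the daemon on the machine it is run from, so run these commands on the new (or rebooted) node itself.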
- <<BR>> <<Anchor(1.7)>> '''7. [[#A1.7|Is there an easy way to see
the status and health of my cluster?]]'''
+ 
+ == Is there an easy way to see the status and health of my cluster? ==
  
  There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS
master) which display status pages about the state of the entire system. By default, these
are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
  
@@ -71, +71 @@

  $ bin/hadoop dfsadmin -report
  }}}
  
- <<BR>> <<Anchor(1.8)>> '''8. [[#A1.8|How much network bandwidth
might I need between racks in a medium size (40-80 node) Hadoop cluster?]]'''
+ == How much network bandwidth might I need between racks in a medium size (40-80 node) Hadoop
cluster? ==
  
  The true answer depends on the types of jobs you're running. As a back-of-the-envelope calculation, one might figure something like this:
  
@@ -81, +81 @@

  
  So, the simple answer is that 4-6Gbps is most likely just fine for most practical jobs.
If you want to be extra safe, many inexpensive switches can operate in a "stacked" configuration
where the bandwidth between them is essentially backplane speed. That should scale you to
96 nodes with plenty of headroom. Many inexpensive gigabit switches also have one or two 10GigE
ports which can be used effectively to connect to each other or to a 10GE core.
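  
  For illustration only (the numbers are assumptions, not measurements): if each of 40 nodes averages 100Mbps of cross-rack traffic during the shuffle, two racks of 20 nodes each would push roughly 20 x 100Mbps = 2Gbps across the inter-rack link, so a 4-6Gbps trunk leaves comfortable headroom.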
  
- <<BR>> <<Anchor(1.9)>> '''9. [[#A1.9|How can I help to make Hadoop
better?]]'''
+ == How can I help to make Hadoop better? ==
  
  If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the [[http://hadoop.apache.org/core/mailing_lists.html|mailing lists]]), pass that knowledge on to others by adding something to this wiki.
  
  If you find something that you wish were done better, and know how to fix it, read HowToContribute,
and contribute a patch.
  
- <<BR>> <<Anchor(2)>> [[#A2|MapReduce]]
+ = MapReduce =
  
- <<BR>> <<Anchor(2.1)>> '''1. [[#A2.1|Do I have to write my application
in Java?]]'''
+ == Do I have to write my application in Java? ==
  
  No.  There are several ways to incorporate non-Java code.
  
@@ -252, +252 @@

  bin/hadoop dfs -ls 'in*'
  }}}
  
- <<BR>> <<Anchor(3.8)>> '''8. [[#A3.8|Can I have multiple files in
HDFS use different block sizes?]]'''
+ == Can I have multiple files in HDFS use different block sizes? ==
  
  Yes. HDFS provides an API to specify the block size when you create a file. <<BR>> See [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20boolean,%20int,%20short,%20long)|FileSystem.create(Path, overwrite, bufferSize, replication, blockSize, progress)]]
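  
  As a minimal sketch of that API (the path, buffer size, replication, and block size below are illustrative values, not recommendations):
  
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  
  public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // Create this one file with a 128MB block size; other files keep
      // the cluster default.  All values here are illustrative.
      FSDataOutputStream out = fs.create(
          new Path("/user/example/big-file.dat"),   // hypothetical path
          true,                                     // overwrite
          conf.getInt("io.file.buffer.size", 4096), // bufferSize
          (short) 3,                                // replication
          128L * 1024 * 1024);                      // blockSize, in bytes
      out.writeBytes("hello");
      out.close();
    }
  }
  }}}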
  
- <<BR>> <<Anchor(3.9)>> '''9. [[#A3.9|Does HDFS make block boundaries
between records?]]'''
+ == Does HDFS make block boundaries between records? ==
  
  No, HDFS does not provide a record-oriented API and is therefore not aware of records or the boundaries between them.
  
- <<BR>> <<Anchor(3.10)>> '''10. [[#A3.10|HWhat happens when two clients
try to write into the same HDFS file?]]'''
+ == What happens when two clients try to write into the same HDFS file? ==
  
  HDFS supports exclusive writes only. <<BR>> When the first client contacts the
name-node to open the file for writing, the name-node grants a lease to the client to create
this file.  When the second client tries to open the same file for writing, the name-node
 will see that the lease for the file is already granted to another client, and will reject
the open request for the second client.
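  
  A minimal sketch of this behaviour (hypothetical class and path; run it twice, concurrently, from two different client machines, and the second run's create() is rejected while the first still holds the lease):
  
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  
  public class ExclusiveWriteDemo {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // The first client to run this is granted the lease; a second,
      // concurrent run gets an IOException from the name-node instead.
      FSDataOutputStream out = fs.create(new Path("/user/example/shared.dat"));
      Thread.sleep(60000);  // hold the file open (and the lease) for a minute
      out.close();
    }
  }
  }}}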
  
- <<BR>> <<Anchor(3.11)>> '''11. [[#A3.11|How to limit Data node's
disk usage?]]'''
+ == How do I limit a DataNode's disk usage? ==
  
  Use the dfs.datanode.du.reserved configuration value in $HADOOP_HOME/conf/hdfs-site.xml to limit disk usage.
  
@@ -278, +278 @@

    </property>
  }}}
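  
  For reference, a complete entry takes the standard property form; the 10GB value below is illustrative, and the reservation applies per volume, in bytes:
  
  {{{
    <property>
      <name>dfs.datanode.du.reserved</name>
      <!-- reserved non-dfs space, in bytes, per volume; value is illustrative -->
      <value>10737418240</value>
    </property>
  }}}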
  
- <<BR>> <<Anchor(3.12)>> '''12. [[#A3.12|On an individual data node,
how do you balance the blocks on the disk?]]'''
+ == On an individual data node, how do you balance the blocks on the disk? ==
  
  Hadoop currently does not have a method by which to do this automatically.  To do this manually (a sketch follows the list):
  
@@ -286, +286 @@

   2. Use the UNIX mv command to move the individual blocks and meta pairs from one directory to another on each host
   3. Restart HDFS
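  
  A hedged sketch of those steps (the directory layout and block names are illustrative; actual filenames differ per cluster):
  
  {{{
  $ bin/hadoop-daemon.sh stop datanode
  $ mv /disk1/dfs/data/current/blk_1073741825 /disk2/dfs/data/current/
  $ mv /disk1/dfs/data/current/blk_1073741825_1001.meta /disk2/dfs/data/current/
  $ bin/hadoop-daemon.sh start datanode
  }}}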
  
+ == What does "file could only be replicated to 0 nodes, instead of 1" mean? ==
- 
- <<BR>> <<Anchor(3.13)>> '''13. [[#A3.13|What does "file could only
be replicated to 0 nodes, instead of 1" mean?]]'''
  
  The NameNode does not have any available DataNodes.  This can happen for a wide variety of reasons.  Check the DataNode logs, the NameNode logs, network connectivity, and so on.
  
+ = Platform Specific =
+ == Windows ==
  
+ === Building / Testing Hadoop on Windows ===
- <<BR>> <<Anchor(4)>> [[#A4|Platform Specific]]
- <<BR>> <<Anchor(4.1)>> [[#A4.1|Windows]]
- 
- <<BR>> <<Anchor(4.1.1)>> '''1. [[#A4.1.1|Building / Testing Hadoop
on Windows]]'''
  
  The Hadoop build on Windows can be run from inside a Windows (not Cygwin) command prompt window.
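  
  As a sketch, assuming an Ant-based source checkout (the build system used by Hadoop releases of this era), the build and tests might be run as:
  
  {{{
  C:\hadoop> ant compile
  C:\hadoop> ant test
  }}}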
  
