hadoop-common-commits mailing list archives

From ni...@apache.org
Subject svn commit: r608952 [1/2] - in /lucene/hadoop/trunk/docs: hod.html hod.pdf
Date Fri, 04 Jan 2008 18:21:35 GMT
Author: nigel
Date: Fri Jan  4 10:21:34 2008
New Revision: 608952

URL: http://svn.apache.org/viewvc?rev=608952&view=rev
Log:
HADOOP-1301.  Hadoop-On-Demand (HOD): resource management provisioning for Hadoop. Contributed
by Hemanth Yamijala.

Added:
    lucene/hadoop/trunk/docs/hod.html
    lucene/hadoop/trunk/docs/hod.pdf

Added: lucene/hadoop/trunk/docs/hod.html
URL: http://svn.apache.org/viewvc/lucene/hadoop/trunk/docs/hod.html?rev=608952&view=auto
==============================================================================
--- lucene/hadoop/trunk/docs/hod.html (added)
+++ lucene/hadoop/trunk/docs/hod.html Fri Jan  4 10:21:34 2008
@@ -0,0 +1,1062 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.7">
+<meta name="Forrest-skin-name" content="pelt">
+<title> 
+      Hadoop On Demand
+    </title>
+<link type="text/css" href="skin/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="skin/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="skin/print.css" rel="stylesheet">
+<link type="text/css" href="skin/profile.css" rel="stylesheet">
+<script src="skin/getBlank.js" language="javascript" type="text/javascript"></script><script
src="skin/getMenu.js" language="javascript" type="text/javascript"></script><script
src="skin/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="images/favicon.ico">
+</head>
+<body onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<div class="breadtrail">
+<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://lucene.apache.org/">Lucene</a>
&gt; <a href="http://lucene.apache.org/hadoop/">Hadoop</a><script src="skin/breadcrumbs.js"
language="JavaScript" type="text/javascript"></script>
+</div>
+<div class="header">
+<div class="grouplogo">
+<a href="http://lucene.apache.org/"><img class="logoImage" alt="Lucene" src="images/lucene_green_150.gif"
title="Apache Lucene"></a>
+</div>
+<div class="projectlogo">
+<a href="http://lucene.apache.org/hadoop/"><img class="logoImage" alt="Hadoop" src="images/hadoop-logo.jpg"
title="Scalable Computing Platform"></a>
+</div>
+<div class="searchbox">
+<form action="http://www.google.com/search" method="get" class="roundtopsmall">
+<input value="lucene.apache.org" name="sitesearch" type="hidden"><input onFocus="getBlank
(this, 'Search the site with google');" size="25" name="q" id="query" type="text" value="Search
the site with google">&nbsp; 
+                    <input attr="value" name="Search" value="Search" type="submit">
+</form>
+</div>
+<ul id="tabs">
+<li>
+<a class="base-not-selected" href="http://lucene.apache.org/hadoop/">Project</a>
+</li>
+<li>
+<a class="base-not-selected" href="http://wiki.apache.org/lucene-hadoop">Wiki</a>
+</li>
+<li class="current">
+<a class="base-selected" href="index.html">Hadoop 0.16 Documentation</a>
+</li>
+</ul>
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<div id="level2tabs"></div>
+<script type="text/javascript"><!--
+document.write("<text>Last Published:</text> " + document.lastModified);
+//  --></script>
+</div>
+<div class="breadtrail">
+             
+             &nbsp;
+           </div>
+<div id="menu">
+<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle"
style="background-image: url('skin/images/chapter_open.gif');">Documentation</div>
+<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="index.html">Overview</a>
+</div>
+<div class="menuitem">
+<a href="quickstart.html">Quickstart</a>
+</div>
+<div class="menuitem">
+<a href="cluster_setup.html">Cluster Setup</a>
+</div>
+<div class="menuitem">
+<a href="hdfs_design.html">HDFS Architecture</a>
+</div>
+<div class="menuitem">
+<a href="mapred_tutorial.html">Map-Reduce Tutorial</a>
+</div>
+<div class="menuitem">
+<a href="streaming.html">Streaming</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">Hadoop On Demand</div>
+</div>
+<div class="menuitem">
+<a href="api/index.html">API Docs</a>
+</div>
+<div class="menuitem">
+<a href="http://wiki.apache.org/lucene-hadoop/">Wiki</a>
+</div>
+<div class="menuitem">
+<a href="http://wiki.apache.org/lucene-hadoop/FAQ">FAQ</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/hadoop/mailing_lists.html">Mailing Lists</a>
+</div>
+</div>
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" height="15" width="15" alt="" src="skin/images/rc-b-l-15-1body-2menu-3menu.png"></div>
+<div id="credit2"></div>
+</div>
+<div id="content">
+<div title="Portable Document Format" class="pdflink">
+<a class="dida" href="hod.pdf"><img alt="PDF -icon" src="skin/images/pdfdoc.gif"
class="skin"><br>
+        PDF</a>
+</div>
+<h1> 
+      Hadoop On Demand
+    </h1>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#Introduction"> Introduction </a>
+</li>
+<li>
+<a href="#Feature+List"> Feature List </a>
+<ul class="minitoc">
+<li>
+<a href="#Simplified+Interface+for+Provisioning+Hadoop+Clusters"> Simplified Interface
for Provisioning Hadoop Clusters </a>
+</li>
+<li>
+<a href="#Automatic+installation+of+Hadoop"> Automatic installation of Hadoop </a>
+</li>
+<li>
+<a href="#Configuring+Hadoop"> Configuring Hadoop </a>
+</li>
+<li>
+<a href="#Auto-cleanup+of+Unused+Clusters"> Auto-cleanup of Unused Clusters </a>
+</li>
+<li>
+<a href="#Log+Services"> Log Services </a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#HOD+Components"> HOD Components </a>
+<ul class="minitoc">
+<li>
+<a href="#HOD+Client"> HOD Client </a>
+</li>
+<li>
+<a href="#RingMaster"> RingMaster </a>
+</li>
+<li>
+<a href="#HodRing"> HodRing </a>
+</li>
+<li>
+<a href="#Hodrc+%2F+HOD+configuration+file"> Hodrc / HOD configuration file </a>
+</li>
+<li>
+<a href="#Submit+Nodes+and+Compute+Nodes"> Submit Nodes and Compute Nodes </a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Getting+Started+with+HOD"> Getting Started with HOD </a>
+<ul class="minitoc">
+<li>
+<a href="#Pre-Requisites"> Pre-Requisites </a>
+<ul class="minitoc">
+<li>
+<a href="#Hardware"> Hardware </a>
+</li>
+<li>
+<a href="#Software"> Software </a>
+</li>
+<li>
+<a href="#Resource+Manager+Configuration+Pre-requisites">Resource Manager Configuration
Pre-requisites</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Setting+up+HOD">Setting up HOD</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Running+HOD">Running HOD</a>
+<ul class="minitoc">
+<li>
+<a href="#Overview">Overview</a>
+<ul class="minitoc">
+<li>
+<a href="#Operation+allocate">Operation allocate</a>
+</li>
+<li>
+<a href="#Running+Hadoop+jobs+using+the+allocated+cluster">Running Hadoop jobs using
the allocated cluster</a>
+</li>
+<li>
+<a href="#Operation+deallocate">Operation deallocate</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Command+Line+Options">Command Line Options</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#HOD+Configuration"> HOD Configuration </a>
+<ul class="minitoc">
+<li>
+<a href="#Introduction+to+HOD+Configuration"> Introduction to HOD Configuration </a>
+</li>
+<li>
+<a href="#Categories+%2F+Sections+in+HOD+Configuration"> Categories / Sections in HOD
Configuration </a>
+</li>
+<li>
+<a href="#Important+and+Commonly+Used+Configuration+Options"> Important and Commonly
Used Configuration Options </a>
+<ul class="minitoc">
+<li>
+<a href="#Common+configuration+options"> Common configuration options </a>
+</li>
+<li>
+<a href="#hod+options"> hod options </a>
+</li>
+<li>
+<a href="#resource_manager+options"> resource_manager options </a>
+</li>
+<li>
+<a href="#ringmaster+options"> ringmaster options </a>
+</li>
+<li>
+<a href="#gridservice-hdfs+options"> gridservice-hdfs options </a>
+</li>
+<li>
+<a href="#gridservice-mapred+options"> gridservice-mapred options </a>
+</li>
+</ul>
+</li>
+</ul>
+</li>
+</ul>
+</div>
+    
+<a name="N1000C"></a><a name="Introduction"></a>
+<h2 class="h3"> Introduction </h2>
+<div class="section">
+<p>
+      The Hadoop On Demand (<acronym title="Hadoop On Demand">HOD</acronym>)
project is a system for provisioning and managing independent Hadoop MapReduce instances on
a shared cluster of nodes. HOD uses a resource manager for allocation. At present it supports
<a href="http://www.clusterresources.com/pages/products/torque-resource-manager.php">Torque</a>
out of the box.
+      </p>
+</div>
+
+    
+<a name="N1001E"></a><a name="Feature+List"></a>
+<h2 class="h3"> Feature List </h2>
+<div class="section">
+<a name="N10024"></a><a name="Simplified+Interface+for+Provisioning+Hadoop+Clusters"></a>
+<h3 class="h4"> Simplified Interface for Provisioning Hadoop Clusters </h3>
+<p>
+        By far the biggest advantage of HOD is the ability to set up a Hadoop cluster quickly. The user
interacts with the cluster through a simple command line interface, the HOD client. HOD brings
up a virtual MapReduce cluster with the required number of nodes, which the user can use for
running Hadoop jobs. When done, HOD will automatically clean up the resources and make the
nodes available again.
+        </p>
+<a name="N1002E"></a><a name="Automatic+installation+of+Hadoop"></a>
+<h3 class="h4"> Automatic installation of Hadoop </h3>
+<p>
+        With HOD, Hadoop does not even need to be installed on the cluster. The user can
provide a Hadoop tarball that HOD will automatically distribute to all the nodes in the cluster.
+        </p>
+<a name="N10038"></a><a name="Configuring+Hadoop"></a>
+<h3 class="h4"> Configuring Hadoop </h3>
+<p>
+        Dynamic parameters of the Hadoop configuration, such as the NameNode and JobTracker addresses
and ports and the file system temporary directories, are generated and distributed by HOD automatically
to all nodes in the cluster. In addition, HOD allows the user to configure Hadoop parameters
at both the server (e.g., JobTracker) and client (e.g., JobClient) level, including
the 'final' parameters that were introduced with Hadoop 0.15.
+        </p>
+<a name="N10042"></a><a name="Auto-cleanup+of+Unused+Clusters"></a>
+<h3 class="h4"> Auto-cleanup of Unused Clusters </h3>
+<p>
+        HOD has an automatic timeout so that users cannot hold on to resources they aren't using.
The timeout applies only when there is no MapReduce job running. 
+        </p>
+<a name="N1004C"></a><a name="Log+Services"></a>
+<h3 class="h4"> Log Services </h3>
+<p>
+        HOD can be used to collect all MapReduce logs to a central location for archiving
and inspection after the job is completed.
+        </p>
+</div>
+
+    
+<a name="N10057"></a><a name="HOD+Components"></a>
+<h2 class="h3"> HOD Components </h2>
+<div class="section">
+<p>
+      This is a brief overview of the various components of HOD and how they interact to
provision Hadoop.
+      </p>
+<a name="N10060"></a><a name="HOD+Client"></a>
+<h3 class="h4"> HOD Client </h3>
+<p>
+        The HOD client is a Unix command that users use to allocate Hadoop MapReduce clusters.
The command provides other options to list allocated clusters and deallocate them. The HOD
client generates the <em>hadoop-site.xml</em> in a user specified directory. The
user can point to this configuration file while running Map/Reduce jobs on the allocated cluster.
+        </p>
+<p>
+        The nodes from where the HOD Client is run are called <em>submit nodes</em>
because jobs are submitted to the resource manager system for allocating and running clusters
from these nodes.
+        </p>
+<a name="N10073"></a><a name="RingMaster"></a>
+<h3 class="h4"> RingMaster </h3>
+<p>
+        The RingMaster is a HOD process that is started on one node for every allocated cluster.
It is submitted as a 'job' to the resource manager by the HOD client. It controls which Hadoop
daemons start on which nodes. It provides this information to other HOD processes, such as
the HOD client, so users can also determine this information. The RingMaster is responsible
for hosting and distributing the Hadoop tarball to all nodes in the cluster. It also automatically
cleans up unused clusters.
+        </p>
+<a name="N10080"></a><a name="HodRing"></a>
+<h3 class="h4"> HodRing </h3>
+<p>
+        The HodRing is a HOD process that runs on every allocated node in the cluster. These
processes are run by the RingMaster through the resource manager, using a facility of parallel
execution. The HodRings are responsible for launching Hadoop commands on the nodes to bring
up the Hadoop daemons. They get the command to launch from the RingMaster.
+        </p>
+<a name="N1008A"></a><a name="Hodrc+%2F+HOD+configuration+file"></a>
+<h3 class="h4"> Hodrc / HOD configuration file </h3>
+<p>
+        An INI style configuration file where the users configure various options for the
HOD system, including install locations of different software, resource manager parameters,
log and temp file directories, parameters for their MapReduce jobs, etc.
+        </p>
+<a name="N10094"></a><a name="Submit+Nodes+and+Compute+Nodes"></a>
+<h3 class="h4"> Submit Nodes and Compute Nodes </h3>
+<p>
+        The nodes from where the <em>HOD Client</em> is run are referred to as <em>submit
nodes</em> because jobs are submitted to the resource manager system for allocating
and running clusters from these nodes.
+        </p>
+<p>
+        The nodes where the <em>RingMaster</em> and <em>HodRings</em>
run are called the Compute nodes. These are the nodes that get allocated by a resource manager,
and on which the Hadoop daemons are provisioned and started.
+        </p>
+</div>
+
+    
+<a name="N100AE"></a><a name="Getting+Started+with+HOD"></a>
+<h2 class="h3"> Getting Started with HOD </h2>
+<div class="section">
+<a name="N100B4"></a><a name="Pre-Requisites"></a>
+<h3 class="h4"> Pre-Requisites </h3>
+<a name="N100BA"></a><a name="Hardware"></a>
+<h4> Hardware </h4>
+<p>
+          HOD requires a minimum of 3 nodes configured through a resource manager.
+          </p>
+<a name="N100C4"></a><a name="Software"></a>
+<h4> Software </h4>
+<p>
+          The following components are assumed to be installed before using HOD:
+          </p>
+<ul>
+            
+<li>
+              
+<em>Torque:</em> Currently HOD supports Torque out of the box. We assume that
you are familiar with configuring Torque. You can get information about this from <a href="http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki">here</a>.
+            </li>
+            
+<li>
+              
+<em>Python:</em> We require version 2.5.1, which can be downloaded from <a
href="http://www.python.org/">here</a>.
+            </li>
+          
+</ul>
+<p>
+          The following components can be optionally installed for getting better functionality
from HOD:
+          </p>
+<ul>
+            
+<li>
+              
+<em>Twisted Python:</em> This can be used for improving the scalability of HOD.
Twisted Python is available <a href="http://twistedmatrix.com/trac/">here</a>.
+            </li>
+            
+<li>
+            
+<em>Hadoop:</em> HOD can automatically distribute Hadoop to all nodes in the
cluster. However, it can also use a pre-installed version of Hadoop, if it is available on
all nodes in the cluster. HOD currently supports only Hadoop 0.16, which is under development.
+            </li>
+          
+</ul>
+<p>
+          HOD configuration requires the install locations of these components to be the
same on all nodes in the cluster. Using the same locations on the submit nodes as well
makes the configuration simpler.
+          </p>
+<a name="N100FE"></a><a name="Resource+Manager+Configuration+Pre-requisites"></a>
+<h4>Resource Manager Configuration Pre-requisites</h4>
+<p>
+          For using HOD with Torque:
+          </p>
+<ul>
+            
+<li>
+            Install Torque components: pbs_server on a head node, pbs_moms on all compute
nodes, and PBS client tools on all compute nodes and submit nodes.
+            </li>
+            
+<li>
+            Create a queue for submitting jobs on the pbs_server.
+            </li>
+            
+<li>
+            Specify a name for all nodes in the cluster by setting a 'node property' on
all the nodes. This can be done by using the 'qmgr' command. For example:
+            <em>qmgr -c "set node node properties=cluster-name"</em>
+            
+</li>
+            
+<li>
+            Ensure that jobs can be submitted to the nodes. This can be done by using the
'qsub' command. For example:
+            <em>echo "sleep 30" | qsub -l nodes=3</em>
+            
+</li>
+          
+</ul>
+<p>
+          More information about setting up Torque can be found by referring to the documentation
<a href="http://www.clusterresources.com/pages/products/torque-resource-manager.php">here.</a>
+          
+</p>
+<a name="N10125"></a><a name="Setting+up+HOD"></a>
+<h3 class="h4">Setting up HOD</h3>
+<ul>
+          
+<li>
+          HOD is available in the 'contrib' section of Hadoop under the root directory 'hod'.
Distribute the files under this directory to all the nodes in the cluster.
+          </li>
+          
+<li>
+          On the node from which you want to run hod, edit the file hodrc, which can be found
in the <em>install dir/conf</em> directory. This file contains the minimal set
of values required for running hod.
+          </li>
+          
+<li>
+          Specify values suitable to your environment for the following variables defined
in the configuration file. Note that some of these variables are defined at more than one
place in the file.
+          </li>
+      
+</ul>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+          
+<tr>
+            
+<th colspan="1" rowspan="1"> Variable Name </th>
+            <th colspan="1" rowspan="1"> Meaning </th>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1"> ${JAVA_HOME} </td>
+            <td colspan="1" rowspan="1"> Location of Java for Hadoop. Hadoop supports
Sun JDK 1.5.x </td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1"> ${CLUSTER_NAME} </td>
+            <td colspan="1" rowspan="1"> Name of the cluster which is specified in
the 'node property' as mentioned in resource manager configuration. </td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1"> ${HADOOP_HOME} </td>
+            <td colspan="1" rowspan="1"> Location of Hadoop installation on the compute
and submit nodes. </td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1"> ${RM_QUEUE} </td>
+            <td colspan="1" rowspan="1"> Queue configured for submitting jobs in the
resource manager configuration. </td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1"> ${RM_HOME} </td>
+            <td colspan="1" rowspan="1"> Location of the resource manager installation
on the compute and submit nodes. </td>
+          
+</tr>
+        
+</table>
+<ul>
+          
+<li>
+          The following environment variables *may* need to be set depending on your environment.
These variables must be defined where you run the HOD client, and also be specified in the
HOD configuration file as the value of the key resource_manager.env-vars. Multiple variables
can be specified as a comma separated list of key=value pairs.
+          </li>
+        
+</ul>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+          
+<tr>
+            
+<th colspan="1" rowspan="1"> Variable Name </th>
+            <th colspan="1" rowspan="1"> Meaning </th>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">HOD_PYTHON_HOME</td>
+            <td colspan="1" rowspan="1">
+            If you install Python to a non-default location on the compute nodes or submit
nodes, this variable must be defined to point to the Python executable in that non-standard
  location.
+            </td>
+          
+</tr>
+        
+</table>
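+<p>
+        As an illustration of the comma separated key=value format described above for
+        resource_manager.env-vars, the following Python sketch splits such a value into a
+        dictionary. The function name parse_env_vars and the sample path are hypothetical,
+        and HOD's own parsing may differ.
+        </p>

```python
def parse_env_vars(value):
    """Split a 'KEY1=val1,KEY2=val2' string into a dict (illustrative only)."""
    result = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas or an empty value
        key, sep, val = pair.partition("=")
        if not sep or not key:
            raise ValueError("expected key=value, got %r" % pair)
        result[key.strip()] = val.strip()
    return result


# Hypothetical value for the env-vars key; the path is a placeholder.
env = parse_env_vars("HOD_PYTHON_HOME=/opt/python/bin/python")
print(env["HOD_PYTHON_HOME"])
```

+<p>
+        Only the first "=" in each pair is treated as the separator, so values that
+        themselves contain "=" are kept intact.
+        </p>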
+<p>
+        You can also review other configuration options in the file and modify them to suit
your needs. Refer to the section on configuration below for information about the HOD
configuration.
+        </p>
+</div>
+
+    
+<a name="N101B3"></a><a name="Running+HOD"></a>
+<h2 class="h3">Running HOD</h2>
+<div class="section">
+<a name="N101B9"></a><a name="Overview"></a>
+<h3 class="h4">Overview</h3>
+<p>
+        A typical session of HOD will involve at least three steps: allocate, run Hadoop jobs,
deallocate.
+        </p>
+<a name="N101C2"></a><a name="Operation+allocate"></a>
+<h4>Operation allocate</h4>
+<p>
+          The allocate operation is used to allocate a set of nodes and install and provision
Hadoop on them. It has the following syntax:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">hod -c config_file -t hadoop_tarball_location -o "allocate
                cluster_dir number_of_nodes"</td>
+            
+</tr>
+          
+</table>
+<p>
+          The hadoop_tarball_location must be a location on a shared file system accessible
from all nodes in the cluster. Note, the cluster_dir must exist before running the command.
If the command completes successfully then cluster_dir/hadoop-site.xml will be generated and
will contain information about the allocated cluster's JobTracker and NameNode.
+          </p>
+<p>
+          For example, the following command uses a hodrc file in ~/hod-config/hodrc and
allocates Hadoop (provided by the tarball ~/share/hadoop.tar.gz) on 10 nodes, storing the
generated Hadoop configuration in a directory named <em>~/hadoop-cluster</em>:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">$ hod -c ~/hod-config/hodrc -t ~/share/hadoop.tar.gz -o
"allocate ~/hadoop-cluster 10"</td>
+            
+</tr>
+          
+</table>
+<p>
+          HOD also supports an environment variable called <em>HOD_CONF_DIR</em>.
If this is defined, HOD will look for a default hodrc file at $HOD_CONF_DIR/hodrc. Defining
this allows the above command to also be run as follows:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">
+                
+<p>$ export HOD_CONF_DIR=~/hod-config</p>
+                
+<p>$ hod -t ~/share/hadoop.tar.gz -o "allocate ~/hadoop-cluster 10"</p>
+              
+</td>
+            
+</tr>
+          
+</table>
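+<p>
+        The lookup described above can be sketched in Python as follows. This is only an
+        illustration, not HOD's implementation: the function name resolve_hodrc is
+        hypothetical, and the sketch assumes that an explicit -c value takes precedence
+        over HOD_CONF_DIR.
+        </p>

```python
import os


def resolve_hodrc(cli_config=None, environ=None):
    """Return the hodrc path: an explicit -c value, else $HOD_CONF_DIR/hodrc."""
    environ = os.environ if environ is None else environ
    if cli_config:
        # Assumption: a file named with -c overrides the default location.
        return os.path.expanduser(cli_config)
    conf_dir = environ.get("HOD_CONF_DIR")
    if conf_dir:
        return os.path.join(os.path.expanduser(conf_dir), "hodrc")
    raise RuntimeError("no hodrc: pass -c or set HOD_CONF_DIR")


# Example with an explicit environment mapping instead of the real one.
print(resolve_hodrc(environ={"HOD_CONF_DIR": "/home/user/hod-config"}))
```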
+<a name="N10202"></a><a name="Running+Hadoop+jobs+using+the+allocated+cluster"></a>
+<h4>Running Hadoop jobs using the allocated cluster</h4>
+<p>
+          Now, one can run Hadoop jobs using the allocated cluster in the usual manner:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">hadoop --config cluster_dir hadoop_command hadoop_command_args</td>
+            
+</tr>
+          
+</table>
+<p>
+          Continuing our example, the following command will run a wordcount example on the
allocated cluster:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">$ hadoop --config ~/hadoop-cluster jar /path/to/hadoop/hadoop-examples.jar
wordcount /path/to/input /path/to/output</td>
+            
+</tr>
+          
+</table>
+<a name="N10225"></a><a name="Operation+deallocate"></a>
+<h4>Operation deallocate</h4>
+<p>
+          The deallocate operation is used to release an allocated cluster. When finished
with a cluster, deallocate must be run so that the nodes become free for others to use. The
deallocate operation has the following syntax:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">hod -o "deallocate cluster_dir"</td>
+            
+</tr>
+          
+</table>
+<p>
+          Continuing our example, the following command will deallocate the cluster:
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<td colspan="1" rowspan="1">$ hod -o "deallocate ~/hadoop-cluster"</td>
+            
+</tr>
+          
+</table>
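+<p>
+        The three steps of a session can be assembled programmatically. The Python sketch
+        below only builds the argument lists for the commands shown in this section; it does
+        not run them. The function name hod_session_commands and all paths are hypothetical
+        placeholders.
+        </p>

```python
import shlex


def hod_session_commands(hodrc, tarball, cluster_dir, nodes, hadoop_args):
    """Build the allocate, run and deallocate commands as argument lists."""
    allocate = ["hod", "-c", hodrc, "-t", tarball,
                "-o", "allocate %s %d" % (cluster_dir, nodes)]
    run_job = ["hadoop", "--config", cluster_dir] + list(hadoop_args)
    deallocate = ["hod", "-c", hodrc, "-o", "deallocate %s" % cluster_dir]
    return [allocate, run_job, deallocate]


# Placeholder paths; absolute paths are used because "~" is only expanded by a shell.
commands = hod_session_commands(
    "/home/user/hod-config/hodrc",
    "/home/user/share/hadoop.tar.gz",
    "/home/user/hadoop-cluster",
    10,
    ["jar", "/path/to/hadoop/hadoop-examples.jar", "wordcount",
     "/path/to/input", "/path/to/output"],
)
for cmd in commands:
    # shlex.quote keeps the quoted -o operation together when printed.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

+<p>
+        Each list could be passed to subprocess.call in order. Remember that the cluster
+        directory must exist before the allocate command is run.
+        </p>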
+<a name="N10249"></a><a name="Command+Line+Options"></a>
+<h3 class="h4">Command Line Options</h3>
+<p>
+        This section covers the major command line options available via the hod command:
+        </p>
+<p>
+        
+<em>--help</em>
+        
+</p>
+<p>
+        Prints out the help message to see the basic options.
+        </p>
+<p>
+        
+<em>--verbose-help</em>
+        
+</p>
+<p>
+        All configuration options provided in the hodrc file can be passed on the command
line, using the syntax --section_name.option_name[=value]. When provided this way, the value
provided on command line overrides the option provided in hodrc. The verbose-help command
lists all the available options in the hodrc file. This is also a nice way to see the meaning
of the configuration options.
+        </p>
+<p>
+        
+<em>-c config_file</em>
+        
+</p>
+<p>
+        Provides the configuration file to use. Can be used with all other options of HOD.
Alternatively, the HOD_CONF_DIR environment variable can be defined to specify a directory
that contains a file named hodrc, alleviating the need to specify the configuration file in
each HOD command.
+        </p>
+<p>
+        
+<em>-b 1|2|3|4</em>
+        
+</p>
+<p>
+        Enables the given debug level. Can be used with all other options of HOD. 4 is most
verbose.
+        </p>
+<p>
+        
+<em>-o "help"</em>
+        
+</p>
+<p>
+        Lists the operations available in the operation mode.
+        </p>
+<p>
+        
+<em>-o "allocate cluster_dir number_of_nodes"</em>
+        
+</p>
+<p>
+        Allocates a cluster on the given number of cluster nodes, and stores the allocation
information in cluster_dir for use with subsequent hadoop commands. Note that the cluster_dir
must exist before running the command.
+        </p>
+<p>
+        
+<em>-o "list"</em>
+        
+</p>
+<p>
+        Lists the clusters allocated by this user. Information provided includes the Torque
job id corresponding to the cluster, the cluster directory where the allocation information
is stored, and whether the Map/Reduce daemon is still active or not.
+        </p>
+<p>
+        
+<em>-o "info cluster_dir"</em>
+        
+</p>
+<p>
+        Lists information about the cluster whose allocation information is stored in the
specified cluster directory.
+        </p>
+<p>
+        
+<em>-o "deallocate cluster_dir"</em>
+        
+</p>
+<p>
+       Deallocates the cluster whose allocation information is stored in the specified cluster
directory.
+        </p>
+<p>
+        
+<em>-t hadoop_tarball</em>
+        
+</p>
+<p>
+        Provisions Hadoop from the given tar.gz file. This option is only applicable to the
allocate operation. For better distribution performance it is recommended that the Hadoop
tarball contain only the libraries and binaries, and not the source or documentation. 
+        </p>
+<p>
+        
+<em>-Mkey1=value1 -Mkey2=value2</em>
+        
+</p>
+<p>
+        Provides configuration parameters for the provisioned Map/Reduce daemons (JobTracker
and TaskTrackers). A hadoop-site.xml is generated with these values on the cluster nodes.
+        </p>
+<p>
+        
+<em>-Hkey1=value1 -Hkey2=value2</em>
+        
+</p>
+<p>
+        Provides configuration parameters for the provisioned HDFS daemons (NameNode and
DataNodes). A hadoop-site.xml is generated with these values on the cluster nodes.
+        </p>
+<p>
+        
+<em>-Ckey1=value1 -Ckey2=value2</em>
+        
+</p>
+<p>
+        Provides configuration parameters for the client from where jobs can be submitted.
A hadoop-site.xml is generated with these values on the submit node.
+        </p>
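+<p>
+        To make the effect of the -M, -H and -C options more concrete, the following Python
+        sketch renders key=value pairs in the property layout used by hadoop-site.xml. It is
+        a minimal illustration, not the code HOD uses; the function name build_site_xml and
+        the sample parameter value are assumptions.
+        </p>

```python
import xml.etree.ElementTree as ET


def build_site_xml(params, final_keys=()):
    """Render key/value pairs as a Hadoop-style configuration document."""
    root = ET.Element("configuration")
    for name in sorted(params):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(params[name])
        if name in final_keys:
            # 'final' parameters cannot be overridden by later configuration.
            ET.SubElement(prop, "final").text = "true"
    return ET.tostring(root, encoding="unicode")


# Illustrative parameter, as it might be passed with -Mkey=value.
print(build_site_xml({"mapred.reduce.parallel.copies": "20"}))
```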
+</div>
+    
+<a name="N102C9"></a><a name="HOD+Configuration"></a>
+<h2 class="h3"> HOD Configuration </h2>
+<div class="section">
+<a name="N102CF"></a><a name="Introduction+to+HOD+Configuration"></a>
+<h3 class="h4"> Introduction to HOD Configuration </h3>
+<p>
+        Configuration options for HOD are organized as sections and options within them.
They can be specified in two ways: a configuration file in the INI format, and as command
line options to the HOD shell, specified in the format --section.option[=value]. If the same
option is specified in both places, the value specified on the command line overrides the
value in the configuration file.
+        </p>
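+<p>
+        The override rule described above can be sketched in Python as follows. This is an
+        illustration under stated assumptions rather than HOD's own option handling: the
+        function name apply_overrides is hypothetical, and an option given without a value
+        is assumed here to act as a boolean flag.
+        </p>

```python
def apply_overrides(config, argv):
    """Overlay --section.option[=value] arguments on a {section: {option: value}} dict."""
    merged = {section: dict(options) for section, options in config.items()}
    for arg in argv:
        name, sep, value = arg[2:].partition("=")
        if not arg.startswith("--") or "." not in name:
            continue  # not a section.option override, e.g. --verbose-help
        section, option = name.split(".", 1)
        # Assumption: an option without '=value' is treated as a boolean flag.
        merged.setdefault(section, {})[option] = value if sep else "True"
    return merged


file_values = {"hod": {"debug": "3"}}
effective = apply_overrides(file_values, ["--hod.debug=4"])
print(effective["hod"]["debug"])
```

+<p>
+        The values read from the file are copied before the overlay, so the original
+        dictionary still reflects what the configuration file contained.
+        </p>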
+<p>
+        To get a simple description of all configuration options, you can type <em>hod
--verbose-help</em>
+        
+</p>
+<p>
+        This section explains some of the most important or commonly used configuration options
in some more detail.
+        </p>
+<a name="N102E2"></a><a name="Categories+%2F+Sections+in+HOD+Configuration"></a>
+<h3 class="h4"> Categories / Sections in HOD Configuration </h3>
+<p>
+        The following are the various sections in the HOD configuration:
+        </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+          
+<tr>
+            
+<th colspan="1" rowspan="1"> Section Name </th>
+            <th colspan="1" rowspan="1"> Description </th>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">hod</td>
+            <td colspan="1" rowspan="1">Options for the HOD client</td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">resource_manager</td>
+            <td colspan="1" rowspan="1">Options for specifying which resource manager
to use, and other parameters for using that resource manager</td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">ringmaster</td>
+            <td colspan="1" rowspan="1">Options for the RingMaster process</td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">hodring</td>
+            <td colspan="1" rowspan="1">Options for the HodRing process</td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">gridservice-mapred</td>
+            <td colspan="1" rowspan="1">Options for the MapReduce daemons</td>
+          
+</tr>
+          
+<tr>
+            
+<td colspan="1" rowspan="1">gridservice-hdfs</td>
+            <td colspan="1" rowspan="1">Options for the HDFS daemons</td>
+          
+</tr>
+        
+</table>
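+<p>
+        As a rough illustration of this layout, the Python sketch below reads an INI fragment
+        with the six sections listed above using the standard configparser module. The section
+        and option names come from this document, but every value is a placeholder, and the
+        standard parser only approximates how HOD reads its own file.
+        </p>

```python
import configparser

# Hypothetical hodrc fragment; values are placeholders, not recommended settings.
SAMPLE_HODRC = """
[hod]
cluster = cluster-name
debug = 3
temp-dir = /tmp/hod

[resource_manager]
env-vars = HOD_PYTHON_HOME=/opt/python/bin/python

[ringmaster]
debug = 4

[hodring]
debug = 4

[gridservice-mapred]

[gridservice-hdfs]
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE_HODRC)

# Print each section with the options defined in it.
for section in parser.sections():
    print(section, dict(parser[section]))
```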
+<a name="N1034A"></a><a name="Important+and+Commonly+Used+Configuration+Options"></a>
+<h3 class="h4"> Important and Commonly Used Configuration Options </h3>
+<a name="N10350"></a><a name="Common+configuration+options"></a>
+<h4> Common configuration options </h4>
+<p>
+          Certain configuration options are defined in most of the sections of the HOD configuration.
Options defined in a section are used by the process to which that section applies. These
options have the same meaning, but can have different values in each section.
+          </p>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">temp-dir</td>
+              <td colspan="1" rowspan="1">Temporary directory for use by the HOD
processes. Make sure that the users who will run hod have permission to create directories under
the directory specified here.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">debug</td>
+              <td colspan="1" rowspan="1">A numeric value from 1-4. 4 produces the
most log information, and 1 the least.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">log-dir</td>
+              <td colspan="1" rowspan="1">Directory where log files are stored. By
default, this is <em>install-location/logs/</em>. The restrictions and notes for
the temp-dir variable apply here too.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">xrs-port-range</td>
+              <td colspan="1" rowspan="1">A range of ports from which an available
port is picked to run any XML-RPC based server daemon processes of HOD.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">http-port-range</td>
+              <td colspan="1" rowspan="1">A range of ports from which an available
port is picked to run any HTTP based server daemon processes of HOD.</td>
+            
+</tr>
+          
+</table>
+<a name="N103AE"></a><a name="hod+options"></a>
+<h4> hod options </h4>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">cluster</td>
+              <td colspan="1" rowspan="1">A descriptive name given to the cluster.
For Torque, this is specified as a 'Node property' for every node in the cluster. HOD uses
this value to compute the number of available nodes.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">client-params</td>
+              <td colspan="1" rowspan="1">A comma-separated list of hadoop config parameters
specified as key-value pairs. These will be used to generate a hadoop-site.xml on the submit
node that should be used for running MapReduce jobs.</td>
+            
+</tr>
+          
+</table>
+<a name="N103DF"></a><a name="resource_manager+options"></a>
+<h4> resource_manager options </h4>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">queue</td>
+              <td colspan="1" rowspan="1">Name of the queue configured in the resource
manager to which jobs are to be submitted.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">batch-home</td>
+              <td colspan="1" rowspan="1">Install directory of the resource manager.
'bin' is appended to this path to locate the executables of the resource manager.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">env-vars</td>
+              <td colspan="1" rowspan="1">A comma-separated list of key-value
pairs, expressed as key=value, which are passed to the jobs launched on the compute nodes.
For example, if the Python installation is in a non-standard location, set the environment
variable 'HOD_PYTHON_HOME' to the path of the python executable. The HOD processes launched
on the compute nodes can then use this variable.</td>
+            
+</tr>
+          
+</table>
+<a name="N1041D"></a><a name="ringmaster+options"></a>
+<h4> ringmaster options </h4>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">work-dirs</td>
+              <td colspan="1" rowspan="1">A comma-separated list of paths
that serve as the root for directories that HOD generates and passes to Hadoop for
storing DFS / MapReduce data. For example, this is where DFS data blocks are stored. Typically,
as many paths are specified as there are disks available, to ensure that all disks are utilized.
The restrictions and notes for the temp-dir variable apply here too.</td>
+            
+</tr>
+          
+</table>
+<a name="N10441"></a><a name="gridservice-hdfs+options"></a>
+<h4> gridservice-hdfs options </h4>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">external</td>
+              <td colspan="1" rowspan="1">
+              
+<p> If false, this indicates that an HDFS cluster must be brought up by the HOD system,
on the nodes which it allocates via the allocate command. Note that in that case, when the
cluster is de-allocated, HOD will bring down the HDFS cluster, and all the data will be lost.
If true, HOD will try to connect to an externally configured HDFS system. </p>
+              
+<p>Typically, because input for jobs is placed into HDFS before jobs are run, and
the output from jobs in HDFS is required to be persistent, an internal HDFS cluster is
of little value in a production system. However, it allows for quick testing.</p>
+              
+</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">host</td>
+              <td colspan="1" rowspan="1">Hostname of the externally configured NameNode,
if any.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">fs_port</td>
+              <td colspan="1" rowspan="1">Port to which NameNode RPC server is bound.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">info_port</td>
+              <td colspan="1" rowspan="1">Port to which the NameNode web UI server
is bound.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">pkgs</td>
+              <td colspan="1" rowspan="1">Installation directory under which the bin/hadoop
executable is located. Set this to use a pre-installed version of Hadoop on the cluster.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">server-params</td>
+              <td colspan="1" rowspan="1">A comma-separated list of hadoop config parameters
specified as key-value pairs. These will be used to generate a hadoop-site.xml that will be used
by the NameNode and DataNodes.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">final-server-params</td>
+              <td colspan="1" rowspan="1">Same as server-params, except that the
parameters will be marked final.</td>
+            
+</tr>
+          
+</table>
+<a name="N104B9"></a><a name="gridservice-mapred+options"></a>
+<h4> gridservice-mapred options </h4>
+<table class="ForrestTable" cellspacing="1" cellpadding="4">
+            
+<tr>
+              
+<th colspan="1" rowspan="1"> Option Name </th>
+              <th colspan="1" rowspan="1"> Description </th>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">external</td>
+              <td colspan="1" rowspan="1">
+              
+<p> If false, this indicates that a MapReduce cluster must be brought up by the HOD
system on the nodes which it allocates via the allocate command. If true, HOD will try to
connect to an externally configured MapReduce system.</p>
+              
+</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">host</td>
+              <td colspan="1" rowspan="1">Hostname of the externally configured JobTracker,
if any.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">tracker_port</td>
+              <td colspan="1" rowspan="1">Port to which the JobTracker RPC server is
bound.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">info_port</td>
+              <td colspan="1" rowspan="1">Port to which the JobTracker web UI server
is bound.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">pkgs</td>
+              <td colspan="1" rowspan="1">Installation directory under which the bin/hadoop
executable is located. Set this to use a pre-installed version of Hadoop on the cluster.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">server-params</td>
+              <td colspan="1" rowspan="1">A comma-separated list of hadoop config parameters
specified as key-value pairs. These will be used to generate a hadoop-site.xml that will be used
by the JobTracker and TaskTrackers.</td>
+            
+</tr>
+            
+<tr>
+              
+<td colspan="1" rowspan="1">final-server-params</td>
+              <td colspan="1" rowspan="1">Same as server-params, except that the
parameters will be marked final.</td>
+            
+</tr>
+          
+</table>
+</div>
+  
+</div>
+<div class="clearboth">&nbsp;</div>
+</div>
+<div id="footer">
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("<text>Last Published:</text> " + document.lastModified);
+//  --></script>
+</div>
+<div class="copyright">
+        Copyright &copy;
+         2007 <a href="http://www.apache.org/licenses/">The Apache Software Foundation.</a>
+</div>
+</div>
+</body>
+</html>
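
The client-params, server-params and final-server-params options in the page above are each described as a comma-separated list of key-value pairs from which a hadoop-site.xml is generated. The following is a minimal sketch of that idea, not HOD's actual implementation. The function names parse_params and render_site_xml are hypothetical, the splitting is deliberately naive (a value that itself contains a comma is not handled), and the property/name/value/final layout is the usual Hadoop configuration file layout, which HOD is assumed to produce.

```python
from xml.sax.saxutils import escape


def parse_params(raw):
    """Split 'k1=v1,k2=v2' into an ordered list of (key, value) tuples.

    Hypothetical helper: assumes no value contains a literal comma.
    """
    pairs = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError("expected key=value, got %r" % item)
        pairs.append((key.strip(), value.strip()))
    return pairs


def render_site_xml(params, final_params=()):
    """Render the pairs as hadoop-site.xml style text.

    Entries from final_params additionally carry <final>true</final>,
    mirroring the 'marked final' wording of final-server-params.
    """
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for pairs, is_final in ((params, False), (final_params, True)):
        for key, value in pairs:
            lines.append("  <property>")
            lines.append("    <name>%s</name>" % escape(key))
            lines.append("    <value>%s</value>" % escape(value))
            if is_final:
                lines.append("    <final>true</final>")
            lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; any Hadoop parameter names could be used here.
    server = parse_params("io.sort.factor=100,io.sort.mb=200")
    final = parse_params("dfs.replication=3")
    print(render_site_xml(server, final))
```

Keeping the parsing and the rendering separate means the same parser could also serve the env-vars option, which uses the same key=value list form.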


