hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "GettingStartedWithHadoop" by SameerParanjpye
Date Tue, 19 Sep 2006 20:00:21 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by SameerParanjpye:
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop

------------------------------------------------------------------------------
  = Downloading and installing Hadoop =
  
  Hadoop can be downloaded from [http://www.apache.org/dyn/closer.cgi/lucene/hadoop/ here]. You may also
  download a nightly build from [http://cvs.apache.org/dist/lucene/hadoop/nightly/ here] or check out the
  code from [http://lucene.apache.org/hadoop/version_control.html subversion] and build it with
  [http://ant.apache.org Ant]. Select a directory to install Hadoop under (let's call it ''hadoop-install'')
  and untar the tarball in that directory. If you downloaded version ''<ver>'' of Hadoop, untarring will
  create a directory called ''hadoop-<ver>'' in the ''hadoop-install'' directory. All scripts and tools
  used to run Hadoop will be present in the directory ''hadoop-<ver>/bin''. All configuration files for
  Hadoop will be present in the directory ''hadoop-<ver>/conf''. These directories will subsequently be
  referred to as ''hadoop/bin'' and ''hadoop/conf'' respectively in this document.
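+ 
+ For example, if the release you downloaded is ''hadoop-<ver>.tar.gz'', installation amounts to unpacking it in the chosen directory. The paths below are only placeholders; substitute your own:
+ 
+ {{{
+ # unpack the release under the chosen install directory
+ mkdir -p /path/to/hadoop-install
+ cd /path/to/hadoop-install
+ tar xzf /path/to/downloads/hadoop-<ver>.tar.gz
+ 
+ # the scripts and configuration files are now in place
+ ls hadoop-<ver>/bin hadoop-<ver>/conf
+ }}}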
@@ -36, +36 @@

  
  More details on configuration can be found on the HowToConfigure page.
  
+ == Setting up Hadoop on a single node ==
- = Starting Hadoop using Hadoop scripts =
- This section explains how to set up a Hadoop cluster running Hadoop DFS and Hadoop Mapreduce. The startup scripts are in hadoop/bin. The file that contains all the slave nodes that would join the DFS and map reduce cluster is the slaves file in hadoop/conf. Edit the slaves file to add nodes to your cluster. You need to edit the slaves file only on the machines you plan to run the Jobtracker and Namenode on. In case you want to run a single node cluster you do not have to edit the slaves file. Next edit the file hadoop-env.sh in the hadoop/conf directory. Make sure JAVA_HOME is set correctly. You can change the other environment variables as per your requirements. HADOOP_HOME is automatically determined depending on where you run your hadoop scripts from.
  
+ This section describes how to get started by setting up a Hadoop cluster on a single node. The setup described here is an HDFS instance with a namenode and a single datanode, and a Map/Reduce cluster with a jobtracker and a single tasktracker. The configuration procedures described in Basic Configuration are just as applicable for larger clusters.
  
+ === Basic Configuration ===
- == Environment Variables ==
-  * The only environment variable that you may need to specify is HADOOP_CONF_DIR. Set this variable to your configure directory which contains hadoop-site.xml, hadoop-env.sh and the slaves file. Set this environment variable on all the machines you plan to run Hadoop on. In case you are running bash, you can set it in .bashrc and in case of csh set it in .cshrc. For more information on how to configure Hadoop, take a look at HowToConfigure section.
-  * You can get rid of this environment variable by specifying the configure directory as a --config option for the scripts. All the hadoop scripts take a --config argument which is the configure directory.
  
- == Configuration Parameters ==
- * Change hadoop-site.xml in the configure directory to change the default properties. Take a look at hadoop-default.xml to see how to add properties to hadoop-site.xml. The properties that you would mostly change are the ports and hosts for Namenode and Jobtracker. You should propagate these changes to all the nodes in your cluster.
-   
+ Take a pass at putting together basic configuration settings for your cluster. Some of the settings that follow are required; others are recommended for more straightforward and predictable operation.
+ 
+  * '''Hadoop Environment Settings''' - Make sure JAVA_HOME is set in ''hadoop-env.sh'' and points to the Java installation you intend to use. You can set other environment variables in ''hadoop-env.sh'' to suit your requirements (see the sample snippet after this list). Some of the default settings refer to the variable HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup scripts: HADOOP_HOME is the parent directory of the ''bin'' directory that holds the Hadoop scripts. So, if the scripts are in ''/foo/bar/hadoop-install/hadoop-<ver>/bin'', then HADOOP_HOME is ''/foo/bar/hadoop-install/hadoop-<ver>''.
+ 
+  * '''Jobtracker and Namenode settings''' - Figure out where to run your namenode and jobtracker. Set the variable ''fs.default.name'' to the Namenode's intended host:port. Set the variable ''mapred.job.tracker'' to the jobtracker's intended host:port. These settings should be in ''hadoop-site.xml'' (see the sample file after this list). You may also want to set one or more of the following ports (also in ''hadoop-site.xml''):
+   * dfs.datanode.port
+   * dfs.info.port
+   * mapred.job.tracker.info.port
+   * mapred.task.tracker.output.port
+   * mapred.task.tracker.report.port
+ 
+  * '''Data Path Settings''' - Figure out where your data goes. This includes settings for where the namenode stores the namespace checkpoint and the edits log, where the datanodes store filesystem blocks, storage locations for Map/Reduce intermediate output, and temporary storage for the HDFS client. The default values for these paths point to various locations in ''/tmp''. While this might be ok for a single node installation, for larger clusters storing data in ''/tmp'' is not an option. These settings must also be in ''hadoop-site.xml''. It is important for these settings to be present in ''hadoop-site.xml'' because they can otherwise be overridden by client configuration settings in Map/Reduce jobs. Set the following variables to appropriate values (the sample ''hadoop-site.xml'' after this list shows one way to do this):
+   * dfs.name.dir
+   * dfs.data.dir
+   * dfs.client.buffer.dir
+   * mapred.local.dir
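+ 
+ For most installations the only change needed in ''hadoop-env.sh'' is the JAVA_HOME line; the path below is only an example:
+ 
+ {{{
+ # in hadoop/conf/hadoop-env.sh -- point this at your own Java installation
+ export JAVA_HOME=/usr/local/java
+ }}}
+ 
+ The ''hadoop-site.xml'' below is an illustrative starting point for a single node (or small) cluster: it sets the Namenode and Jobtracker locations and moves the data paths out of ''/tmp''. The hosts, ports and directories are only examples; substitute values appropriate for your own machines.
+ 
+ {{{
+ <?xml version="1.0"?>
+ <!-- example values only: adjust hosts, ports and directories for your cluster -->
+ <configuration>
+ 
+   <property>
+     <name>fs.default.name</name>
+     <value>localhost:9000</value>
+   </property>
+ 
+   <property>
+     <name>mapred.job.tracker</name>
+     <value>localhost:9001</value>
+   </property>
+ 
+   <property>
+     <name>dfs.name.dir</name>
+     <value>/home/hadoop/dfs/name</value>
+   </property>
+ 
+   <property>
+     <name>dfs.data.dir</name>
+     <value>/home/hadoop/dfs/data</value>
+   </property>
+ 
+   <property>
+     <name>dfs.client.buffer.dir</name>
+     <value>/home/hadoop/dfs/tmp</value>
+   </property>
+ 
+   <property>
+     <name>mapred.local.dir</name>
+     <value>/home/hadoop/mapred/local</value>
+   </property>
+ 
+ </configuration>
+ }}}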
+ 
  == Formatting the Namenode ==
   * You need to format the Namenode the first time you install Hadoop, and only then. Do not format a Namenode that has already been running Hadoop; doing so will wipe out the data in your DFS. Run bin/hadoop namenode -format to format your Namenode.
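+ 
+ For example, on the machine that will run the Namenode:
+ 
+ {{{
+ # one-time step; this wipes out any existing DFS data
+ bin/hadoop namenode -format
+ }}}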
  
@@ -57, +69 @@

  
  === Starting up a real cluster ===
   * After formatting the namenode, run bin/start-dfs.sh on the Namenode. This will bring up the dfs with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file mentioned above.
   * Run bin/start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the map reduce cluster with Jobtracker running on the machine you ran the command on and Tasktrackers running on machines listed in the slaves file.
   * If you have not set the HADOOP_CONF_DIR variable, you can instead pass the configuration directory to the startup scripts, e.g. bin/start-dfs.sh --config configure_directory (and likewise for bin/start-mapred.sh).
   * Try executing bin/hadoop dfs -lsr / to see if it is working.
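+ 
+ Putting these steps together, a first startup from the machine that will run the Namenode and Jobtracker might look roughly like this (assuming the configuration directory is picked up via HADOOP_CONF_DIR or the default ''hadoop/conf''; the install path is a placeholder):
+ 
+ {{{
+ cd /path/to/hadoop-install/hadoop-<ver>
+ 
+ # bring up HDFS: Namenode on this machine, Datanodes on the hosts listed in conf/slaves
+ bin/start-dfs.sh
+ 
+ # bring up Map/Reduce: Jobtracker on this machine, Tasktrackers on the hosts in conf/slaves
+ bin/start-mapred.sh
+ 
+ # sanity check: recursively list the root of the DFS
+ bin/hadoop dfs -lsr /
+ }}}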
  
