From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "GettingStartedWithHadoop" by SameerParanjpye
Date Wed, 20 Sep 2006 00:23:37 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by SameerParanjpye:
http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop

------------------------------------------------------------------------------
  = Downloading and installing Hadoop =
  
+ Hadoop can be downloaded from one of the [http://www.apache.org/dyn/closer.cgi/lucene/hadoop/
Apache download mirrors]. You may also download a [http://cvs.apache.org/dist/lucene/hadoop/nightly/
nightly build] or check out the code from [http://lucene.apache.org/hadoop/version_control.html
subversion] and build it with [http://ant.apache.org Ant]. Select a directory to install Hadoop
under (let's say {{{/foo/bar/hadoop-install}}}) and untar the tarball in that directory. A
directory corresponding to the version of Hadoop downloaded will be created under the {{{/foo/bar/hadoop-install}}}
directory. For instance, if version 0.6.0 of Hadoop was downloaded, untarring as described
above will create the directory {{{/foo/bar/hadoop-install/hadoop-0.6.0}}}. The examples in
this document assume the existence of an environment variable {{{$HADOOP_INSTALL}}} that represents
the directory under which all versions of Hadoop are installed. In the above instance {{{HADOOP_INSTALL=/foo/bar/hadoop-install}}}.
They further assume the existence of a symlink named {{{hadoop}}} in {{{$HADOOP_INSTALL}}}
that points to the version of Hadoop being used. For instance, if version 0.6.0 is being
used then {{{$HADOOP_INSTALL/hadoop -> hadoop-0.6.0}}}. All tools used to run Hadoop will
be present in the directory {{{$HADOOP_INSTALL/hadoop/bin}}}. All configuration files for
Hadoop will be present in the directory {{{$HADOOP_INSTALL/hadoop/conf}}}.
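As a concrete illustration of the layout described above, here is a short shell sketch. It reuses the {{{/foo/bar/hadoop-install}}} path and the 0.6.0 version from the example; the location of the downloaded tarball is a placeholder you would substitute.

{{{
# Unpack the release under the chosen install directory (the tarball
# path is a placeholder for wherever you saved the download).
% mkdir -p /foo/bar/hadoop-install
% cd /foo/bar/hadoop-install
% tar xzf /path/to/hadoop-0.6.0.tar.gz

# Create the version-independent symlink and the environment variable
# used throughout this document.
% ln -s hadoop-0.6.0 hadoop
% export HADOOP_INSTALL=/foo/bar/hadoop-install

# The startup scripts and configuration files now live in:
#   $HADOOP_INSTALL/hadoop/bin
#   $HADOOP_INSTALL/hadoop/conf
}}}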
- Hadoop can be downloaded from [http://www.apache.org/dyn/closer.cgi/lucene/hadoop/ here].
You may also
- download a nightly build from [http://cvs.apache.org/dist/lucene/hadoop/nightly/ here] or
check out the
- code from [http://lucene.apache.org/hadoop/version_control.html subversion] and build it
with
- [http://ant.apache.org Ant]. Select a directory to install Hadoop under (let's call it ''hadoop-install'')
- and untar the tarball in that directory. If you downloaded version ''<ver>'' of Hadoop,
untarring will
- create a directory called ''hadoop-<ver>'' in the ''hadoop-install'' directory. All
scripts and tools
- used to run Hadoop will be present in the directory ''hadoop-<ver>/bin''. All configuration
files for
- Hadoop will be present in the directory ''hadoop-<ver>/conf''. These directories will
subsequently be
- referred to as ''hadoop/bin'' and ''hadoop/conf'' respectively in this document.
  
  == Startup scripts ==
  
- The ''hadoop/bin'' directory contains some scripts used to launch Hadoop DFS and Hadoop
Map/Reduce daemons. These
+ The {{{$HADOOP_INSTALL/hadoop/bin}}} directory contains some scripts used to launch Hadoop
DFS and Hadoop Map/Reduce daemons. These are:
- are:
  
-  * ''start-all.sh'' - Starts all Hadoop daemons, the namenode, datanodes, the jobtracker
and tasktrackers.
+  * {{{start-all.sh}}} - Starts all Hadoop daemons, the namenode, datanodes, the jobtracker
and tasktrackers.
-  * ''stop-all.sh'' - Stops all Hadoop daemons.
+  * {{{stop-all.sh}}} - Stops all Hadoop daemons.
-  * ''start-mapred.sh'' - Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
+  * {{{start-mapred.sh}}} - Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
-  * ''stop-mapred.sh'' - Stops the Hadoop Map/Reduce daemons.
+  * {{{stop-mapred.sh}}} - Stops the Hadoop Map/Reduce daemons.
-  * ''start-dfs.sh'' - Starts the Hadoop DFS daemons, the namenode and datanodes.
+  * {{{start-dfs.sh}}} - Starts the Hadoop DFS daemons, the namenode and datanodes.
-  * ''stop-dfs.sh'' - Stops the Hadoop DFS daemons.
+  * {{{stop-dfs.sh}}} - Stops the Hadoop DFS daemons.
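For instance, on a cluster you may prefer to bring up and shut down the DFS and Map/Reduce daemons separately rather than using the combined scripts; a minimal sketch, using only the scripts listed above:

{{{
# Start the DFS daemons (namenode and datanodes), then the Map/Reduce
# daemons (jobtracker and tasktrackers).
% $HADOOP_INSTALL/hadoop/bin/start-dfs.sh
% $HADOOP_INSTALL/hadoop/bin/start-mapred.sh

# Stop them in the reverse order.
% $HADOOP_INSTALL/hadoop/bin/stop-mapred.sh
% $HADOOP_INSTALL/hadoop/bin/stop-dfs.sh
}}}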
  
  == Configuration files ==
  
- The ''hadoop/conf'' directory contains some configuration files for Hadoop. These are:
+ The {{{$HADOOP_INSTALL/hadoop/conf}}} directory contains some configuration files for Hadoop.
These are:
  
-  * ''hadoop-env.sh'' - This file contains some environment variable settings used by Hadoop.
You can use these to affect some aspects of Hadoop daemon behavior, such as where log files
are stored, the maximum amount of heap used etc. The only variable you should need to change
in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
+  * {{{hadoop-env.sh}}} - This file contains some environment variable settings used by Hadoop.
You can use these to affect some aspects of Hadoop daemon behavior, such as where log files
are stored, the maximum amount of heap used, etc. The only variable you should need to change
in this file is {{{JAVA_HOME}}}, which specifies the path to the Java installation used by
Hadoop.
-  * ''slaves'' - This file lists the hosts, one per line, where the Hadoop slave daemons
(datanodes and tasktrackers) will run. By default this contains the single entry ''localhost''
+  * {{{slaves}}} - This file lists the hosts, one per line, where the Hadoop slave daemons
(datanodes and tasktrackers) will run. By default this contains the single entry {{{localhost}}} (see the example after this list).
-  * ''hadoop-default.xml'' - This file contains generic default settings for Hadoop daemons
and Map/Reduce jobs. '''Do not modify this file.'''
+  * {{{hadoop-default.xml}}} - This file contains generic default settings for Hadoop daemons
and Map/Reduce jobs. '''Do not modify this file.'''
-  * ''mapred-default.xml'' - This file contains site specific settings for the Hadoop Map/Reduce
daemons and jobs. The file is empty by default. Putting configuration properties in this file
will override Map/Reduce settings in the ''hadoop-default.xml'' file. Use this file to tailor
the behavior of Map/Reduce on your site.
+  * {{{mapred-default.xml}}} - This file contains site specific settings for the Hadoop Map/Reduce
daemons and jobs. The file is empty by default. Putting configuration properties in this file
will override Map/Reduce settings in the {{{hadoop-default.xml}}} file. Use this file to tailor
the behavior of Map/Reduce on your site.
-  * ''hadoop-site.xml'' - This file contains site specific settings for all Hadoop daemons
and Map/Reduce jobs. This file is empty by default. Settings in this file override the settings
in ''hadoop-default.xml'' and ''mapred-default.xml''. This file should contain settings that
must be respected by all servers and clients in a Hadoop installation, for instance, the location
of the namenode and the jobtracker.
+  * {{{hadoop-site.xml}}} - This file contains site specific settings for all Hadoop daemons
and Map/Reduce jobs. This file is empty by default. Settings in this file override those in
{{{hadoop-default.xml}}} and {{{mapred-default.xml}}}. This file should contain settings that
must be respected by all servers and clients in a Hadoop installation, for instance, the location
of the namenode and the jobtracker.
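As an example of the {{{slaves}}} file format mentioned above, the sketch below replaces the default {{{localhost}}} entry with one worker host per line; the hostnames are made-up placeholders.

{{{
# Overwrite conf/slaves with the hosts that should run datanodes and
# tasktrackers (hostnames below are illustrative examples only).
% cat > $HADOOP_INSTALL/hadoop/conf/slaves << 'EOF'
worker01.example.com
worker02.example.com
worker03.example.com
EOF
}}}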
  
  More details on configuration can be found on the HowToConfigure page.
  
@@ -44, +35 @@

  
  Take a pass at putting together basic configuration settings for your cluster. Some of the
settings that follow are required; others are recommended for more straightforward and predictable
operation.
  
-  * '''Hadoop Environment Settings''' - Make sure JAVA_HOME is set in ''hadoop-env.sh'' and
points to the Java installation you intend to use. You can set other environment variables
in ''hadoop-env.sh'' to suit your requirments. Some of the default settings refer to the variable
HADOOP_HOME. The value of HADOOP_HOME is automatically inferred from the location of the startup
scripts. HADOOP_HOME is the parent directory of the ''bin'' directory that holds the Hadoop
scripts. So, if the scripts are in ''/foo/bar/hadoop-install/hadoop-<ver>/bin'', then
HADOOP_HOME is ''/foo/bar/hadoop-install/hadoop-<ver>''.
+  * '''Hadoop Environment Settings''' - Ensure that {{{JAVA_HOME}}} is set in {{{hadoop-env.sh}}}
and points to the Java installation you intend to use. You can set other environment variables
in {{{hadoop-env.sh}}} to suit your requirements. Some of the default settings refer to the
variable {{{HADOOP_HOME}}}. The value of {{{HADOOP_HOME}}} is automatically inferred from
the location of the startup scripts. {{{HADOOP_HOME}}} is the parent directory of the {{{bin}}}
directory that holds the Hadoop scripts. In this instance it is {{{$HADOOP_INSTALL/hadoop}}}.
+  * '''Jobtracker and Namenode settings''' - Figure out where to run your namenode and jobtracker.
Set the variable {{{fs.default.name}}} to the namenode's intended host:port. Set the variable
{{{mapred.job.tracker}}} to the jobtracker's intended host:port. These settings should be in
{{{hadoop-site.xml}}}; a sample configuration sketch follows this list. You may also want to set one or more of the following ports (also in
{{{hadoop-site.xml}}}):
+   * {{{dfs.datanode.port}}}
+   * {{{dfs.info.port}}}
+   * {{{mapred.job.tracker.info.port}}}
+   * {{{mapred.task.tracker.output.port}}}
+   * {{{mapred.task.tracker.report.port}}}
  
+  * '''Data Path Settings''' - Figure out where your data goes. This includes settings for
where the namenode stores the namespace checkpoint and the edits log, where the datanodes
store filesystem blocks, storage locations for Map/Reduce intermediate output and temporary
storage for the HDFS client. The default values for these paths point to various locations
in {{{/tmp}}}. While this might be ok for a single node installation, for larger clusters
storing data in {{{/tmp}}} is not an option. These settings must also be in {{{hadoop-site.xml}}}.
It is important for these settings to be present in {{{hadoop-site.xml}}} because they can
otherwise be overridden by client configuration settings in Map/Reduce jobs. Set the following
variables to appropriate values:
+   * {{{dfs.name.dir}}}
+   * {{{dfs.data.dir}}}
+   * {{{dfs.client.buffer.dir}}}
+   * {{{mapred.local.dir}}}
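Pulling the settings above together, here is a hedged sketch of the two edits for a small cluster. The Java path, hostnames, ports and data directories are illustrative placeholders only, not recommended values; substitute values appropriate to your site.

{{{
# 1. In $HADOOP_INSTALL/hadoop/conf/hadoop-env.sh, set JAVA_HOME to your
#    Java installation with a line like the following (the path is an
#    example only):
#
#        export JAVA_HOME=/usr/local/java
#
# 2. Create a minimal $HADOOP_INSTALL/hadoop/conf/hadoop-site.xml with
#    the namenode, jobtracker and data path settings discussed above.
% cat > $HADOOP_INSTALL/hadoop/conf/hadoop-site.xml << 'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>namenode.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.client.buffer.dir</name>
    <value>/data/hadoop/dfs/client</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/mapred/local</value>
  </property>
</configuration>
EOF
}}}

As noted above, keeping these values in {{{hadoop-site.xml}}} ensures they are not overridden by client configuration settings in Map/Reduce jobs.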
-  * '''Jobtracker and Namenode settings''' - Figure out where to run your namenode and jobtracker.
Set the variable ''fs.default.name'' to the Namenodes intended host:port. Set the variable
''mapred.job.tracker'' to the jobtrackers intended host:port. These setting should be in ''hadoop-site.xml''.
You may also want to set one or more of the following ports (also in ''hadoop-site.xml''):
-   * dfs.datanode.port
-   * dfs.info.port
-   * mapred.job.tracker.info.port
-   * mapred.task.tracker.ouput.port
-   * mapred.task.tracker.report.port
  
+ === Formatting the Namenode ===
-  * '''Data Path Settings''' - Figure out where your data goes. This includes settings for
where the namenode stores the namespace checkpoint and the edits log, where the datanodes
store filesystem blocks, storage locations for Map/Reduce intermediate output and temporary
storage for the HDFS client. The default values for these paths point to various locations
in ''/tmp''. While this might be ok for a single node installation for larger clusters, storing
data in ''/tmp'' is not an option. These settings must also be in ''hadoop-site.xml''. It
is important for these settings to be present in ''hadoop-site.xml'' because they can otherwise
be overridden by client configuration settings in Map/Reduce jobs. Set the following variables
to appropriate values:
-   * dfs.name.dir
-   * dfs.data.dir
-   * dfs.client.buffer.dir
-   * mapred.local.dir
  
+ The first step to starting up your Hadoop installation is formatting the filesystem. You
need to do this the first time you set up a Hadoop installation. '''Do not''' format a running
filesystem; this will cause all your data to be erased. To format the filesystem, run the
command: [[BR]] {{{% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format}}}
- == Formatting the Namenode ==
-  * You are required to format the Namenode for your first installation. This is true only
for your first installation. Do not format a Namenode which was already running Hadoop. It
will clear up your DFS. Run bin/hadoop namenode -format to format your Namenode.
  
  === Starting a Single node cluster ===
-  * Run bin/start-all.sh. This will startup a Namenode, Datanode, Jobtracker and a Tasktracker
on your machine.
+ Run the command: [[BR]] {{{% $HADOOP_INSTALL/hadoop/bin/start-all.sh}}} [[BR]] This will
start up a Namenode, a Datanode, a Jobtracker and a Tasktracker on your machine.
+ 
  === Stopping a Single node cluster ===
-  * Run bin/stop-all.sh to stop all the daemons running on your machine.
+ Run the command: [[BR]] {{{% $HADOOP_INSTALL/hadoop/bin/stop-all.sh}}} [[BR]] This will stop all
the daemons running on your machine.
  
  === Starting up a real cluster ===
   * After formatting the namenode, run {{{$HADOOP_INSTALL/hadoop/bin/start-dfs.sh}}} on the namenode. This will bring up
DFS with the namenode running on the machine you ran the command on and datanodes on the
machines listed in the slaves file mentioned above. A sketch of the full sequence follows below.
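This sketch assumes the configuration and {{{slaves}}} file described earlier; it also assumes {{{start-mapred.sh}}} is run on the machine chosen as the jobtracker, analogous to running {{{start-dfs.sh}}} on the namenode.

{{{
# On the namenode machine: format the filesystem (first time only) and
# start the DFS daemons; datanodes come up on the hosts in conf/slaves.
% $HADOOP_INSTALL/hadoop/bin/hadoop namenode -format
% $HADOOP_INSTALL/hadoop/bin/start-dfs.sh

# On the jobtracker machine: start the Map/Reduce daemons; tasktrackers
# come up on the hosts in conf/slaves.
% $HADOOP_INSTALL/hadoop/bin/start-mapred.sh
}}}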
