= Running Hadoop on Amazon EC2 =

[http://www.amazon.com/gp/browse.html?node=201590011 Amazon EC2] is a computing service. One allocates a set of hosts, runs one's application on them, and then, when done, returns the hosts. Billing is hourly per host. EC2 thus permits one to deploy Hadoop on a cluster without having to own and operate one, renting hosts by the hour instead.

== Concepts ==

 * '''Amazon Machine Image (AMI)''', or ''image''. A bootable Linux image with software pre-installed.
 * '''instance'''. A host running an AMI.

== Building an Image ==

To use Hadoop it is easiest to build an image with most of the software you require already installed. Amazon provides [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=354&categoryID=87 good documentation] for building images. Follow that tutorial to learn how to install the EC2 command-line tools, etc.

To build an image for Hadoop:

 1. Start an instance of the Fedora base image.
 1. Log in to this instance (using ssh).
 1. Install [http://java.sun.com/javase/downloads/index.jsp Java]. Copy the link to the "Linux self-extracting file" from Sun's download page, then use '''wget''' to retrieve the JVM. Unpack it in a well-known location, like /usr/local. {{{
# cd /usr/local
# wget -O java.bin http://.../jdk-1_5_0_09-linux-i586.bin
# sh java.bin
# rm java.bin
}}}
 1. Install rsync. {{{
# yum install rsync
}}}
 1. (Optional) Install other tools you might need. {{{
# yum install emacs
# yum install subversion
}}}
 To install [http://ant.apache.org/ Ant], copy the download URL and then: {{{
# cd /usr/local
# wget http://.../apache-ant-1.6.5-bin.tar.gz
# tar xzf apache-ant-1.6.5-bin.tar.gz
# rm apache-ant-1.6.5-bin.tar.gz
}}}
 1. Install Hadoop. {{{
# cd /usr/local
# wget http://.../hadoop-X.X.X.tar.gz
# tar xzf hadoop-X.X.X.tar.gz
# rm hadoop-X.X.X.tar.gz
}}}
 1. Configure Hadoop (described below).
 1. Add executables to your PATH (e.g., /usr/local/hadoop-X.X.X/bin) and perform any other configuration that will make the system easy for you to use.
 1. Save the image, using Amazon's instructions.

== Configuring Hadoop ==

Hadoop is configured with a single master node and many slave nodes. To facilitate re-deployment without re-configuration, one may register a name in DNS for the master host, then reset the address for this name each time the cluster is re-deployed. Services such as [http://www.dyndns.com/services/dns/dyndns/ DynDNS] make this fairly easy. In the following, we refer to the master as '''master.mydomain.com'''. Please replace this with your actual master node's name.

In EC2, the local data volume is mounted as '''/mnt'''.

=== hadoop-env.sh ===

{{{
# Set Hadoop-specific environment variables here.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/local/jdk1.5.0_09

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/mnt/hadoop/logs

# host:path where hadoop code should be rsync'd from.  Unset by default.
export HADOOP_MASTER=master.mydomain.com:/usr/local/hadoop-X.X.X

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
export HADOOP_SLAVE_SLEEP=1
}}}

Since logs are kept under /mnt, create the log directory there: {{{
% mkdir -p /mnt/hadoop/logs
}}}

=== hadoop-site.xml ===

All of Hadoop's local data is stored relative to '''hadoop.tmp.dir''', so we need only specify this, plus the name of the master node for DFS (the NameNode) and MapReduce (the JobTracker). {{{
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>master.mydomain.com:50001</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master.mydomain.com:50002</value>
</property>

</configuration>
}}}

=== mapred-default.xml ===

This should vary with the size of your cluster. Typically '''mapred.map.tasks''' should be 10x the number of instances, and '''mapred.reduce.tasks''' should be 3x the number of instances. The following is thus configured for a 19-instance cluster (19 × 10 = 190 maps, 19 × 3 = 57 reduces). {{{
<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>190</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>57</value>
</property>

</configuration>
}}}

== Running your cluster ==
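With an image saved, a cluster is launched by starting instances of it. What follows is a minimal sketch using Amazon's EC2 command-line tools (installed per the tutorial referenced above); the AMI id and keypair name are placeholders, and the instance count matches the 19-instance example above. {{{
% ec2-run-instances ami-XXXXXXXX -k my-keypair -n 19
% ec2-describe-instances
}}}
Once the instances report as running, '''ec2-describe-instances''' shows their public hostnames. Pick one instance as the master, point '''master.mydomain.com''' at it (e.g., by updating your DynDNS entry), and list the other instances, one hostname per line, in the master's ''conf/slaves'' file.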
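Then, on the master, format a new distributed filesystem (only the first time a given DFS is used; formatting erases any existing DFS data) and start the daemons. Again a sketch, assuming Hadoop was unpacked in /usr/local as above: {{{
% cd /usr/local/hadoop-X.X.X
% bin/hadoop namenode -format
% bin/start-all.sh
}}}
'''start-all.sh''' starts the NameNode and JobTracker locally and, over ssh, a DataNode and TaskTracker on each host in ''conf/slaves''; with HADOOP_MASTER set as above, each slave first rsyncs the Hadoop code from the master (hence HADOOP_SLAVE_SLEEP). '''bin/stop-all.sh''' shuts the daemons down again.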