hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "AmazonEC2" by DougCutting
Date Fri, 27 Oct 2006 17:58:08 GMT

The following page has been changed by DougCutting:

New page:
= Running Hadoop on Amazon EC2 =

[http://www.amazon.com/gp/browse.html?node=201590011 Amazon EC2] is a computing service.
One allocates a set of hosts, runs one's application on them, and returns the hosts when
done.  Billing is hourly per host.  Thus EC2 permits one to deploy Hadoop on a cluster
without having to own and operate one, renting it instead on an hourly basis.

== Concepts ==

 * '''Amazon Machine Image (AMI)''', or ''image''.  A bootable Linux image, with software pre-installed.
 * '''instance'''.  A host running an AMI.

== Building an Image ==

To use Hadoop it is easiest to build an image with most of the software you require already
installed.  Amazon provides [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=354&categoryID=87
good documentation] for building images.  Follow the tutorial here to learn how to install
the EC2 command-line tools, etc.

To build an image for Hadoop:

 1. Start an instance of the Fedora base image.

 1. Log in to this instance (using ssh).

 1. Install [http://java.sun.com/javase/downloads/index.jsp Java].  Copy the link of the "Linux
self-extracting file" from Sun's download page, then use '''wget''' to retrieve the JVM. 
Unpack this in a well-known location, like /usr/local. {{{
# cd /usr/local
# wget -O java.bin http://.../jdk-1_5_0_09-linux-i586.bin
# sh java.bin
# rm java.bin
}}}

 1. Install rsync. {{{
# yum install rsync
}}}

 1. (Optional) install other tools you might need. {{{
# yum install emacs
# yum install subversion
}}}  To install [http://ant.apache.org/ Ant], copy the download URL from the Ant site, then: {{{
# cd /usr/local
# wget http://.../apache-ant-1.6.5-bin.tar.gz
# tar xzf apache-ant-1.6.5-bin.tar.gz
# rm apache-ant-1.6.5-bin.tar.gz
}}}

 1. Install Hadoop. {{{
# cd /usr/local
# wget http://.../hadoop-X.X.X.tar.gz
# tar xzf hadoop-X.X.X.tar.gz
# rm hadoop-X.X.X.tar.gz
}}}

 1. Configure Hadoop (described below).

 1. Add executables to your PATH and perform other configurations that will make this system
easy for you to use.

 1. Save the image, using Amazon's instructions.
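
The installation steps above can be collected into one script.  The following is a minimal sketch, not a tested build procedure: the variables '''JAVA_URL''', '''ANT_URL''' and '''HADOOP_URL''' and the helpers '''run''', '''install_tarball''' and '''build_image''' are hypothetical names, and the download URLs must be supplied by the caller since they are not given here.  By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the image-building steps as one script (run as root on the instance).
# JAVA_URL, ANT_URL and HADOOP_URL are hypothetical variables that the caller
# must set; the real download URLs are not given in the text above.
# With DRY_RUN=1 (the default here) commands are only printed, not executed.

DRY_RUN="${DRY_RUN:-1}"
PREFIX="${PREFIX:-/usr/local}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Download a tarball into $PREFIX, unpack it there, and delete the archive.
install_tarball() {
    url="$1"
    file="${url##*/}"          # file name = last path component of the URL
    run wget -O "$PREFIX/$file" "$url"
    run tar xzf "$PREFIX/$file" -C "$PREFIX"
    run rm "$PREFIX/$file"
}

build_image() {
    # Java: the self-extracting installer unpacks into the current directory.
    run wget -O "$PREFIX/java.bin" "$JAVA_URL"
    run sh -c "cd $PREFIX && sh java.bin && rm java.bin"
    run yum install -y rsync
    install_tarball "$ANT_URL"
    install_tarball "$HADOOP_URL"
}
```

To review the commands before running anything, set the three URL variables and call '''build_image''' with '''DRY_RUN=1'''; set '''DRY_RUN=0''' to execute them.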

== Configuring Hadoop ==

Hadoop is configured with a single master node and many slave nodes.  To facilitate re-deployment
without re-configuration, one may register a name in DNS for the master host, then reset the
address for this name each time the cluster is re-deployed.  Services such as [http://www.dyndns.com/services/dns/dyndns/
DynDNS] make this fairly easy.  In the following, we refer to the master as '''master.mydomain.com'''.
Please replace this with your actual master node's name.

In EC2, the local data volume is mounted as '''/mnt'''.

=== hadoop-env.sh ===

{{{
# Set Hadoop-specific environment variables here.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/local/jdk1.5.0_09

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/mnt/hadoop/logs

# host:path where hadoop code should be rsync'd from.  Unset by default.
export HADOOP_MASTER=master.mydomain.com:/usr/local/hadoop-X.X.X

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
}}}

The log directory named above must exist: {{{
% mkdir -p /mnt/hadoop/logs
}}}
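
The '''HADOOP_MASTER''' value uses rsync's ''host:path'' form.  The sketch below shows how such a value splits into its two parts, and the kind of rsync command a slave could use to pull the code from the master.  The helper names ('''master_host''', '''master_path''', '''sync_command''') and the choice of rsync options are assumptions made for illustration; this is not Hadoop's own script.

```shell
#!/bin/sh
# Illustration only: split a HADOOP_MASTER value of the form host:path and
# print an rsync command that would pull the code from the master.
# master_host, master_path and sync_command are hypothetical helper names.

master_host() { echo "${1%%:*}"; }   # text before the first colon
master_path() { echo "${1#*:}"; }    # text after the first colon

sync_command() {
    # $1 = host:path source, $2 = local destination directory
    echo "rsync -a -e ssh --delete $1/ $2"
}

HADOOP_MASTER="master.mydomain.com:/usr/local/hadoop-X.X.X"
master_host "$HADOOP_MASTER"    # prints master.mydomain.com
master_path "$HADOOP_MASTER"    # prints /usr/local/hadoop-X.X.X
sync_command "$HADOOP_MASTER" "/usr/local/hadoop-X.X.X"
```

Printing the command rather than running it keeps the example safe to try on a machine that is not part of a cluster.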

=== hadoop-site.xml ===

All of Hadoop's local data is stored relative to '''hadoop.tmp.dir''', so we only need specify
this, plus the name of the master node for DFS (the NameNode) and MapReduce (the JobTracker).
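
As a hedged illustration of such a file, the following sketch writes a minimal '''hadoop-site.xml''' from the shell.  The property names '''fs.default.name''' and '''mapred.job.tracker''' are the standard Hadoop settings for the NameNode and JobTracker addresses; the port numbers (50001, 50002), the '''/mnt/hadoop''' directory and the helper name '''write_hadoop_site''' are assumptions chosen for the example, not values taken from this page.

```shell
#!/bin/sh
# Sketch: write a minimal hadoop-site.xml.
# The port numbers (50001, 50002) and the example directory /mnt/hadoop
# are assumed values; write_hadoop_site is a hypothetical helper name.

write_hadoop_site() {
    master="$1"     # e.g. master.mydomain.com
    tmp_dir="$2"    # e.g. /mnt/hadoop
    out_file="$3"   # e.g. conf/hadoop-site.xml
    cat > "$out_file" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>$master:50001</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>$master:50002</value>
  </property>
</configuration>
EOF
}
```

For example, '''write_hadoop_site master.mydomain.com /mnt/hadoop conf/hadoop-site.xml''' would produce a file with the three properties described above.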

=== mapred-default.xml ===

The values in this file should vary with the size of your cluster.  Typically '''mapred.map.tasks''' should be
10x the number of instances, and '''mapred.reduce.tasks''' should be 3x the number of instances.
The following is thus configured for a 19-instance cluster.
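
The rule of thumb above is simple arithmetic, worked here for the 19-instance case.  The helper names '''map_tasks''' and '''reduce_tasks''' are hypothetical; only the 10x and 3x multipliers come from the text.

```shell
#!/bin/sh
# Apply the rule of thumb: 10 map tasks and 3 reduce tasks per instance.
# map_tasks and reduce_tasks are hypothetical helper names.

map_tasks()    { echo $(( $1 * 10 )); }
reduce_tasks() { echo $(( $1 * 3 )); }

instances=19
echo "mapred.map.tasks=$(map_tasks "$instances")"        # prints mapred.map.tasks=190
echo "mapred.reduce.tasks=$(reduce_tasks "$instances")"  # prints mapred.reduce.tasks=57
```

Changing '''instances''' gives the corresponding values for a cluster of another size.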

== Running your cluster ==
