= Running Hadoop on Amazon EC2 =

[http://www.amazon.com/gp/browse.html?node=201590011 Amazon EC2] is a computing service. One allocates a set of hosts, runs one's application on them, and then, when done, returns the hosts. Billing is hourly per host. EC2 thus permits one to deploy Hadoop on a cluster without having to own and operate one, renting hosts by the hour instead.

== Concepts ==

 * '''Amazon Machine Image (AMI)''', or ''image''. A bootable Linux image with software pre-installed.
 * '''instance'''. A host running an AMI.

== Building an Image ==

To use Hadoop it is easiest to build an image with most of the software you require already installed. Amazon provides [http://developer.amazonwebservices.com/connect/entry.jspa?externalID=354&categoryID=87 good documentation] for building images. Follow that tutorial to learn how to install the EC2 command-line tools, etc.

To build an image for Hadoop:

 1. Start an instance of the Fedora base image.
 1. Log in to this instance (using ssh).
 1. Install [http://java.sun.com/javase/downloads/index.jsp Java]. Copy the link to the "Linux self-extracting file" from Sun's download page, then use '''wget''' to retrieve the JVM. Unpack it in a well-known location, like /usr/local. {{{
# cd /usr/local
# wget -O java.bin http://.../jdk-1_5_0_09-linux-i586.bin
# sh java.bin
# rm java.bin
}}}
 1. Install rsync. {{{
# yum install rsync
}}}
 1. (Optional) Install other tools you might need. {{{
# yum install emacs
# yum install subversion
}}}
 To install [http://ant.apache.org/ Ant], copy the download URL and then: {{{
# cd /usr/local
# wget http://.../apache-ant-1.6.5-bin.tar.gz
# tar xzf apache-ant-1.6.5-bin.tar.gz
# rm apache-ant-1.6.5-bin.tar.gz
}}}
 1. Install Hadoop. {{{
# cd /usr/local
# wget http://.../hadoop-X.X.X.tar.gz
# tar xzf hadoop-X.X.X.tar.gz
# rm hadoop-X.X.X.tar.gz
}}}
 1. Configure Hadoop (described below).
 1. Add executables to your PATH (e.g., /usr/local/hadoop-X.X.X/bin) and perform any other configuration that will make the system easy for you to use.
 1. Save the image, using Amazon's instructions.

== Configuring Hadoop ==

Hadoop is configured with a single master node and many slave nodes. To facilitate re-deployment without re-configuration, one may register a name in DNS for the master host, then reset the address for this name each time the cluster is re-deployed. Services such as [http://www.dyndns.com/services/dns/dyndns/ DynDNS] make this fairly easy. In the following, we refer to the master as '''master.mydomain.com'''. Please replace this with your actual master node's name.

In EC2, the local data volume is mounted as '''/mnt'''.

=== hadoop-env.sh ===

{{{
# Set Hadoop-specific environment variables here.

# The java implementation to use.  Required.
export JAVA_HOME=/usr/local/jdk1.5.0_09

# Where log files are stored.  $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/mnt/hadoop/logs

# host:path where hadoop code should be rsync'd from.  Unset by default.
export HADOOP_MASTER=master.mydomain.com:/usr/local/hadoop-X.X.X

# Seconds to sleep between slave commands.  Unset by default.  This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
export HADOOP_SLAVE_SLEEP=1
}}}

Since logs are kept under /mnt, create the log directory there: {{{
% mkdir -p /mnt/hadoop/logs
}}}

=== hadoop-site.xml ===

All of Hadoop's local data is stored relative to '''hadoop.tmp.dir''', so we need only specify this, plus the name of the master node for DFS (the NameNode) and MapReduce (the JobTracker). {{{
<configuration>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>master.mydomain.com:50001</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>master.mydomain.com:50002</value>
</property>

</configuration>
}}}

=== mapred-default.xml ===

This should vary with the size of your cluster. Typically '''mapred.map.tasks''' should be 10x the number of instances, and '''mapred.reduce.tasks''' should be 3x the number of instances. The following is thus configured for a 19-instance cluster (19 × 10 = 190 maps, 19 × 3 = 57 reduces). {{{
<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>190</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>57</value>
</property>

</configuration>
}}}

== Running your cluster ==
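With an image saved, a cluster is launched by starting instances of it. What follows is a minimal sketch using Amazon's EC2 command-line tools (installed per the tutorial referenced above); the AMI id and keypair name are placeholders, and the instance count matches the 19-instance example above. {{{
% ec2-run-instances ami-XXXXXXXX -k my-keypair -n 19
% ec2-describe-instances
}}}
Once the instances report as running, '''ec2-describe-instances''' shows their public hostnames. Pick one instance as the master, point '''master.mydomain.com''' at it (e.g., by updating your DynDNS entry), and list the other instances, one hostname per line, in the master's ''conf/slaves'' file.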
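Then, on the master, format a new distributed filesystem (only the first time a given DFS is used; formatting erases any existing DFS data) and start the daemons. Again a sketch, assuming Hadoop was unpacked in /usr/local as above: {{{
% cd /usr/local/hadoop-X.X.X
% bin/hadoop namenode -format
% bin/start-all.sh
}}}
'''start-all.sh''' starts the NameNode and JobTracker locally and, over ssh, a DataNode and TaskTracker on each host in ''conf/slaves''; with HADOOP_MASTER set as above, each slave first rsyncs the Hadoop code from the master (hence HADOOP_SLAVE_SLEEP). '''bin/stop-all.sh''' shuts the daemons down again.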