Subject: svn commit: r1005565 - in /incubator/whirr/trunk: ./ src/site/ src/site/confluence/ src/site/confluence/contrib/ src/site/confluence/contrib/python/ Date: Thu, 07 Oct 2010 18:30:55 -0000 To: whirr-commits@incubator.apache.org From: tomwhite@apache.org Reply-To: whirr-dev@incubator.apache.org Message-Id: <20101007183055.907DA238890D@eris.apache.org> Author: tomwhite Date: Thu Oct 7 18:30:54 2010 New Revision: 1005565 URL: http://svn.apache.org/viewvc?rev=1005565&view=rev Log: WHIRR-112. Expand documentation.
Added: incubator/whirr/trunk/src/site/confluence/api-guide.confluence incubator/whirr/trunk/src/site/confluence/contrib/ incubator/whirr/trunk/src/site/confluence/contrib/python/ incubator/whirr/trunk/src/site/confluence/contrib/python/automatically-shutting-down-a-cluster.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/configuring-and-running.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/installation.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/launching-a-cluster.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/running-mapreduce-jobs.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/running-zookeeper.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/terminating-a-cluster.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/using-command-line-options.confluence incubator/whirr/trunk/src/site/confluence/contrib/python/using-persistent-clusters.confluence incubator/whirr/trunk/src/site/confluence/faq.confluence Modified: incubator/whirr/trunk/CHANGES.txt incubator/whirr/trunk/src/site/confluence/configuration-guide.confluence incubator/whirr/trunk/src/site/confluence/index.confluence incubator/whirr/trunk/src/site/confluence/quick-start-guide.confluence incubator/whirr/trunk/src/site/site.xml Modified: incubator/whirr/trunk/CHANGES.txt URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/CHANGES.txt?rev=1005565&r1=1005564&r2=1005565&view=diff ============================================================================== --- incubator/whirr/trunk/CHANGES.txt (original) +++ incubator/whirr/trunk/CHANGES.txt Thu Oct 7 18:30:54 2010 @@ -26,6 +26,8 @@ Trunk (unreleased changes) WHIRR-110. Create client-side Hadoop configuration file during cluster launch. (tomwhite) + WHIRR-112. Expand documentation. (tomwhite) + BUG FIXES WHIRR-93. Fail on checkstyle violation. (tomwhite) Added: incubator/whirr/trunk/src/site/confluence/api-guide.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/api-guide.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/api-guide.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/api-guide.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,7 @@ +h1. API Guide + +Whirr provides a Java API for stopping and starting clusters. Please see the +[javadoc|apidocs/index.html] and the unit test source code for how to +achieve this. + +There's also some example code at [http://github.com/hammer/whirr-demo]. \ No newline at end of file Modified: incubator/whirr/trunk/src/site/confluence/configuration-guide.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/configuration-guide.confluence?rev=1005565&r1=1005564&r2=1005565&view=diff ============================================================================== --- incubator/whirr/trunk/src/site/confluence/configuration-guide.confluence (original) +++ incubator/whirr/trunk/src/site/confluence/configuration-guide.confluence Thu Oct 7 18:30:54 2010 @@ -1,50 +1,51 @@ -h1. Java +h1. Configuration Guide Whirr is configured using a properties file, and optionally using command line arguments when using the CLI. Command line arguments take precedence over properties specified in a properties file. 
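For example (a hypothetical file and invocation, combining properties from the table below with a command-line override; the exact option set is only a sketch):

{code}
# hadoop.properties -- a minimal sketch
whirr.service-name=hadoop
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 jt+nn,1 dn+tt

# The command-line value takes precedence over the one in the file,
# so this launches 1 jt+nn and 5 dn+tt instances:
% bin/whirr launch-cluster --config hadoop.properties \
    --instance-templates='1 jt+nn,5 dn+tt'
{code}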
|| Name || Command line option || Default || Description || -| {{config}} | {{\--config}} | none | A filename of a properties file containing properties in this table. | -| {{service-name}} | {{\--service-name}} | none | The name of the service to use. E.g. {{hadoop}} | -| {{cluster-name}} | {{\--cluster-name}} | none | The name of the cluster to operate on. E.g. {{hadoopcluster}}. The cluster name is used to tag the instances in some cloud-specific way. For example, in Amazon it is used to form the security group name. | -| {{instance-templates}} | {{\--instance-templates}} | none | The number of instances to launch for each set of roles. E.g. {{1 nn+jt 1 dn+tt}}. | -| {{provider}} | {{\--provider}} | {{ec2}} | The name of the cloud provider. | -| {{identity}} | {{\--identity}} | none | The cloud identity. See the table below for how this maps to the credentials for your provider. | -| {{credential}} | {{\--credential}} | none | The cloud credential. See the table below for how this maps to the credentials for your provider. | -| {{secret-key-file}} | {{\--secret-key-file}} | _\~/.ssh/id\_rsa_ | The filename of the private key used to connect to instances. | -| {{image-id}} | {{\--image-id}} | none | The ID of the image to use for instances. If not specified then a vanilla Ubuntu image is chosen. | -| {{hardware-id}} | {{\--hardware-id}} | none | The type of hardware to use for the instance. This must be compatible with the image ID. | -| {{location-id}} | {{\--location-id}} | none | The location to launch instances in. If not specified then an arbitrary location will be chosen. | -| {{client-cidrs}} | {{\--client-cidrs}} | none | A comma-separated list of [CIDR |http://en.wikipedia.org/wiki/Classless\_Inter-Domain\_Routing] blocks. E.g. {{208.128.0.0/11,108.128.0.0/11}} | -| {{run-url-base}} | {{\--run-url-base}} | {{http://whirr.s3.amazonaws.com/VERSION/}} | The base URL for forming run urls from. Change this to host your own set of launch scripts. | +| {{whirr.config}} | {{\--config}} | none | A filename of a properties file containing properties in this table. Note that Whirr properties specified in this file all have a {{whirr.}} prefix. | +| {{whirr.service-name}} | {{\--service-name}} | none | The name of the service to use. E.g. {{hadoop}}. | +| {{whirr.cluster-name}} | {{\--cluster-name}} | none | The name of the cluster to operate on. E.g. {{hadoopcluster}}. The cluster name is used to tag the instances in some cloud-specific way. For example, in Amazon it is used to form the security group name. | +| {{whirr.instance-templates}} | {{\--instance-templates}} | none | The number of instances to launch for each set of roles. E.g. {{1 nn+jt,10 dn+tt}} means one instance with the roles {{nn}} (namenode) and {{jt}} (jobtracker), and ten instances each with the roles {{dn}} (datanode) and {{tt}} (tasktracker). | +| {{whirr.provider}} | {{\--provider}} | {{ec2}} | The name of the cloud provider. See the [table below|#cloud-provider-config] for possible provider names.| +| {{whirr.identity}} | {{\--identity}} | none | The cloud identity. See the [table below|#cloud-provider-config] for how this maps to the credentials for your provider. | +| {{whirr.credential}} | {{\--credential}} | none | The cloud credential. See the [table below|#cloud-provider-config] for how this maps to the credentials for your provider. | +| {{whirr.private-key-file}} | {{\--private-key-file}} | _\~/.ssh/id\_rsa_ | The filename of the private key used to connect to instances. 
| +| {{whirr.public-key-file}} | {{\--public-key-file}} | _\~/.ssh/id\_rsa_.pub | The filename of the public key used to connect to instances. | +| {{whirr.image-id}} | {{\--image-id}} | none | The ID of the image to use for instances. If not specified then a vanilla Linux image is chosen. | +| {{whirr.hardware-id}} | {{\--hardware-id}} | none | The type of hardware to use for the instance. This must be compatible with the image ID. | +| {{whirr.location-id}} | {{\--location-id}} | none | The location to launch instances in. If not specified then an arbitrary location will be chosen. | +| {{whirr.client-cidrs}} | {{\--client-cidrs}} | none | A comma-separated list of [CIDR |http://en.wikipedia.org/wiki/Classless\_Inter-Domain\_Routing] blocks. E.g. {{208.128.0.0/11,108.128.0.0/11}} | +| {{whirr.run-url-base}} | {{\--run-url-base}} | {{http://whirr.s3.amazonaws.com/VERSION/}} | The base URL for forming run urls from. Change this to host your own set of launch scripts, as explained in the [FAQ|faq#how-can-i-modify-the-instance-installation-and-configuration-scripts]. | +{anchor:cloud-provider-config} h2. Cloud provider specific configuration -|| Compute Service Provider || {{provider}} || {{identity}} || {{credential}} || Cluster name usage || -| Amazon EC2 | {{ec2}} | Access Key ID | Secret Access Key | Used to form security Group (via jclouds tag) | -| Rackspace Cloud Servers | {{cloudservers}} | Username | API Key | Server name ({{--}}) | +|| Compute Service Provider || {{whirr.provider}} || {{whirr.identity}} || {{whirr.credential}} || Cluster name usage || Notes || +| Amazon EC2 | {{ec2}} | Access Key ID | Secret Access Key | Used to form security Group (via jclouds tag) | | +| Rackspace Cloud Servers | {{cloudservers}} | Username | API Key | Server name ({{--}}) | Rackspace is not yet supported | -h1. Python +{anchor:comparison-with-python} +h1. Comparison with Python -See [https://docs.cloudera.com/display/DOC/Using+Command+Line+Options] - -h2. Comparison +See [Using Command Line Options|contrib/python/using-command-line-options]. || Python || Java || Notes || -| {{config-dir}} | {{config}} | | -| {{service}} | {{service-name}} | | -| none | {{cluster-name}} | Specified as a positional argument on the Python CLI. | -| none | {{instance-templates}} | Specified as a positional arguments on the Python CLI. | -| {{cloud-provider}} | {{provider}} | | -| none | {{identity}} | Specified using environment variables for Python. E.g. {{AWS\_ACCESS\_KEY\_ID}}, {{RACKSPACE\_KEY}} | -| none | {{credential}} | Specified using environment variables for Python. E.g. {{AWS\_ACCESS\_KEY\_ID}}, {{RACKSPACE\_SECRET}} | -| {{private-key-file}} | {{private-key-file}} | | -| {{client-cidr}} | {{client-cidrs}} | Python {{client-cidr}} option may be repeated multiple times, whereas Java {{client-cidrs}} contains comma-separated CIDRs. | -| none | {{run-url-base}} | Specified using {{user-data-file}} in Python. | -| {{public-key}} | none | Based on secret key in Java (add {{.pub}}). | -| {{image-id}} | {{image-id}} | | -| {{instance-type}} | {{hardware-id}} | | -| {{availability-zone}} | {{location-id}} | Location is more general than availability zone. | -| {{security-group}} | none | Amazon-specific. However, Amazon users may wish to start a cluster in additional security groups. | +| {{config-dir}} | {{whirr.config}} | | +| {{service}} | {{whirr.service-name}} | | +| none | {{whirr.cluster-name}} | Specified as a positional argument on the Python CLI. 
| +| none | {{whirr.instance-templates}} | Specified as a positional arguments on the Python CLI. | +| {{cloud-provider}} | {{whirr.provider}} | | +| none | {{whirr.identity}} | Specified using environment variables for Python. E.g. {{AWS\_ACCESS\_KEY\_ID}}, {{RACKSPACE\_KEY}} | +| none | {{whirr.credential}} | Specified using environment variables for Python. E.g. {{AWS\_ACCESS\_KEY\_ID}}, {{RACKSPACE\_SECRET}} | +| {{private-key}} | {{whirr.private-key-file}} | | +| {{public-key}} | {{whirr.public-key-file}} | | +| {{client-cidr}} | {{whirr.client-cidrs}} | Python's {{client-cidr}} option may be repeated multiple times, whereas Java's {{whirr.client-cidrs}} contains comma-separated CIDRs. | +| none | {{whirr.run-url-base}} | Specified using {{user-data-file}} in Python. | +| {{image-id}} | {{whirr.image-id}} | | +| {{instance-type}} | {{whirr.hardware-id}} | | +| {{availability-zone}} | {{whirr.location-id}} | Location is more general than availability zone. | +| {{security-group}} | none | Amazon-specific. However, Amazon users may wish to start a cluster in additional security groups, which isn't currently supported in Java. | | {{env}} | none | May not be needed in Java with runurls. | | {{user-data-file}} | none | Amazon-specific. Use runurls. | | {{key-name}} | none | Amazon-specific. Jclouds generates a new key for clusters. | Added: incubator/whirr/trunk/src/site/confluence/contrib/python/automatically-shutting-down-a-cluster.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/automatically-shutting-down-a-cluster.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/automatically-shutting-down-a-cluster.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/automatically-shutting-down-a-cluster.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,21 @@ +h1. Automatically Shutting Down a Cluster + +You can use the {{--auto-shutdown}} option to automatically terminate a cluster +at a specified number of minutes after launch. This is useful for short-lived +clusters where the jobs complete in a known amount of time. + +*To configure the automatic shutdown (for example, 50 minutes after launch):* +{code} +hadoop-ec2 launch-cluster --auto-shutdown 50 my-hadoop-cluster 2 +{code} + +You can also use the configuration property {{auto\_shutdown}} +in the configuration file; for example, to shut down 50 minutes after launch, +you would use {{auto\_shutdown=50}}. + +*To cancel the automatic shutdown:* +{code} +% hadoop-ec2 exec my-hadoop-cluster shutdown -c +% hadoop-ec2 update-slaves-file my-hadoop-cluster +% hadoop-ec2 exec my-hadoop-cluster /usr/lib/hadoop/bin/slaves.sh shutdown -c +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/configuring-and-running.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/configuring-and-running.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/configuring-and-running.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/configuring-and-running.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,69 @@ +h1. Configuring and Running + +h2. 
Setting Environment Variables to Specify AWS Credentials + +You must specify your AWS credentials when using the cloud scripts (see + [How do I find my cloud credentials?|../../faq#how-do-i-find-my-cloud-credentials]). The +simplest way to do this is to set the environment variables (see +[this page|http://code.google.com/p/boto/wiki/BotoConfig] for other options): +* {{AWS\_ACCESS\_KEY\_ID}}: Your AWS Access Key ID +* {{AWS\_SECRET\_ACCESS\_KEY}}: Your AWS Secret Access Key + +h2. Configuring the Python Cloud Scripts + +To configure the scripts, create a directory called _.hadoop-cloud_ in your +home directory (note the leading period "."). In that directory, create a file +called _clusters.cfg_ that contains a section for each cluster you want to +control. The following example shows how to specify an i386 Ubuntu OS as the +AMI in a _clusters.cfg_ file. + +{code} +[my-hadoop-cluster] +image_id=ami-ed59bf84 +instance_type=c1.medium +key_name=tom +availability_zone=us-east-1c +private_key=/path/to/private/key/file +ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no +{code} + +You can select a suitable AMI from the following table: + +|| AMI (bucket/name) || ID || OS || +| cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-i386 | ami-ed59bf84 | Ubuntu 8.10 (Intrepid) | +| cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-x86_64 | ami-8759bfee | Ubuntu 8.10 (Intrepid) | +| cloudera-ec2-hadoop-images/cloudera-hadoop-fedora-20090623-i386 | ami-6159bf08 | Fedora release 8 (Werewolf) | +| cloudera-ec2-hadoop-images/cloudera-hadoop-fedora-20090623-x86_64 | ami-2359bf4a | Fedora release 8 (Werewolf) | + +*The architecture must be compatible with the instance type. For {{m1.small}} and {{c1.medium}} instances, use the i386 AMIs. For {{m1.large}}, {{m1.xlarge}}, and {{c1.xlarge}} instances, use the x86_64 AMIs. One of the high CPU instances ({{c1.medium}} or {{c1.xlarge}}) is recommended.* + +If you wish to use [CDH|http://www.cloudera.com/hadoop/] instead of Apache +Hadoop, use the following configuration: + +{code} +[my-hadoop-cluster] +image_id=ami-2d4aa444 +instance_type=c1.medium +key_name=tom +availability_zone=us-east-1c +private_key=/path/to/private/key/file +ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no +user_data_file=http://archive.cloudera.com/cloud/ec2/cdh3/hadoop-ec2-init-remote.sh +{code} + +Note that this example uses CDH3, as specified by the {{user\_data\_file}} +property (the version of Hadoop to install is determined by this script). +For CDH, use one of the AMIs from this table: + +|| AMI (bucket/name) || ID || OS || Notes || +| ubuntu-images/ubuntu-lucid-10.04-i386-server-20100427.1 | ami-2d4aa444 | Ubuntu 10.04 (Lucid) | This AMI is suitable for use with CDH3b2 onwards. See http://alestic.com/ | +| ubuntu-images/ubuntu-lucid-10.04-amd64-server-20100427.1 | ami-fd4aa494 | Ubuntu 10.04 (Lucid) | This AMI is suitable for use with CDH3b2 onwards. See http://alestic.com/ | + +h2. {anchor:running-a-basic-cloud-script}Running a Basic Cloud Script + +After specifying an AMI, you can run the {{hadoop-ec2}} script. It will display usage instructions when you invoke it without arguments. 
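The script talks to EC2 using the credentials described at the top of this page, so make sure the two environment variables are exported in the shell you run it from (the values below are placeholders):

{code}
% export AWS_ACCESS_KEY_ID=your-access-key-id
% export AWS_SECRET_ACCESS_KEY=your-secret-access-key
{code}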
+ +You can test that the script can connect to your cloud provider by typing: +{code} +% hadoop-ec2 list +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/installation.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/installation.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/installation.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/installation.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,40 @@ +h1. Installation + +The Python cloud scripts enable you to run Hadoop on cloud providers. +A working cluster will start immediately with one command. It's ideal for +running temporary Hadoop clusters to carry out a proof of concept, or to run a +few one-time jobs. Currently, the scripts support Amazon EC2 only, but in the +future other cloud providers may also be supported. + +Amazon Machine Images (AMIs) and associated launch scripts are provided that +make it easy to run Hadoop on EC2. Note that the AMIs contain only base packages +(such as Java), and not a particular version of Hadoop, because Hadoop is +installed at launch time. + +*In this section, command lines that start with {{#}} are executed on a cloud +instance, and command lines starting with a {{%}} are executed on your +workstation.* + +h2. Installing the Python Cloud Scripts + +The following prerequisites apply to using the Python cloud scripts: +* Python 2.5 +* boto 1.8d +* simplejson 2.0.9 + +You can install boto and simplejson by using +[easy\_install|http://pypi.python.org/pypi/setuptools]: +{code} +% easy_install "simplejson==2.0.9" +% easy_install "boto==1.8d" +{code} + +*NOTE: If you have both Python 2.5 and 2.6 on your system (e.g. OS X Snow Leopard), then you should use {{easy_install-2.5}}.* + +Alternatively, you might like to use the python-boto and python-simplejson RPM +and Debian packages. + +The Python Cloud scripts are packaged in the source tarball. Unpack the tarball +on your system. The scripts are in _contrib/python/src/py_. +For convenience, you can add this directory to your path. + Added: incubator/whirr/trunk/src/site/confluence/contrib/python/launching-a-cluster.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/launching-a-cluster.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/launching-a-cluster.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/launching-a-cluster.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,59 @@ +h1. Launching a Cluster + +After you install the client scripts and enter your EC2 account information, +starting a Hadoop cluster with 10 nodes is easy with a single command. +\\ +\\ +To launch a cluster called "my-hadoop-cluster" with 10 worker (slave) nodes, use this command: +{code} +% hadoop-ec2 launch-cluster my-hadoop-cluster 10 +{code} +\\ +This command boots the master node and 10 worker nodes. The master node runs the +Namenode, secondary Namenode, and Jobtracker, and each worker node runs a +Datanode and a Tasktracker.
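Once the command returns, you can confirm that the new instances are visible to the scripts by re-running the {{list}} command shown earlier (the exact output format may vary):

{code}
% hadoop-ec2 list
{code}

The launch command also has a longer form that spells out the roles assigned to each group of instances, shown next.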
+ +Equivalently, you can launch the cluster by using this command syntax: +{code} +% hadoop-ec2 launch-cluster my-hadoop-cluster 1 nn,snn,jt 10 dn,tt +{code} +Note that by using this syntax, you can also launch a split Namenode/Jobtracker +cluster. For example: +{code} +% hadoop-ec2 launch-cluster my-hadoop-cluster 1 nn,snn 1 jt 10 dn,tt +{code} +After the nodes have started and the Hadoop cluster is operational, the console +will display a message such as: +{code} +Browse the cluster at http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com/ +{code} +You can access Hadoop's web UI at the URL in the message. By default, port 80 is +opened for access from your client machine. You can change the firewall settings +(to allow access from a network, rather than just a single machine, for example) +by using the Amazon EC2 command line tools, or by using a tool such as +[Elastic Fox|http://developer.amazonwebservices.com/connect/entry.jspa?externalID=609]. +The security group to change is the one named {{-}}. +For example, for the Namenode in the cluster started above, it would be +{{my-hadoop-cluster-nn}}. + +For security reasons, traffic from the network your client is running on is +proxied through the master node of the cluster using an SSH tunnel (a SOCKS +proxy on port 6666). + +To set up the proxy, run the following command: +{code} +% eval `hadoop-ec2 proxy my-hadoop-cluster` +{code} +Note the backticks, which are used to evaluate the result of the command. This +allows you to stop the proxy later on (from the same terminal): +{code} +% kill $HADOOP_CLOUD_PROXY_PID +{code} +Web browsers need to be configured to use this proxy too, so you can view pages +served by worker nodes in the cluster. The most convenient way to do this is to +use a [proxy auto-config (PAC) file|http://en.wikipedia.org/wiki/Proxy\_auto-config] +file, such as [this one|http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac] +for Hadoop EC2 clusters. + +If you are using Firefox, then you may find +[FoxyProxy|http://foxyproxy.mozdev.org/] useful for managing PAC files. \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/running-mapreduce-jobs.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/running-mapreduce-jobs.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/running-mapreduce-jobs.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/running-mapreduce-jobs.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,43 @@ +h1. Running MapReduce Jobs + +After you launch a cluster, a {{hadoop-site.xml}} file is created in the +directory {{~/.hadoop-cloud/}}. You can use this to connect to +the cluster by setting the {{HADOOP\_CONF\_DIR}} environment variable. (It is +also possible to set the configuration file to use by passing it as a {{-conf}} +option to Hadoop Tools): +{code} +% export HADOOP_CONF_DIR=~/.hadoop-cloud/my-hadoop-cluster +{code} +*To browse HDFS:* +{code} +% hadoop fs -ls / +{code} +Note that the version of Hadoop installed locally should match the version +installed on the cluster. 
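A quick way to compare the two is to print both versions; this is only a sketch, reusing the {{exec}} subcommand shown in the automatic-shutdown section to run a command on the cluster:

{code}
% hadoop version                                      # version installed on your workstation
% hadoop-ec2 exec my-hadoop-cluster hadoop version    # version installed on the cluster
{code}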
+\\ +\\ +*To run a job locally:* +{code} +% hadoop fs -mkdir input # create an input directory +% hadoop fs -put $HADOOP_HOME/LICENSE.txt input # copy a file there +% hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output +% hadoop fs -cat output/part-* | head +{code} +The preceding examples assume that you installed Hadoop on your local machine. +But you can also run jobs within the cluster. +\\ +\\ +*To run jobs within the cluster:* + +1. Log into the Namenode: +{code} +% hadoop-ec2 login my-hadoop-cluster +{code} + +2. Run the job: +{code} +# hadoop fs -mkdir input +# hadoop fs -put /etc/hadoop/conf/*.xml input +# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output 'dfs\[a-z.]+' +# hadoop fs -cat output/part-* | head +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/running-zookeeper.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/running-zookeeper.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/running-zookeeper.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/running-zookeeper.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,21 @@ +h1. Running Apache ZooKeeper + +The main use of the Python Cloud Scripts is to run Hadoop clusters, but you can +also run other services such as Apache ZooKeeper. + +*To run Apache ZooKeeper, set the {{service}} parameter to {{zookeeper}}:* +{code} +[my-zookeeper-cluster] +service=zookeeper +ami=ami-ed59bf84 +instance_type=m1.small +key_name=tom +availability_zone=us-east-1c +public_key=/path/to/public/key/file +private_key=/path/to/private/key/file +{code} + +*To launch a three-node ZooKeeper ensemble:* +{code} +% ./hadoop-ec2 launch-cluster my-zookeeper-cluster 3 zk +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/terminating-a-cluster.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/terminating-a-cluster.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/terminating-a-cluster.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/terminating-a-cluster.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,15 @@ +h1. Terminating a Cluster + +When you are done using your cluster, you can terminate all instances in it. + +*WARNING: All data will be deleted when you terminate the cluster, unless you are using EBS.* + +*To terminate a cluster:* +{code} +% hadoop-ec2 terminate-cluster my-hadoop-cluster +{code} + +*To delete the EC2 security groups:* +{code} +% hadoop-ec2 delete-cluster my-hadoop-cluster +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/contrib/python/using-command-line-options.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/using-command-line-options.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/using-command-line-options.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/using-command-line-options.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,77 @@ +h1. 
Using Command Line Options + +It is possible to specify options on the command line when you launch a cluster. +The options take precedence over any settings specified in the configuration +file. + +For example, the following command launches a 10-node cluster using a specified +image and instance type, overriding the equivalent settings (if any) that are +in the {{my-hadoop-cluster}} section of the configuration file. Note that words +in options are separated by hyphens ({{--instance-type}}) while words in the +corresponding configuration parameters are separated by underscores +({{instance\_type}}). +{code} +% hadoop-ec2 launch-cluster --image-id ami-2359bf4a --instance-type c1.xlarge \ +my-hadoop-cluster 10 +{code} + +If there are options that you want to specify multiple times, you can set them +in the configuration file by separating them with newlines (and leading +whitespace). For example: +{code} +env=AWS_ACCESS_KEY_ID=... + AWS_SECRET_ACCESS_KEY=... +{code} + +The scripts install Hadoop from a tarball (or, in the case of CDH, from RPMs or +Debian packages, depending on the OS) at instance boot time. + +By default, Apache Hadoop 0.20.1 is installed. To run a different version of +Hadoop, change the {{user\_data\_file}} setting. + +For example, to use the latest version of CDH3 add the following parameter: +{code} +--user-data-file http://archive.cloudera.com/cloud/ec2/cdh3/hadoop-ec2-init-remote.sh +{code} +By default, the latest version of the specified CDH release series is used. To +use a particular release of CDH, set {{REPO}} using the {{env}} parameter, in addition to +setting {{user\_data\_file}}. For example, to specify the Beta 1 release of CDH3: +{code} +--env REPO=cdh3b1 +{code} +For this release, Hadoop configuration files can be found in {{/etc/hadoop/conf}} and logs are in {{/var/log/hadoop}}. + + +h2. Customization + +You can specify a list of packages to install on every instance at boot time by +using the {{--user-packages}} command-line option or the {{user\_packages}} +configuration parameter. Packages should be space-separated. Note that package +names should reflect the package manager being used to install them ({{yum}} or +{{apt-get}} depending on the OS). + +For example, to install RPMs for R and git: +{code} +% hadoop-ec2 launch-cluster --user-packages 'R git-core' my-hadoop-cluster 10 +{code} +You have full control over the script that is run when each instance boots. The +default script, {{hadoop-ec2-init-remote.sh}}, may be used as a starting point +to add extra configuration or customization of the instance. Make a copy of the +script in your home directory, or somewhere similar, and set the +{{--user-data-file}} command-line option (or the {{user\_data\_file}} +configuration parameter) to point to the (modified) copy. This option may also +point to an arbitrary URL, which makes it easy to share scripts. + +For CDH, use the script located at [http://archive.cloudera.com/cloud/ec2/cdh3/hadoop-ec2-init-remote.sh]. + +The {{hadoop-ec2}} script will replace {{%ENV%}} in your user data script with +{{USER\_PACKAGES}}, {{AUTO\_SHUTDOWN}}, and {{EBS\_MAPPINGS}}, as well as extra +parameters supplied using the {{--env}} command-line flag. + +Another way of customizing the instance, which may be more appropriate for +larger changes, is to create your own image.
+ +It's possible to use any image, as long as it satisfies both of the following +conditions: +* Runs (gzip compressed) user data on boot +* Has Java installed Added: incubator/whirr/trunk/src/site/confluence/contrib/python/using-persistent-clusters.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/contrib/python/using-persistent-clusters.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/contrib/python/using-persistent-clusters.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/contrib/python/using-persistent-clusters.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,112 @@ +h1. Using Persistent Clusters + +*Support for Amazon Elastic Block Storage (EBS) is a Beta feature.* + +When not in use, an EBS cluster can surrender unneeded EC2 instances, then +restart later and continue where it left off. Users no longer need to copy +large volumes of data from S3 to local disk on the EC2 instance; data persists +reliably and independently in Amazon's EBS, saving compute costs. + +*Schematic showing how the cluster is set up:* +!../../images/persistent-ec2.png! + +*To Use a Persistent Cluster with EBS Storage* +# Create a new section called {{my-ebs-cluster}} in the +{{\~/.hadoop-cloud/clusters.cfg}} file. +# Create storage for the new cluster by creating a temporary EBS volume of size +100GiB, formatting it, and saving it as a snapshot in S3. This way, you only +have to do the formatting once. + +{code} +% hadoop-ec2 create-formatted-snapshot my-ebs-cluster 100 +{code} + +You create storage for a single Namenode and for two Datanodes. The volumes to +create are described in a JSON spec file, which references the snapshot you +just created. Here are the contents of a JSON file called +{{my-ebs-cluster-storage-spec.json}}: + +*Example contents of my-ebs-cluster-storage-spec.json* +{code} +{ + "nn": [ + { + "device": "/dev/sdj", + "mount_point": "/ebs1", + "size_gb": "100", + "snapshot_id": "snap-268e704f" + }, + { + "device": "/dev/sdk", + "mount_point": "/ebs2", + "size_gb": "100", + "snapshot_id": "snap-268e704f" + } + ], + "dn": [ + { + "device": "/dev/sdj", + "mount_point": "/ebs1", + "size_gb": "100", + "snapshot_id": "snap-268e704f" + }, + { + "device": "/dev/sdk", + "mount_point": "/ebs2", + "size_gb": "100", + "snapshot_id": "snap-268e704f" + } + ] +} +{code} + +Each role ({{nn}} and {{dn}}) is the key to an array of volume specifications. +In this example, each role has two devices ({{/dev/sdj}} and {{/dev/sdk}}) with +different mount points, each generated from an EBS snapshot. The snapshot is the +formatted snapshot created earlier, so that the volumes you create are +pre-formatted. The size of the drives must match the size of the snapshot +created earlier.
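Since the spec file is plain JSON, it is worth checking that it parses before creating any volumes. One way to do that (a sketch that reuses the {{simplejson}} module the scripts already require, and assumes the file name above):

{code}
% python -c 'import simplejson; simplejson.load(open("my-ebs-cluster-storage-spec.json")); print "spec parses OK"'
{code}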
+ +*To use this file to create actual volumes:* +{code} +% hadoop-ec2 create-storage my-ebs-cluster nn 1 \ +my-ebs-cluster-storage-spec.json +% hadoop-ec2 create-storage my-ebs-cluster dn 2 \ +my-ebs-cluster-storage-spec.json +{code} + +*To start the cluster with two slave nodes:* +{code} +% hadoop-ec2 launch-cluster my-ebs-cluster 1 nn,snn,jt 2 dn,tt +{code} + +*To login and run a job which creates some output:* +{code} +% hadoop-ec2 login my-ebs-cluster + +# hadoop fs -mkdir input +# hadoop fs -put /etc/hadoop/conf/*.xml input +# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output \ +'dfs[a-z.]+' +{code} + +*To view the output:* +{code} +# hadoop fs -cat output/part-* | head +{code} + +*To shutdown the cluster:* +{code} +% hadoop-ec2 terminate-cluster my-ebs-cluster +{code} + +*To restart the cluster and login (after a short delay):* +{code} +% hadoop-ec2 launch-cluster my-ebs-cluster 2 +% hadoop-ec2 login my-ebs-cluster +{code} + +*The output from the job you ran before should still be there:* +{code} +# hadoop fs -cat output/part-* | head +{code} \ No newline at end of file Added: incubator/whirr/trunk/src/site/confluence/faq.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/faq.confluence?rev=1005565&view=auto ============================================================================== --- incubator/whirr/trunk/src/site/confluence/faq.confluence (added) +++ incubator/whirr/trunk/src/site/confluence/faq.confluence Thu Oct 7 18:30:54 2010 @@ -0,0 +1,173 @@ +h1. Frequently Asked Questions + +{anchor:how-do-i-find-my-cloud-credentials} +h2. How do I find my cloud credentials? + +On EC2: +# Go to [http://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key] +# Log in, if prompted +# Find your Access Key ID and Secret Access Key in the "Access Credentials" section, under the "Access Keys" tab. You will have to click "Show" to see the text of your secret access key. + +Another good resource is [Understanding Access Credentials for AWS/EC2|http://alestic.com/2009/11/ec2-credentials] by Eric Hammond. + +h2. How do I access my cluster from a different network? + +By default, access to clusters is restricted to the single IP address of the +machine starting the cluster, as determined by +[Amazon's check IP service|http://checkip.amazonaws.com/]. However, some +networks report multiple origin IP addresses (e.g. they round-robin between +them by connection), which may cause problems if the address used for later +connections is different to the one reported at the time of the first +connection. + +A related problem is when you wish to access the cluster from a different +network to the one it was launched from. + +In these cases you can specify the IP addresses of the machines that may connect +to the cluster by setting the {{client-cidrs}} property to a comma-separated +list of [CIDR|http://en.wikipedia.org/wiki/Classless\_Inter-Domain\_Routing] +blocks. + +For example, {{208.128.0.0/16,38.102.147.107/32}} would allow access from the +{{208.128.0.0}} class B network, and the (single) IP address 38.102.147.107. + +h2. How can I start a cluster in a particular location? + +By default clusters are started in an arbitrary location (e.g. region or +data center). You can control the location by setting {{location-id}} (see the +[configuration guide|configuration-guide] for details). 
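On the command line this is the {{\--location-id}} option; in a properties file it carries the usual {{whirr.}} prefix. A minimal sketch of both forms:

{code}
# in a properties file
whirr.location-id=us-east-1a

# or on the command line
% bin/whirr launch-cluster --config hadoop.properties --location-id us-east-1a
{code}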
+ +For example, in EC2, setting {{location-id}} to {{us-east-1}} would start the +cluster in the US-East region, while setting it to {{us-east-1a}} (note the +final {{a}}) would start the cluster in that particular availability zone +({{us-east-1a}}) in the US-East region. + +h2. How do I log in to a node in the cluster? + +On EC2, if you know the node's address you can do + +{code} +ssh -i ~/.ssh/id_rsa ec2-user@host +{code} + +This assumes that you use the default private key; if this is not the case then +specify the one you used at cluster launch. + +The Amazon Linux AMI requires that you login as {{ec2-user}}. If needed, you can +become root by doing {{sudo su -}} after logging in. + +{anchor:how-can-i-modify-the-instance-installation-and-configuration-scripts} +h2. How can I modify the instance installation and configuration scripts? + +The scripts to install and configure cloud instances are downloaded from an S3 +bucket by the instances at, or after, boot time. The base URL defaults to +{{http://whirr.s3.amazonaws.com/VERSION/}}, where {{VERSION}} is the +version of Whirr. (Note that S3 buckets are not browsable, so you can't +use a browser to look at these scripts unless you know the URL.) + +If you want to change the scripts then you can place a modified copy of the +scripts in the _scripts_ directory of the distribution on a webserver (such as +S3) and change the base URL, by setting the {{run-url-base}} property. + +For example, by setting {{run-url-base}} to {{http://example.org/}} the scripts +would be loaded from the {{example.org}} domain. The Java install script, for +instance, would be requested from {{http://example.org/sun/java/install}}. + +Scripts have to be publicly readable, so on S3 you have to set the ACL to give +everyone read access. [S3Fox|http://www.s3fox.net/] is a useful Firefox +extension for uploading and managing script files on S3. + +You can debug the scripts that run on a cloud instance without having to log +into the instance, since the output is +sent to _whirr.log_ in the directory from which you launched the _whirr_ CLI. + +h2. How do I specify the service version and other service properties? + +Currently the only way to do this is to modify the scripts to install a +particular version of the service, or to change the service properties from +the defaults. + +See "How to modify the instance installation and configuration scripts" above +for details on how to do this. + +h2. How can I install custom packages? + +You can install extra software by modifying the scripts that run on +the cloud instances. See "How to modify the instance installation and +configuration scripts" above. + +h2. How do I run Cloudera's Distribution for Hadoop? + +You can run CDH rather than Apache Hadoop by running the {{hadoop}} service and +setting {{whirr.hadoop-install-runurl}} to {{cloudera/cdh/install}} (the +default is {{apache/hadoop/install}}). Here is a sample configuration: + +{code} +whirr.service-name=hadoop +whirr.cluster-name=myhadoopcluster +whirr.instance-templates=1 jt+nn,1 dn+tt +whirr.provider=ec2 +whirr.identity= +whirr.credential= +whirr.private-key-file=${sys:user.home}/.ssh/id_rsa +whirr.hadoop-install-runurl=cloudera/cdh/install +{code} + +{anchor:other-services} +h2. How do I run a ZooKeeper cluster? + +The {{service-name}} property determines the service to run, for ZooKeeper it +should be set to {{zookeeper}}. 
Here is a sample ZooKeeper configuration for +running a three-node ensemble: + +{code} +whirr.service-name=zookeeper +whirr.cluster-name=myzkcluster +whirr.instance-templates=3 zk +whirr.provider=ec2 +whirr.identity= +whirr.credential= +whirr.private-key-file=${sys:user.home}/.ssh/id_rsa +{code} + +h2. How do I run a Cassandra cluster? + +The {{service-name}} property determines the service to run, for Cassandra it +should be set to {{cassandra}}. Here is a sample Cassandra configuration for +running a three-node cluster: + +{code} +whirr.service-name=cassandra +whirr.cluster-name=mycassandracluster +whirr.instance-templates=3 cassandra +whirr.provider=ec2 +whirr.identity= +whirr.credential= +whirr.private-key-file=${sys:user.home}/.ssh/id_rsa +{code} + +h2. How do I automatically tear down a cluster after a fixed time? + +It's often convenient to terminate a cluster a fixed time after launch. This is +the case for test clusters, for example. You can achieve this by scheduling the +destroy command using the {{at}} command from your local machine. + +*WARNING: The machine from which you issued the {{at}} command must be running (and able +to contact the cloud provider) at the time it runs.* + +{code} +% echo 'bin/whirr destroy-cluster --config hadoop.properties' \ + | at 'now + 50 min' +{code} + +Note that issuing a {{shutdown}} command on an instance may simply stop the +instance, which is not sufficient to fully terminate the instance, in which +case you would continue to be charged for it. This is the +case for EBS boot instances, for example. + +You can read more about this technique on +[Eric Hammond's blog|http://alestic.com/2010/09/ec2-instance-termination]. + +Also, Mac OS X users might find +[this thread|http://superuser.com/questions/43678/mac-os-x-at-command-not-working] +a useful reference for the {{at}} command. Modified: incubator/whirr/trunk/src/site/confluence/index.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/index.confluence?rev=1005565&r1=1005564&r2=1005565&view=diff ============================================================================== --- incubator/whirr/trunk/src/site/confluence/index.confluence (original) +++ incubator/whirr/trunk/src/site/confluence/index.confluence Thu Oct 7 18:30:54 2010 @@ -26,6 +26,25 @@ h2. Getting Started You can use Whirr's CLI or APIs to [get started with Whirr|quick-start-guide]. +There is also an [FAQ|faq] which covers how to achieve common +tasks with Whirr, and a [configuration guide|configuration-guide] for reference. + h2. Getting Involved Have you got a suggestion for improving Whirr? It's easy to [get involved|https://cwiki.apache.org/confluence/display/WHIRR/How+To+Contribute]. + +h2. History + +The code that would become Whirr started out in 2007 as some +[bash scripts in Apache Hadoop|https://issues.apache.org/jira/browse/HADOOP-884] +for running Hadoop clusters on EC2. Later the scripts were +[ported to Python|https://issues.apache.org/jira/browse/WHIRR-3] +for extra features (such as EBS support) and a wider range of cloud providers. +These Python scripts are available today in Whirr under _contrib/python_. + +In May 2010 the [Apache Whirr Incubator|http://incubator.apache.org/whirr] +project was started to give a home to +the existing work that had been done, but also to create a Java version +using [jclouds|http://code.google.com/p/jclouds/] as the cloud provisioning +library. 
jclouds supports many providers and has a very rich API for running +code on instances, so it provides a very solid foundation for building Whirr on. \ No newline at end of file Modified: incubator/whirr/trunk/src/site/confluence/quick-start-guide.confluence URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/confluence/quick-start-guide.confluence?rev=1005565&r1=1005564&r2=1005565&view=diff ============================================================================== --- incubator/whirr/trunk/src/site/confluence/quick-start-guide.confluence (original) +++ incubator/whirr/trunk/src/site/confluence/quick-start-guide.confluence Thu Oct 7 18:30:54 2010 @@ -1,37 +1,37 @@ h1. Getting Started with Whirr -h2. Whirr CLI +The Whirr CLI provides the most convenient way to launch clusters. h3. Pre-requisites -You need to install Java 6 on your machine. Also, you need to have an account with a cloud provider, -such as Amazon EC2. +You need to install Java 6 on your machine. Also, you need to have an account with a cloud provider, such as Amazon EC2. h3. Install Whirr [Download|http://www.apache.org/dyn/closer.cgi/incubator/whirr/] or -[build|https://cwiki.apache.org/confluence/display/WHIRR/How+To+Contribute] Whirr. Call the directory -which contains the Whirr JAR files {{WHIRR\_HOME}} (you might like to define this as an environment variable). +[build|https://cwiki.apache.org/confluence/display/WHIRR/How+To+Contribute] Whirr. -You can test that Whirr is working by running -(this is for version {{0.1.0-incubating}}): +You can test that Whirr is working by running: {code} -% java -jar $WHIRR_HOME/whirr-cli-0.1.0-incubating.jar +% bin/whirr version {code} -It is handy to create an alias for whirr: +Which will display the version of Whirr that is installed. + +To get usage instructions and a list of available services type: {code} -% alias whirr='java -jar $WHIRR_HOME/whirr-cli-0.1.0-incubating.jar' +% bin/whirr {code} -h3. Launch a cluster +h3. Launch a Hadoop cluster First, create a properties file to define the cluster. The name doesn't matter, but here we will assume it is called _hadoop.properties_ and located in your home directory. This file defines a cluster with a single machine for the namenode and jobtracker, and -a further machine for a datanode and tasktracker. +a further machine for a datanode and tasktracker. You can see how to launch +other services in the [FAQ|faq#other-services]. {code} whirr.service-name=hadoop @@ -43,32 +43,114 @@ whirr.credential=_. Run it as a follows (in a new terminal window): + +{code} +% . ~/.whirr/myhadoopcluster/hadoop-proxy.sh +{code} + +To stop the proxy, just kill the process with Ctrl-C. -When you've finished using a cluster you can terminate the instances and clean up resources with +Web browsers need to be configured to use this proxy too, so you can view pages +served by worker nodes in the cluster. The most convenient way to do this is to +use a +[proxy auto-config (PAC) file|http://en.wikipedia.org/wiki/Proxy\_auto-config] +file, such as [this one|http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac] for +Hadoop EC2 clusters. +If you are using Firefox, then you may find +[FoxyProxy|http://foxyproxy.mozdev.org/] useful for managing PAC files. + +h3. Run a MapReduce job + +After you launch a cluster, a _hadoop-site.xml_ file is created in the directory +_~/.whirr/_. You can use this to connect to the cluster by setting +the {{HADOOP\_CONF\_DIR}} environment variable. 
+(It is also possible to set the configuration file to use by passing it as a +{{-conf}} option to Hadoop Tools): + +{code} +% export HADOOP_CONF_DIR=~/.whirr/myhadoopcluster {code} -% whirr destroy-cluster --config hadoop.properties + +You should now be able to browse HDFS: + +{code} +% hadoop fs -ls / {code} -h2. Whirr API +Note that the version of Hadoop installed locally should match the version +installed on the cluster. You should also make sure that the {{HADOOP\_HOME}} +environment variable is set. -Whirr provides a Java API for stopping and starting clusters. Please see the unit test source code for -how to achieve this. +Here's how you can run a MapReduce job: -There's also some example code at [http://github.com/hammer/whirr-demo]. +{code} +hadoop fs -mkdir input +hadoop fs -put $HADOOP_HOME/LICENSE.txt input +hadoop jar $HADOOP_HOME/hadoop-*examples.jar wordcount input output +hadoop fs -cat output/part-* | head +{code} + +h3. Configuration + +Whirr is configured using a properties file, and optionally using command line arguments when using the CLI. Command line arguments take precedence over properties specified in a properties file. + +For example, instead of using the properties file above, you could launch a +Hadoop cluster with the following command line (note that the {{whirr.}} prefix +for properties is not reflected in the command line argument): + +{code} +% bin/whirr launch-cluster \ + --service-name=hadoop \ + --cluster-name=myhadoopcluster \ + --instance-templates='1 jt+nn,1 dn+tt' \ + --provider=ec2 \ + --identity=$AWS_ACCESS_KEY_ID \ + --credential=$AWS_SECRET_ACCESS_KEY \ + --private-key-file=~/.ssh/id_rsa +{code} + +Notice that here we took advantage of the fact that the AWS credentials have +been defined in environment variables. + +See the [configuration guide|configuration-guide] for a list of all the configuration +properties you can set. + +h3. Destroy a cluster + +When you've finished using a cluster you can terminate the instances and clean up resources with the following. + +*WARNING: All data will be deleted when you destroy the cluster.* + +{code} +% bin/whirr destroy-cluster --config hadoop.properties +{code} +At this point you can shut down the SSH proxy to the cluster if you started one +earlier. Modified: incubator/whirr/trunk/src/site/site.xml URL: http://svn.apache.org/viewvc/incubator/whirr/trunk/src/site/site.xml?rev=1005565&r1=1005564&r2=1005565&view=diff ============================================================================== --- incubator/whirr/trunk/src/site/site.xml (original) +++ incubator/whirr/trunk/src/site/site.xml Thu Oct 7 18:30:54 2010