hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "AmazonEC2" by SteveLoughran
Date Fri, 29 Jul 2011 11:44:25 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "AmazonEC2" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/AmazonEC2?action=diff&rev1=93&rev2=94

Comment:
rm linkspam, update URL to JDK, rm all refs to 0.17 or earlier.

  
  '''Version 0.17''' of Hadoop includes a few changes that provide support for multiple simultaneous
clusters and quicker startup times for large clusters, and adds a pre-configured Ganglia
installation. These differences are noted below.
  
- '''Note:''' Cloudera also provides their [[http://www.cloudera.com/hadoop-ec2|distribution
for Hadoop]] as an EC2 AMI with single command deployment and support for Hive/Pig out [[http://www.personal-injury-solicitors.com|Personal
Injury Solicitor]] of the box.
+ '''Note:''' Cloudera also provides their [[http://www.cloudera.com/hadoop-ec2|distribution
for Hadoop]] as an EC2 AMI with single command deployment and support for Hive/Pig out of
the box.
  
  == Preliminaries ==
  
@@ -31, +31 @@

  
  Clusters of Hadoop instances are created in a security group. Instances within the group
have unfettered access to one another. Machines outside the group (such as your workstation)
can only access instances on port 22 (for SSH), port 50030 (for the JobTracker's web interface,
permitting one to view job status), and port 50060 (for the TaskTracker's web interface, for
more detailed debugging).
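  The scripts normally create and authorize this security group for you. If you ever need to open these ports by hand, the EC2 API tools can do it; a rough sketch, with `hadoop-cluster` as a placeholder group name: {{{
 % ec2-authorize hadoop-cluster -P tcp -p 22 -s 0.0.0.0/0
 % ec2-authorize hadoop-cluster -P tcp -p 50030 -s 0.0.0.0/0
 % ec2-authorize hadoop-cluster -P tcp -p 50060 -s 0.0.0.0/0
 }}}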
  
- ('''Pre Hadoop 0.17''') These EC2 scripts require slave nodes to be able to establish SSH
connections to the master node (and vice versa). This is achieved after the cluster has launched
by copying the EC2 private key to all machines in the cluster.
- 
  == Setting up ==
-  * Unpack [[http://www.apache.org/dyn/closer.cgi/hadoop/core/|the latest Hadoop distribution]]
on your system (version 0.12.0 or later).
+  * Unpack [[http://www.apache.org/dyn/closer.cgi/hadoop/core/|the latest Hadoop distribution]]
on your system.
   * Edit all relevant variables in ''src/contrib/ec2/bin/hadoop-ec2-env.sh''; a sketch of typical settings follows this list.
     * Amazon Web Services variables (`AWS_ACCOUNT_ID`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
       * All need filling in - they can be found by logging in to http://aws.amazon.com/.
@@ -47, +45 @@

  % ec2-describe-images -x all | grep hadoop
  }}}
       * The default value for `S3_BUCKET` (`hadoop-ec2-images`) is for public images. Images
for Hadoop version 0.17.1 and later are in the `hadoop-images` bucket, so you should change
this variable if you want to use one of these images. You also need to change this if you
want to use a private image you have built yourself.      
-    * ('''Pre 0.17''') Hadoop cluster variables (`GROUP`, `MASTER_HOST`, `NO_INSTANCES`)
-      * `GROUP` specifies the private group to run the cluster in. Typically the default
value is fine.
-      * `MASTER_HOST` is the hostname of the master node in the cluster. You need to set
this to be a hostname that you have DNS control over - it needs resetting every time a cluster
is launched. Services such as [[http://www.dyndns.com/services/dns/dyndns/|DynDNS]] and [[http://developer.amazonwebservices.com/connect/thread.jspa?messageID=61609#61609|the
like]] make this fairly easy.
-      * `NO_INSTANCES` sets the number of instances in your cluster. You need to set this.
Currently Amazon limits the number of concurrent instances to 20.
  
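  For reference, once everything is filled in, the relevant lines of ''hadoop-ec2-env.sh'' look something like the sketch below - every value here is a placeholder, not a working credential: {{{
 # Placeholder credentials - substitute your own from http://aws.amazon.com/
 AWS_ACCOUNT_ID=123456789012
 AWS_ACCESS_KEY_ID=AKIAIOSEXAMPLE
 AWS_SECRET_ACCESS_KEY=ExampleSecretAccessKeyDoNotUse
 # Public 0.17.1+ images live in hadoop-images; point this at your own bucket for private images
 S3_BUCKET=hadoop-images
 }}}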
- == Running a job on a cluster (Pre 0.17) ==
-  * Open a command prompt in ''src/contrib/ec2''.
-  * Launch a EC2 cluster and start Hadoop with the following command. During execution of
this script you will be prompted to set up DNS. {{{
- % bin/hadoop-ec2 run
- }}}
-  * You will then be logged into the master node where you can start your job.
-    * For example, to test your cluster, try {{{
- # cd /usr/local/hadoop-*
- # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
- }}}
-  * You can check progress of your job at `http://<MASTER_HOST>:50030/`.
-  * You can login to the master node from your workstation by typing: {{{
- % bin/hadoop-ec2 login
- }}}
-  * When you have finished, shutdown the cluster with the following:
-    * For Hadoop 0.14.0 and newer:{{{
- % bin/hadoop-ec2 terminate-cluster
- }}}
-    * For Hadoop 0.13.1 and older: /!\ '''NB: this command will terminate ''all'' your EC2
instances. See [[https://issues.apache.org/jira/browse/HADOOP-1504|HADOOP-1504]].'''{{{
- % bin/hadoop-ec2 terminate
- }}}
  
- == Running a job on a cluster (0.17+) ==
+ == Running a job on a cluster ==
   * Open a command prompt in ''src/contrib/ec2''.
   * Launch an EC2 cluster and start Hadoop with the following command. You must supply a cluster
name (test-cluster) and the number of slaves (2). After the cluster boots, the public DNS
name will be printed to the console. {{{
  % bin/hadoop-ec2 launch-cluster test-cluster 2
@@ -94, +67 @@

   * Keep in mind that the master node is started first and configured, then all slave nodes
are booted simultaneously with boot parameters pointing to the master node. Even though the
`launch-cluster` command has returned, the whole cluster may not yet have 'booted'. You should
monitor the cluster via port 50030 to make sure all nodes are up.
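  If you would rather watch from a shell than a browser, one way is to poll port 50030 until the JobTracker answers; the hostname below is a placeholder for the public DNS name printed at launch: {{{
 % until curl -f -s http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:50030/ > /dev/null; do sleep 10; done
 }}}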
  
  <<Anchor(FromRemoteMachine)>>
- === Running a job on a cluster from a remote machine (0.17+) ===
+ === Running a job on a cluster from a remote machine ===
  In some cases it's desirable to be able to submit a job to a Hadoop cluster running in EC2
from a machine that's outside EC2 (for example a personal workstation). Similarly, it's convenient
to be able to browse/cat files in HDFS from a remote machine. One of the advantages of this
setup is that it obviates the need to create custom AMIs that bundle stock Hadoop AMIs and
user libraries/code. All the non-Hadoop code can be kept on the remote machine and made
available to Hadoop at job submission time (in the form of jar files and other files
that are copied into Hadoop's distributed cache). The only downside is the [[http://aws.amazon.com/ec2/#pricing|cost
of copying these data sets]] into EC2 and the latency involved in doing so.
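  To illustrate that last point, Hadoop's generic options can ship code and side files at submission time; the jar and file names here are hypothetical, and the main class must be run via `ToolRunner` for the generic options to be parsed: {{{
 % bin/hadoop jar my-job.jar org.example.MyJob -libjars my-deps.jar -files lookup.dat input output
 }}}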
  
  The recipe for doing this is well documented in [[http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/|this
Cloudera blog post]] and involves configuring Hadoop to use an SSH tunnel through the master
node. Note that this recipe only works when using EC2 scripts from versions of
Hadoop that have the fix for [[https://issues.apache.org/jira/browse/HADOOP-5839|HADOOP-5839]]
incorporated. (Alternatively, users can apply the patches from that JIRA to older versions of
Hadoop that do not have the fix.)
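  In outline - see the blog post for the authoritative version - the recipe amounts to opening a SOCKS proxy over SSH and pointing Hadoop's RPC layer at it: {{{
 # On the workstation: open a SOCKS proxy through the master (hostname is a placeholder)
 % ssh -D 6666 root@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
 # Then set these two properties in the local Hadoop configuration:
 #   hadoop.rpc.socket.factory.class.default = org.apache.hadoop.net.SocksSocketFactory
 #   hadoop.socks.server = localhost:6666
 }}}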
  
- == Troubleshooting (Pre 0.17) ==
- Running Hadoop on EC2 involves a high level of configuration, so it can take a few goes
to get the system working for your particular set up.
  
- If you are having problems with the Hadoop EC2 `run` command then you can run the following
in turn, which have the same effect but may help you to see where the problem is occurring:
{{{
- % bin/hadoop-ec2 launch-cluster
- % bin/hadoop-ec2 start-hadoop
- }}}
- 
- Currently, the scripts don't have much in the way of error detection or handling. If a script
produces an error, then you may need to use the Amazon EC2 tools for interacting with instances
directly - for example, to shutdown an instance that is mis-configured.
- 
- Another technique for debugging is to manually run the scripts line-by-line until the error
occurs. If you have feedback or suggestions, or need help then please use the Hadoop mailing
lists.
- 
- == Troubleshooting (0.17) ==
+ == Troubleshooting ==
  Running Hadoop on EC2 involves a high level of configuration, so it can take a few attempts
to get the system working for your particular setup.
  
  If you are having problems with the Hadoop EC2 `launch-cluster` command then you can run
the following commands in turn; they have the same overall effect but may help you to see where
the problem is occurring: {{{
@@ -157, +119 @@

     * ('''0.17''') AMI size selection (`INSTANCE_TYPE`)
       * When creating an AMI, `INSTANCE_TYPE` denotes the instance size the image will be
run on (small, large, or xlarge). Ultimately this determines whether the image is `i386` or `x86_64`,
so this value is also used on cluster startup.
     * Java variables
-      * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the [[http://java.sun.com/javase/downloads/index.jsp|Sun
Java downloads page]], select a recent stable JDK, and get the URL for the JDK (not JRE) labelled
"Linux self-extracting file".
+      * `JAVA_BINARY_URL` is the download URL for a Sun JDK. Visit the [[http://www.oracle.com/technetwork/java/javase/downloads/index.html|Oracle
Java downloads page]], select a [[HadoopJavaVersions|stable JDK]], and get the URL for the
JDK (not JRE) labelled "Linux self-extracting file".
       * `JAVA_VERSION` is the version number of the JDK to be installed. (A sketch of both Java settings follows this list.)
     * All other variables should be set as above.
   * Type {{{
  % bin/hadoop-ec2 create-image
  }}}
   * Accept the Java license terms.
-  * The script will create a new image, then bundle, upload and register it. This may[[http://www.palaceloan.com|payday
loans online]] take some time (20 minutes or more). Be patient - don't assume it's crashed!
+  * The script will create a new image, then bundle, upload and register it. This may take
some time (20 minutes or more). Be patient - don't assume it's crashed!
   * Terminate your instance using the command given by the script.
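  For the Java variables above, the resulting assignments in ''hadoop-ec2-env.sh'' look something like this sketch (the URL and version are placeholders - get a real link from the downloads page): {{{
 JAVA_BINARY_URL=http://download.example.com/jdk-6uNN-linux-i586.bin
 JAVA_VERSION=1.6.0_NN
 }}}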
  
  If you need to repeat this procedure to re-create an AMI then you will need to run `ec2-deregister`
to de-register the existing AMI. You might also want to use the `ec2-delete-bundle` command to
remove the AMI from S3 if you no longer need it.
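  For example (the AMI ID, bucket, and prefix below are placeholders): {{{
 % ec2-deregister ami-xxxxxxxx
 % ec2-delete-bundle -b my-hadoop-images -p hadoop-x.y.z-i386 -a $AWS_ACCESS_KEY_ID -s $AWS_SECRET_ACCESS_KEY
 }}}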
