mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r887515 - in /websites/staging/mahout/trunk/content: ./ users/emr/use-an-existing-hadoop-ami.html
Date Thu, 21 Nov 2013 11:43:06 GMT
Author: buildbot
Date: Thu Nov 21 11:43:06 2013
New Revision: 887515

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/emr/use-an-existing-hadoop-ami.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 21 11:43:06 2013
@@ -1 +1 @@
-1544135
+1544136

Modified: websites/staging/mahout/trunk/content/users/emr/use-an-existing-hadoop-ami.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/emr/use-an-existing-hadoop-ami.html (original)
+++ websites/staging/mahout/trunk/content/users/emr/use-an-existing-hadoop-ami.html Thu Nov 21 11:43:06 2013
@@ -381,7 +381,8 @@
 
   <div id="content-wrap" class="clearfix">
    <div id="main">
-    <p>The following process was developed for launching Hadoop clusters in EC2 in
+    <h1 id="use-an-existing-hadoop-ami-with-mahout">Use an existing Hadoop AMI with Mahout</h1>
+<p>The following process was developed for launching Hadoop clusters in EC2 in
 order to benchmark Mahout's clustering algorithms using a large document
 set (see Mahout-588). Specifically, we used the ASF mail archives that have
 been parsed and converted to the Hadoop SequenceFile format
@@ -401,12 +402,14 @@ Projects Testing Program.</p>
 <h2 id="launch-hadoop-cluster">Launch Hadoop Cluster</h2>
 <p><a name="UseanExistingHadoopAMI-GatherAmazonEC2keys/securitycredentials"></a></p>
 <h4 id="gather-amazon-ec2-keys-security-credentials">Gather Amazon EC2 keys / security credentials</h4>
-<p>You will need the following:
-AWS Account ID
-Access Key ID
-Secret Access Key
-X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)
-EC2 Key-Pair (ssh public and private keys) for the US-EAST region.</p>
+<p>You will need the following:</p>
+<ul>
+<li>AWS Account ID</li>
+<li>Access Key ID</li>
+<li>Secret Access Key</li>
+<li>X.509 certificate and private key (e.g. cert-aws.pem and pk-aws.pem)</li>
+<li>EC2 Key-Pair (ssh public and private keys) for the US-EAST region.</li>
+</ul>
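The permissions requirement described on this page can be checked mechanically. A minimal sketch, assuming a key file named gsg-keypair.pem as in the page's own example:

```shell
# Hedged sketch: lock down an EC2 key-pair file and confirm the mode.
# gsg-keypair.pem is the example filename used on this page.
touch gsg-keypair.pem
chmod 600 gsg-keypair.pem
stat -c '%a' gsg-keypair.pem   # prints 600, i.e. -rw-------
```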
 <p>Please make sure the file permissions are "-rw-------" (e.g. chmod 600
 gsg-keypair.pem). You can create a key-pair for the US-East region using
 the Amazon console. If you are confused about any of these terms, please
@@ -446,13 +449,11 @@ you work through these steps.</p>
 <div class="codehilite"><pre>sudo mkdir -p /mnt/dev/downloads
 sudo chown -R ubuntu:ubuntu /mnt/dev
 cd /mnt/dev/downloads
-wget
+wget http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz &amp;&amp; cd /mnt/dev &amp;&amp; tar zxvf downloads/hadoop-0.20.2.tar.gz
+ln -s hadoop-0.20.2 hadoop
 </pre></div>
 
 
-<p>http://apache.mirrors.hoobly.com//hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
-&amp;&amp; cd /mnt/dev &amp;&amp; tar zxvf downloads/hadoop-0.20.2.tar.gz
-    ln -s hadoop-0.20.2 hadoop </p>
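The corrected block above follows a download-unpack-symlink pattern. A runnable sketch of that pattern, simulated with a locally built tarball since the 2013 mirror URL may no longer resolve (/tmp/dev stands in for /mnt/dev):

```shell
# Simulate the fetch with a locally created tarball; in practice this is
# the wget of hadoop-0.20.2.tar.gz from an Apache mirror.
mkdir -p /tmp/dev/downloads
cd /tmp/dev/downloads
mkdir -p hadoop-0.20.2
tar czf hadoop-0.20.2.tar.gz hadoop-0.20.2
# Unpack next to downloads/ and point a stable "hadoop" symlink at it.
cd /tmp/dev && tar zxf downloads/hadoop-0.20.2.tar.gz
ln -sfn hadoop-0.20.2 hadoop
```

The symlink keeps paths such as $HADOOP_HOME stable across version upgrades.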
 <p>The scripts we need are in $HADOOP_HOME/src/contrib/ec2. There are other
 approaches to deploying a Hadoop cluster on EC2, such as Cloudera's <a href="https://docs.cloudera.com/display/DOC/Cloudera+Documentation+Home+Page">CDH3</a>
 . We chose to use the contrib/ec2 scripts because they are very easy to use
@@ -540,23 +541,22 @@ processes on the master server using: ps
 is not the latest version of Mahout.</p>
 <div class="codehilite"><pre>mkdir -p /mnt/dev/downloads
 cd /mnt/dev/downloads
-wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz
+wget http://apache.mesi.com.ar//mahout/0.4/mahout-distribution-0.4.tar.gz &amp;&amp; cd /mnt/dev &amp;&amp; tar zxvf downloads/mahout-distribution-0.4.tar.gz
+ln -s mahout-distribution-0.4 mahout
 </pre></div>
 
 
-<p>&amp;&amp; cd /mnt/dev &amp;&amp; tar zxvf downloads/mahout-distribution-0.4.tar.gz
-    ln -s mahout-distribution-0.4 mahout</p>
 <p><a name="UseanExistingHadoopAMI-FromSource"></a></p>
 <h5 id="from-source">From Source</h5>
-<div class="codehilite"><pre>Install Subversion: &gt;yum install subversion //Note, you can also use Git,
+<div class="codehilite"><pre>Install Subversion: &gt;yum install subversion //Note, you can also use Git, so substitute in the appropriate URL
+svn co http://svn.apache.org/repos/asf/mahout/trunk mahout/trunk
+
+Install Maven 3.x and put it in the path
+cd mahout/trunk
+mvn install //Optionally add -DskipTests
 </pre></div>
 
 
-<p>so substitute in the appropriate URL
-    &gt; svn co http://svn.apache.org/repos/asf/mahout/trunk mahout/trunk
-    Install Maven 3.x and put it in the path
-    &gt; cd mahout/trunk
-    &gt; mvn install //Optionally add -DskipTests</p>
 <p><a name="UseanExistingHadoopAMI-ConfigureHadoop"></a></p>
 <h4 id="configure-hadoop">Configure Hadoop</h4>
 <p>You'll want to increase the Max Heap Size for the data nodes
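In Hadoop 0.20.x the daemon heap is usually raised via HADOOP_HEAPSIZE in conf/hadoop-env.sh. A hedged config sketch; the value is an illustrative assumption to adjust to the instance's RAM, not taken from this page:

```shell
# conf/hadoop-env.sh (Hadoop 0.20.x)
# HADOOP_HEAPSIZE is in MB; 2000 is an illustrative value, not from the page.
export HADOOP_HEAPSIZE=2000
```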
@@ -589,12 +589,11 @@ have 2 per node or only 1 if your jobs a
 <h4 id="copy-the-vectors-from-s3-to-hdfs">Copy the vectors from S3 to HDFS</h4>
 <p>Use Hadoop's distcp command to copy the vectors from S3 to HDFS.</p>
 <div class="codehilite"><pre>hadoop distcp -Dmapred.task.timeout=1800000 \
-s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors
+s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
+/asf-mail-archives/mahout-0.4/tfidf-vectors
 </pre></div>
 
 
-<p>\
-    /asf-mail-archives/mahout-0.4/tfidf-vectors</p>
 <p>The files are stored in the US-Standard S3 bucket so there is no charge for
 data transfer to your EC2 cluster, as it is running in the US-EAST region.</p>
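As an aside on the distcp source: an s3n:// URI embeds the credentials in the URI itself. A sketch with placeholder keys (assumptions, never real credentials); note that a real secret key containing "/" generally needs URL-escaping or s3n will mis-parse the URI:

```shell
# Placeholder credentials -- illustrative assumptions, not real keys.
ACCESS_KEY='AKIAEXAMPLE'
SECRET_KEY='exampleSecret'
# Assemble the s3n source URI as used in the distcp command above.
SRC="s3n://${ACCESS_KEY}:${SECRET_KEY}@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors"
echo "$SRC"
```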
 <p><a name="UseanExistingHadoopAMI-Launchtheclusteringjob(fromthemasterserver)"></a></p>
@@ -605,26 +604,20 @@ data transfer to your EC2 cluster, as it
   -o /asf-mail-archives/mahout-0.4/kmeans-clusters/ \
   --numClusters 100 \
   --maxIter 10 \
-  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure
+  --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
+  --convergenceDelta 0.01 &amp;
 </pre></div>
 
 
-<p>\
-      --convergenceDelta 0.01 &amp;</p>
 <p>You can monitor the job using the JobTracker Web UI through FoxyProxy.</p>
 <p><a name="UseanExistingHadoopAMI-DumpClusters"></a></p>
 <h4 id="dump-clusters">Dump Clusters</h4>
 <p>Once completed, you can view the results using Mahout's cluster dumper</p>
-<div class="codehilite"><pre>bin/mahout clusterdump --seqFileDir
+<div class="codehilite"><pre>bin/mahout clusterdump --seqFileDir /asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
+  --numWords 20 \
+  --dictionary s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0 \
+  --dictionaryType sequencefile --output clusters.txt --substring 100
 </pre></div>
-
-
-<p>/asf-mail-archives/mahout-0.4/kmeans-clusters/clusters-1/ \
-      --numWords 20 \
-      --dictionary
-s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/dictionary.file-0
-\
-      --dictionaryType sequencefile --output clusters.txt --substring 100</p>
    </div>
   </div>     
 </div> 


