mahout-commits mailing list archives

From build...@apache.org
Subject svn commit: r948531 - in /websites/staging/mahout/trunk/content: ./ users/environment/how-to-build-an-app.html
Date Tue, 21 Apr 2015 00:22:47 GMT
Author: buildbot
Date: Tue Apr 21 00:22:47 2015
New Revision: 948531

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Apr 21 00:22:47 2015
@@ -1 +1 @@
-1675008
+1675009

Modified: websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html (original)
+++ websites/staging/mahout/trunk/content/users/environment/how-to-build-an-app.html Tue Apr 21 00:22:47 2015
@@ -273,106 +273,132 @@
 </ul>
 <h2 id="application">Application</h2>
 <p>Using Mahout as a library in an application will require a little Scala code. We
have an App trait in Scala so we'll create an object, which inherits from <code>App</code></p>
-<p><code>object CooccurrenceDriver extends App {
-}</code>
-This will look a little different than Java since <code>App</code> does delayed
initialization, which causes the main body to be executed when the App is launched, just as
in Java you would create a CooccurrenceDriver.main.</p>
+<div class="codehilite"><pre><span class="n">object</span> <span
class="n">CooccurrenceDriver</span> <span class="n">extends</span> <span
class="n">App</span> <span class="p">{</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>This will look a little different than Java since <code>App</code> does
delayed initialization, which causes the main body to be executed when the App is launched,
just as in Java you would create a CooccurrenceDriver.main.</p>
 <p>Before we can execute something on Spark we'll need to create a context. We could
use raw Spark calls here but default values are setup for a Mahout context.</p>
-<p><code>implicit val mc = mahoutSparkContext(masterUrl = "local", appName =
"2-input-cooc")</code>
-We need to read in three files containing different interaction types. The files will each
be read into a Mahout IndexedDataset. This allows us to preserve application-specific user
and item IDs throughout the calculations.</p>
+<div class="codehilite"><pre><span class="n">implicit</span> <span
class="n">val</span> <span class="n">mc</span> <span class="p">=</span>
<span class="n">mahoutSparkContext</span><span class="p">(</span><span
class="n">masterUrl</span> <span class="p">=</span> &quot;<span
class="n">local</span>&quot;<span class="p">,</span> <span class="n">appName</span>
<span class="p">=</span> &quot;2<span class="o">-</span><span
class="n">input</span><span class="o">-</span><span class="n">cooc</span>&quot;<span
class="p">)</span>
+</pre></div>
+
+
+<p>We need to read in three files containing different interaction types. The files
will each be read into a Mahout IndexedDataset. This allows us to preserve application-specific
user and item IDs throughout the calculations.</p>
 <p>For example, here is data/purchase.csv:</p>
-<p>```
-u1,iphone
-u1,ipad
-u2,nexus
-u2,galaxy
-u3,surface
-u4,iphone
-u4,galaxy</p>
-<p>```
-Mahout has a helper function that reads the text delimited in SparkEngine.indexedDatasetDFSReadElements.
The function reads single elements in a distributed way to create the IndexedDataset. </p>
+<div class="codehilite"><pre><span class="n">u1</span><span class="p">,</span><span
class="n">iphone</span>
+<span class="n">u1</span><span class="p">,</span><span class="n">ipad</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">nexus</span>
+<span class="n">u2</span><span class="p">,</span><span class="n">galaxy</span>
+<span class="n">u3</span><span class="p">,</span><span class="n">surface</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">iphone</span>
+<span class="n">u4</span><span class="p">,</span><span class="n">galaxy</span>
+</pre></div>
+
+
+<p>Mahout has a helper function that reads the text delimited in SparkEngine.indexedDatasetDFSReadElements.
The function reads single elements in a distributed way to create the IndexedDataset. </p>
 <p>Notice we read in all datasets before we adjust the number of rows in them to match
the total number of users in the data. This is so the math works out even if some users took
one action but not another.</p>
-<p>```
-/*<em>
- * Read files of element tuples and create IndexedDatasets one per action. These share a
userID BiMap but have
- * their own itemID BiMaps
- </em>/
-def readActions(actionInput: Array[(String, String)]): Array[(String, IndexedDataset)] =
{
-  var actions = Array<a href="">(String, IndexedDataset)</a></p>
-<p>val userDictionary: BiMap[String, Int] = HashBiMap.create()</p>
-<p>// The first action named in the sequence is the "primary" action and 
-  // begins to fill up the user dictionary
-  for ( actionDescription &lt;- actionInput ) {// grab the path to actions
-    val action: IndexedDataset = SparkEngine.indexedDatasetDFSReadElements(
-      actionDescription._2,
-      schema = DefaultIndexedDatasetElementReadSchema,
-      existingRowIDs = userDictionary)
-    userDictionary.putAll(action.rowIDs)
-    // put the name in the tuple with the indexedDataset
-    actions = actions :+ (actionDescription._1, action) 
-  }</p>
-<p>// After all actions are read in the userDictonary will contain every user seen,

-  // even if they may not have taken all actions . Now we adjust the row rank of 
-  // all IndxedDataset's to have this number of rows
-  // Note: this is very important or the cooccurrence calc may fail
-  val numUsers = userDictionary.size() // one more than the cardinality</p>
-<p>val resizedNameActionPairs = actions.map { a =&gt;
-    //resize the matrix by, in effect by adding empty rows
-    val resizedMatrix = a._2.create(a._2.matrix, userDictionary, a._2.columnIDs).newRowCardinality(numUsers)
-    (a._1, resizedMatrix) // return the Tuple of (name, IndexedDataset)
-  }
-  resizedNameActionPairs // return the array of Tuples
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Read files of element tuples and create IndexedDatasets
one per action. These share     a userID BiMap but have
+ <span class="o">*</span> their own itemID BiMaps
+ <span class="o">*/</span>
+def readActions<span class="p">(</span>actionInput: Array<span class="p">[(</span>String<span
class="p">,</span> String<span class="p">)])</span>: Array<span class="p">[(</span>String<span
class="p">,</span> IndexedDataset<span class="p">)]</span> <span class="o">=</span>
<span class="p">{</span>
+  var actions <span class="o">=</span> Array<span class="p">[(</span>String<span
class="p">,</span> IndexedDataset<span class="p">)]()</span>
+
+  val userDictionary: BiMap<span class="p">[</span>String<span class="p">,</span>
Int<span class="p">]</span> <span class="o">=</span> HashBiMap.create<span
class="p">()</span>
+
+  <span class="o">//</span> The first action named in the sequence is the <span
class="s">&quot;primary&quot;</span> action and 
+  <span class="o">//</span> begins to fill up the user dictionary
+  <span class="kr">for</span> <span class="p">(</span> actionDescription
<span class="o">&lt;-</span> actionInput <span class="p">)</span>
<span class="p">{</span><span class="o">//</span> grab the path to
actions
+    val action: IndexedDataset <span class="o">=</span> SparkEngine.indexedDatasetDFSReadElements<span
class="p">(</span>
+      actionDescription._2<span class="p">,</span>
+      schema <span class="o">=</span> DefaultIndexedDatasetElementReadSchema<span
class="p">,</span>
+      existingRowIDs <span class="o">=</span> userDictionary<span class="p">)</span>
+    userDictionary.putAll<span class="p">(</span>action.rowIDs<span class="p">)</span>
+    <span class="o">//</span> put the name in the tuple with the indexedDataset
+    actions <span class="o">=</span> actions :<span class="o">+</span>
<span class="p">(</span>actionDescription._1<span class="p">,</span>
action<span class="p">)</span> 
+  <span class="p">}</span>
+
+  <span class="o">//</span> After all actions are read in the userDictonary will
contain every user seen<span class="p">,</span> 
+  <span class="o">//</span> even if they may not have taken all actions <span
class="m">.</span> Now we adjust the row rank of 
+  <span class="o">//</span> all IndxedDataset<span class="s">&#39;</span><span
class="err">s to have this number of rows</span>
+  <span class="o">//</span> Note: this is very important or the cooccurrence
calc may fail
+  val numUsers <span class="o">=</span> userDictionary.size<span class="p">()</span>
<span class="o">//</span> one more than the cardinality
+
+  val resizedNameActionPairs <span class="o">=</span> actions.map <span class="p">{</span>
a <span class="o">=&gt;</span>
+    <span class="o">//</span>resize the matrix by<span class="p">,</span>
in effect by adding empty rows
+    val resizedMatrix <span class="o">=</span> a._2.create<span class="p">(</span>a._2.matrix<span
class="p">,</span> userDictionary<span class="p">,</span> a._2.columnIDs<span
class="p">)</span><span class="m">.</span>newRowCardinality<span class="p">(</span>numUsers<span
class="p">)</span>
+    <span class="p">(</span>a._1<span class="p">,</span> resizedMatrix<span
class="p">)</span> <span class="o">//</span> return the Tuple of <span
class="p">(</span>name<span class="p">,</span> IndexedDataset<span
class="p">)</span>
+  <span class="p">}</span>
+  resizedNameActionPairs <span class="o">//</span> return the array of Tuples
+<span class="p">}</span>
+</pre></div>
+
+
 <p>Now that we have the data read in we can perform the cooccurrence calculation.</p>
-<p>```
-// strip off names, which only takes and array of IndexedDatasets
-val indicatorMatrices = SimilarityAnalysis.cooccurrencesIDSs(actions.map(a =&gt; a._2))</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="c1">// strip off names, which
only takes and array of IndexedDatasets</span>
+<span class="n">val</span> <span class="n">indicatorMatrices</span>
<span class="o">=</span> <span class="n">SimilarityAnalysis</span><span
class="p">.</span><span class="n">cooccurrencesIDSs</span><span class="p">(</span><span
class="n">actions</span><span class="p">.</span><span class="n">map</span><span
class="p">(</span><span class="n">a</span> <span class="o">=&gt;</span>
<span class="n">a</span><span class="p">.</span><span class="n">_2</span><span
class="p">))</span>
+</pre></div>
+
+
 <p>All we need to do now is write the indicators.</p>
-<p><code>// zip a pair of arrays into an array of pairs, reattaching the action
names
-val indicatorDescriptions = actions.map(a =&gt; a._1).zip(indicatorMatrices)
-writeIndicators(indicatorDescriptions)</code></p>
+<div class="codehilite"><pre><span class="c1">// zip a pair of arrays into
an array of pairs, reattaching the action names</span>
+<span class="n">val</span> <span class="n">indicatorDescriptions</span>
<span class="o">=</span> <span class="n">actions</span><span class="p">.</span><span
class="n">map</span><span class="p">(</span><span class="n">a</span>
<span class="o">=&gt;</span> <span class="n">a</span><span
class="p">.</span><span class="n">_1</span><span class="p">).</span><span
class="n">zip</span><span class="p">(</span><span class="n">indicatorMatrices</span><span
class="p">)</span>
+</pre></div>
+
+
+<p>writeIndicators(indicatorDescriptions)</p>
 <p>The <code>writeIndicators</code> method uses the default write function
<code>dfsWrite</code>.</p>
-<p>```
-/*<em>
- * Write indicatorMatrices to the output dir in the default format
- </em>/
-def writeIndicators( indicators: Array[(String, IndexedDataset)]) = {
-  for (indicator &lt;- indicators ) {
-    val indicatorDir = OutputPath + indicator._1
-    indicator._2.dfsWrite(
-      indicatorDir, // do we have to remove the last $ char?
-      // omit LLR strengths and format for search engine indexing
-      IndexedDatasetWriteBooleanSchema) 
-  }
-}</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="o">/**</span>
+ <span class="o">*</span> Write indicatorMatrices to the output dir in the default
format
+ <span class="o">*/</span>
+def writeIndicators<span class="p">(</span> indicators: Array<span class="p">[(</span>String<span
class="p">,</span> IndexedDataset<span class="p">)])</span> <span
class="o">=</span> <span class="p">{</span>
+  <span class="kr">for</span> <span class="p">(</span>indicator <span
class="o">&lt;-</span> indicators <span class="p">)</span> <span
class="p">{</span>
+    val indicatorDir <span class="o">=</span> OutputPath <span class="o">+</span>
indicator._1
+    indicator._2.dfsWrite<span class="p">(</span>
+      indicatorDir<span class="p">,</span> <span class="o">//</span>
do we have to remove the last <span class="p">$</span> char?
+      <span class="o">//</span> omit LLR strengths and format for search engine
indexing
+      IndexedDatasetWriteBooleanSchema<span class="p">)</span> 
+  <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
 <p>See the Github project for the full source. Now we create a build.sbt to build the
example. </p>
-<p>```
-name := "cooccurrence-driver"</p>
-<p>organization := "com.finderbots"</p>
-<p>version := "0.1"</p>
-<p>scalaVersion := "2.10.4"</p>
-<p>val sparkVersion = "1.1.1"</p>
-<p>libraryDependencies ++= Seq(
-  "log4j" % "log4j" % "1.2.17",
-  // Mahout's Spark code
-  "commons-io" % "commons-io" % "2.4",
-  "org.apache.mahout" % "mahout-math-scala_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-spark_2.10" % "0.10.0",
-  "org.apache.mahout" % "mahout-math" % "0.10.0",
-  "org.apache.mahout" % "mahout-hdfs" % "0.10.0",
-  // Google collections, AKA Guava
-  "com.google.guava" % "guava" % "16.0")</p>
-<p>resolvers += "typesafe repo" at " http://repo.typesafe.com/typesafe/releases/"</p>
-<p>resolvers += Resolver.mavenLocal</p>
-<p>packSettings</p>
-<p>packMain := Map(
-  "cooc" -&gt; "CooccurrenceDriver"
-)</p>
-<p>```</p>
+<div class="codehilite"><pre><span class="n">name</span> <span
class="p">:=</span> &quot;<span class="n">cooccurrence</span><span
class="o">-</span><span class="n">driver</span>&quot;
+
+<span class="n">organization</span> <span class="p">:=</span> &quot;<span
class="n">com</span><span class="p">.</span><span class="n">finderbots</span>&quot;
+
+<span class="n">version</span> <span class="p">:=</span> &quot;0<span
class="p">.</span>1&quot;
+
+<span class="n">scalaVersion</span> <span class="p">:=</span> &quot;2<span
class="p">.</span>10<span class="p">.</span>4&quot;
+
+<span class="n">val</span> <span class="n">sparkVersion</span> <span
class="p">=</span> &quot;1<span class="p">.</span>1<span class="p">.</span>1&quot;
+
+<span class="n">libraryDependencies</span> <span class="o">++</span><span
class="p">=</span> <span class="n">Seq</span><span class="p">(</span>
+  &quot;<span class="n">log4j</span>&quot; <span class="c">% &quot;log4j&quot;
% &quot;1.2.17&quot;,</span>
+  <span class="o">//</span> <span class="n">Mahout</span><span
class="o">&#39;</span><span class="n">s</span> <span class="n">Spark</span>
<span class="n">code</span>
+  &quot;<span class="n">commons</span><span class="o">-</span><span
class="n">io</span>&quot; <span class="c">% &quot;commons-io&quot;
% &quot;2.4&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span>&quot;
<span class="c">% &quot;mahout-math-scala_2.10&quot; % &quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span>&quot;
<span class="c">% &quot;mahout-spark_2.10&quot; % &quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span>&quot;
<span class="c">% &quot;mahout-math&quot; % &quot;0.10.0&quot;,</span>
+  &quot;<span class="n">org</span><span class="p">.</span><span
class="n">apache</span><span class="p">.</span><span class="n">mahout</span>&quot;
<span class="c">% &quot;mahout-hdfs&quot; % &quot;0.10.0&quot;,</span>
+  <span class="o">//</span> <span class="n">Google</span> <span
class="n">collections</span><span class="p">,</span> <span class="n">AKA</span>
<span class="n">Guava</span>
+  &quot;<span class="n">com</span><span class="p">.</span><span
class="n">google</span><span class="p">.</span><span class="n">guava</span>&quot;
<span class="c">% &quot;guava&quot; % &quot;16.0&quot;)</span>
+
+<span class="n">resolvers</span> <span class="o">+</span><span
class="p">=</span> &quot;<span class="n">typesafe</span> <span
class="n">repo</span>&quot; <span class="n">at</span> &quot;
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span
class="n">repo</span><span class="p">.</span><span class="n">typesafe</span><span
class="p">.</span><span class="n">com</span><span class="o">/</span><span
class="n">typesafe</span><span class="o">/</span><span class="n">releases</span><span
class="o">/</span>&quot;
+
+<span class="n">resolvers</span> <span class="o">+</span><span
class="p">=</span> <span class="n">Resolver</span><span class="p">.</span><span
class="n">mavenLocal</span>
+
+<span class="n">packSettings</span>
+
+<span class="n">packMain</span> <span class="p">:=</span> <span
class="n">Map</span><span class="p">(</span>
+  &quot;<span class="n">cooc</span>&quot; <span class="o">-&gt;</span>
&quot;<span class="n">CooccurrenceDriver</span>&quot;<span class="p">)</span>
+</pre></div>
+
+
 <h2 id="build">Build</h2>
 <p>Building the examples from project's root folder:</p>
 <div class="codehilite"><pre>$ <span class="n">sbt</span> <span
class="n">pack</span>