Subject: svn commit: r921243 - in /websites/staging/mahout/trunk/content: ./ users/recommender/intro-cooccurrence-spark.html
Date: Thu, 04 Sep 2014 14:56:43 -0000
To: commits@mahout.apache.org
Reply-To: dev@mahout.apache.org
From: buildbot@apache.org
Message-Id: <20140904145643.43A7323889D5@eris.apache.org>

Author: buildbot
Date: Thu Sep 4 14:56:42 2014
New Revision: 921243

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Sep  4 14:56:42 2014
@@ -1 +1 @@
-1621598
+1622492

Modified: websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html
==============================================================================
--- websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html (original)
+++ websites/staging/mahout/trunk/content/users/recommender/intro-cooccurrence-spark.html Thu Sep  4 14:56:42 2014
@@ -261,16 +261,15 @@ creating recommendations or similar item
 For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want to recommend.
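As the usage text in this diff notes, the only similarity measure these jobs support is the Log Likelihood Ratio (LLR). For readers who want the intuition, here is a minimal sketch of Dunning's LLR test for a 2x2 cooccurrence table (an illustration only, not Mahout's actual implementation):

```python
from math import log

def x_log_x(x):
    # Convention: 0 * log(0) == 0
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy: N*log(N) - sum(x*log(x))
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 cooccurrence table:
    k11 = both events together, k12/k21 = one event without the other,
    k22 = neither event. Larger values mean stronger association."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)
```

Items that always cooccur score high; items whose cooccurrence matches what independence would predict score near zero, which is why the input strengths can be ignored.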

-spark-itemsimilarity Mahout 1.0-SNAPSHOT
+spark-itemsimilarity Mahout 1.0
 Usage: spark-itemsimilarity [options]
 
 Input, output options
   -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list of 
-        HDFS supported URIs (required)
+        Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
   -i2 <value> | --input2 <value>
-        Secondary input path for cross-similarity calculation, same restrictions 
-        as "--input" (optional). Default: empty.
+        Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty.
   -o <value> | --output <value>
         Path for output, any local or HDFS supported URI (required)
 
@@ -278,8 +277,7 @@ to recommend.   

  -mppu <value> | --maxPrefs <value>
        Max number of preferences to consider per user (optional). Default: 500
  -m <value> | --maxSimilaritiesPerItem <value>
-        Limit the number of similarities per item to this number (optional).
-        Default: 100
+        Limit the number of similarities per item to this number (optional). Default: 100

 Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.

@@ -287,56 +285,42 @@ to recommend.

  -id <value> | --inDelim <value>
        Input delimiter character (optional). Default: "[,\t]"
  -f1 <value> | --filter1 <value>
-        String (or regex) whose presence indicates a datum for the primary item
-        set (optional). Default: no filter, all data is used
+        String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used
  -f2 <value> | --filter2 <value>
-        String (or regex) whose presence indicates a datum for the secondary item
-        set (optional). If not present no secondary dataset is collected
-  -rc <value> | --rowIDPosition <value>
-        Column number (0 based Int) containing the row ID string (optional).
-        Default: 0
-  -ic <value> | --itemIDPosition <value>
-        Column number (0 based Int) containing the item ID string (optional).
-        Default: 1
-  -fc <value> | --filterPosition <value>
-        Column number (0 based Int) containing the filter string (optional).
-        Default: -1 for no filter
+        String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected
+  -rc <value> | --rowIDColumn <value>
+        Column number (0 based Int) containing the row ID string (optional). Default: 0
+  -ic <value> | --itemIDColumn <value>
+        Column number (0 based Int) containing the item ID string (optional). Default: 1
+  -fc <value> | --filterColumn <value>
+        Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter

 Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used

 File discovery options:
   -r | --recursive
-        Searched the -i path recursively for files that match --filenamePattern
-        (optional), default: false
+        Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
   -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: filename
-        in the --input option or "^part-.*" if --input is a directory
+        Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory

 Output text file schema options:
   -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). Default:
-        "\t"
+        Separates the rowID key from the vector values list (optional). Default: "\t"
   -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list (optional).
-        Default: ":"
+        Separates column IDs from their values in the vector values list (optional). Default: ":"
   -td <value> | --elementDelim <value>
        Separates vector element values in the values list (optional). Default: " "
   -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: false.
-        This option is used to output indexable data for creating a search engine
-        recommender.
+        Do not write the strength to the output files (optional), Default: false.
+This option is used to output indexable data for creating a search engine recommender.

 Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."

 Spark config options:
   -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can specify
-        the number of cores to get a performance improvement, for example "local[4]"
+        Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
   -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional).
-        Default: 4g
-
-General config options:
+        Max Java heap available as "executor memory" on each node (optional). Default: 4g
   -rs <value> | --randomSeed <value>

   -h | --help
@@ -472,61 +456,48 @@ to recommend.

spark-rowsimilarity is the companion to spark-itemsimilarity; the primary difference is that it takes a text-file version of a DRM with optional application-specific IDs. The input is in text-delimited form using three delimiters. By default it reads (rowID<tab>columnID1:strength1<space>columnID2:strength2...). Since this job only supports LLR similarity, which does not use the input strengths, they may be omitted in the input. It writes (columnID<tab>columnID1:strength1<space>columnID2:strength2...). The output is sorted by strength, descending. The output can be interpreted as a column ID from the primary input followed by a list of the most similar columns. For a discussion of the output layout and formatting see spark-itemsimilarity.

One significant output option is --omitStrength. This allows output of the form (columnID<tab>columnID2<space>columnID2...), a tab-delimited file containing a columnID token followed by a space-delimited string of tokens. It can be directly indexed by search engines to create an item-based recommender.
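To make the two output schemas concrete, here is a small parsing sketch (a hypothetical helper, not part of Mahout) that handles the default delimiters both with and without strengths:

```python
def parse_similarity_line(line):
    """Parse one output line using the default delimiters:
      "itemID1<tab>itemID2:value2<space>itemID10:value10..."  (default)
      "itemID1<tab>itemID2<space>itemID10..."                 (--omitStrength)
    Returns (itemID, {similarItemID: strength or None}).
    """
    row_id, _, rest = line.rstrip("\n").partition("\t")
    similar = {}
    for token in rest.split(" ") if rest else []:
        item, sep, strength = token.rpartition(":")
        if sep:  # default form: "itemID:strength"
            similar[item] = float(strength)
        else:    # --omitStrength form: bare "itemID"
            similar[token] = None
    return row_id, similar
```

Note that rpartition keeps this working even if item IDs themselves contain no colons but strengths are absent; with custom --columnIdStrengthDelim or --elementDelim settings the split characters would of course change.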

The command line interface is:

-spark-rowsimilarity Mahout 1.0-SNAPSHOT
+spark-rowsimilarity Mahout 1.0
 Usage: spark-rowsimilarity [options]
 
 Input, output options
   -i <value> | --input <value>
-        Input path, may be a filename, directory name, or comma delimited list 
-        of HDFS supported URIs (required)
- -o <value> | --output <value>
+        Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
+  -o <value> | --output <value>
         Path for output, any local or HDFS supported URI (required)
 
 Algorithm control options:
   -mo <value> | --maxObservations <value>
         Max number of observations to consider per row (optional). Default: 500
   -m <value> | --maxSimilaritiesPerRow <value>
-        Limit the number of similarities per item to this number (optional). 
-        Default: 100
+        Limit the number of similarities per item to this number (optional). Default: 100
 
 Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
 
 Output text file schema options:
   -rd <value> | --rowKeyDelim <value>
-        Separates the rowID key from the vector values list (optional). 
-        Default: "\t"
+        Separates the rowID key from the vector values list (optional). Default: "\t"
   -cd <value> | --columnIdStrengthDelim <value>
-        Separates column IDs from their values in the vector values list 
-        (optional). Default: ":"
+        Separates column IDs from their values in the vector values list (optional). Default: ":"
   -td <value> | --elementDelim <value>
-        Separates vector element values in the values list (optional). 
-        Default: " "
+        Separates vector element values in the values list (optional). Default: " "
   -os | --omitStrength
-        Do not write the strength to the output files (optional), Default: 
-        false.
-This option is used to output indexable data for creating a search engine 
-recommender.
+        Do not write the strength to the output files (optional), Default: false.
+This option is used to output indexable data for creating a search engine recommender.
 
 Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
 
 File discovery options:
   -r | --recursive
-        Searched the -i path recursively for files that match 
-        --filenamePattern (optional), Default: false
+        Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
   -fp <value> | --filenamePattern <value>
-        Regex to match in determining input files (optional). Default: 
-        filename in the --input option or "^part-.*" if --input is a directory
+        Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory
 
 Spark config options:
   -ma <value> | --master <value>
-        Spark Master URL (optional). Default: "local". Note that you can 
-        specify the number of cores to get a performance improvement, for 
-        example "local[4]"
+        Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
   -sem <value> | --sparkExecutorMem <value>
-        Max Java heap available as "executor memory" on each node (optional). 
-        Default: 4g
-
-General config options:
+        Max Java heap available as "executor memory" on each node (optional). Default: 4g
   -rs <value> | --randomSeed <value>
 
   -h | --help
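Putting the options above together, an invocation might look like the following. This is illustrative only: it assumes the mahout driver script is on the PATH, and the HDFS paths are placeholders for your own data.

```shell
# Read a text-delimited DRM from HDFS, write the top 50 similar columns per
# row without strengths (indexable output), on a local 4-core Spark master.
mahout spark-rowsimilarity \
  --input hdfs://localhost:9000/mahout/drm-input \
  --output hdfs://localhost:9000/mahout/row-sims \
  --maxSimilaritiesPerRow 50 \
  --master "local[4]" \
  --omitStrength
```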