spark-itemsimilarity Mahout 1.0
Usage: spark-itemsimilarity [options]

Input, output options
  -i <value> | --input <value>
        Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
  -i2 <value> | --input2 <value>
        Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty.
  -o <value> | --output <value>
        Path for output, any local or HDFS supported URI (required)

Algorithm control options:
  -mppu <value> | --maxPrefs <value>
        Max number of preferences to consider per user (optional). Default: 500
  -m <value> | --maxSimilaritiesPerItem <value>
        Limit the number of similarities per item to this number (optional). Default: 100

Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.

Input text file schema options:
  -id <value> | --inDelim <value>
        Input delimiter character (optional). Default: "[,\t]"
  -f1 <value> | --filter1 <value>
        String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used
  -f2 <value> | --filter2 <value>
        String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected
  -rc <value> | --rowIDColumn <value>
        Column number (0 based Int) containing the row ID string (optional). Default: 0
  -ic <value> | --itemIDColumn <value>
        Column number (0 based Int) containing the item ID string (optional). Default: 1
  -fc <value> | --filterColumn <value>
        Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter

Using all defaults the input is expected of the form: "userID<tab>itemID" or "userID<tab>itemID<tab>any-text..." and all rows will be used

File discovery options:
  -r | --recursive
        Search the -i path recursively for files that match --filenamePattern (optional). Default: false
  -fp <value> | --filenamePattern <value>
        Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory

Output text file schema options:
  -rd <value> | --rowKeyDelim <value>
        Separates the rowID key from the vector values list (optional). Default: "\t"
  -cd <value> | --columnIdStrengthDelim <value>
        Separates column IDs from their values in the vector values list (optional). Default: ":"
  -td <value> | --elementDelim <value>
        Separates vector element values in the values list (optional). Default: " "
  -os | --omitStrength
        Do not write the strength to the output files (optional). Default: false.
        This option is used to output indexable data for creating a search engine recommender.

Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."

Spark config options:
  -ma <value> | --master <value>
        Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
  -sem <value> | --sparkExecutorMem <value>
        Max Java heap available as "executor memory" on each node (optional). Default: 4g

General config options:
  -rs <value> | --randomSeed <value>
  -h | --help

spark-rowsimilarity is the companion to spark-itemsimilarity; the primary difference is that it takes a text file version of a DRM with optional application-specific IDs. The input is in text-delimited form using three delimiters. By default it reads (rowID
columnID1:strength1 columnID2:strength2...). Since this job only supports LLR similarity, which does not use the input strengths, they may be omitted in the input. It writes (columnID columnID1:strength1 columnID2:strength2...). The output is sorted by strength, descending, and can be interpreted as a column ID from the primary input followed by a list of the most similar columns. For a discussion of the output layout and formatting see spark-itemsimilarity.

One significant output option is --omitStrength. This allows output of the form (columnID columnID2 columnID10...). This is a tab-delimited file containing a columnID token followed by a space-delimited string of tokens. It can be directly indexed by search engines to create an item-based recommender.

The command line interface is:
spark-rowsimilarity Mahout 1.0
Usage: spark-rowsimilarity [options]

Input, output options
  -i <value> | --input <value>
        Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
  -o <value> | --output <value>
        Path for output, any local or HDFS supported URI (required)

Algorithm control options:
  -mo <value> | --maxObservations <value>
        Max number of observations to consider per row (optional). Default: 500
  -m <value> | --maxSimilaritiesPerRow <value>
        Limit the number of similarities per row to this number (optional). Default: 100

Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.

Output text file schema options:
  -rd <value> | --rowKeyDelim <value>
        Separates the rowID key from the vector values list (optional). Default: "\t"
  -cd <value> | --columnIdStrengthDelim <value>
        Separates column IDs from their values in the vector values list (optional). Default: ":"
  -td <value> | --elementDelim <value>
        Separates vector element values in the values list (optional). Default: " "
  -os | --omitStrength
        Do not write the strength to the output files (optional). Default: false.
        This option is used to output indexable data for creating a search engine recommender.

Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."

File discovery options:
  -r | --recursive
        Search the -i path recursively for files that match --filenamePattern (optional). Default: false
  -fp <value> | --filenamePattern <value>
        Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory

Spark config options:
  -ma <value> | --master <value>
        Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
  -sem <value> | --sparkExecutorMem <value>
        Max Java heap available as "executor memory" on each node (optional). Default: 4g

General config options:
  -rs <value> | --randomSeed <value>
  -h | --help
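Both jobs emit the same text schema, so their output is straightforward to consume downstream. As a minimal parsing sketch in Python (assuming the default delimiters documented above: tab for --rowKeyDelim, ":" for --columnIdStrengthDelim, and a space for --elementDelim; the item IDs in the example are made up for illustration):

```python
def parse_similarity_line(line, row_key_delim="\t", strength_delim=":", element_delim=" "):
    """Parse one output line of spark-itemsimilarity / spark-rowsimilarity.

    Returns (item_id, {similar_item_id: llr_strength, ...}).
    With --omitStrength no strengths are written, so values become None.
    """
    item_id, _, values = line.rstrip("\n").partition(row_key_delim)
    similar = {}
    for element in values.split(element_delim):
        if not element:
            continue
        other, delim, strength = element.partition(strength_delim)
        # delim is empty when no strength delimiter was found (--omitStrength output)
        similar[other] = float(strength) if delim else None
    return item_id, similar

# Hypothetical line using the default delimiters:
item, sims = parse_similarity_line("iphone\tipad:27.2 nexus:21.9")
```

Note that lines produced with --omitStrength (the indexable form) parse the same way; the strengths simply come back as None.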