mahout-commits mailing list archives

From p..@apache.org
Subject svn commit: r1629066 - /mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
Date Thu, 02 Oct 2014 20:47:28 GMT
Author: pat
Date: Thu Oct  2 20:47:28 2014
New Revision: 1629066

URL: http://svn.apache.org/r1629066
Log:
CMS commit to mahout by pat

Modified:
    mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext?rev=1629066&r1=1629065&r2=1629066&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/recommender/intro-cooccurrence-spark.mdtext Thu Oct  2 20:47:28 2014
@@ -290,11 +290,18 @@ See RowSimilarityDriver.scala in Mahout'
 
 #3. Using *spark-rowsimilarity* with Text Data
 
-Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance given the content of a blog post, which other posts are similar. In this case the columns are terms and the rows are documents. Since LLR is the only similarity method supported this is not the optimal way to determine document similarity. LLR is used more as a quality of similarity filter than as a similarity measure. However *spark-rowsimilarity* will produce lists of similar docs for every doc. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents.
+Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance,
+given the tags associated with a blog post, find which other posts have similar tags. In this case
+the columns are tags and the rows are posts. Since LLR is the only similarity method supported,
+this is not the optimal way to determine general "bag-of-words" document similarity. LLR is used
+more as a quality filter than as a similarity measure. However *spark-rowsimilarity* will produce
+lists of similar docs for every doc if the input is docs with lists of terms. The Apache
+[Lucene](http://lucene.apache.org) project provides several methods of
+[analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description)
+documents.
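+
+As a minimal sketch, here is one hypothetical way to prepare such input in Python. The layout
+(a row ID, a tab, then a space-delimited list of tags) and the file name are illustrative
+assumptions; use whatever input schema you have configured for *spark-rowsimilarity*.
+
+    # Hypothetical example: write one text row per post listing its tags,
+    # as input for spark-rowsimilarity (the layout is an assumption).
+    posts = {
+        "post1": ["mahout", "spark", "recommender"],
+        "post2": ["spark", "scala"],
+        "post3": ["recommender", "search"],
+    }
+
+    with open("post-tags.tsv", "w") as f:
+        for post_id, tags in posts.items():
+            # rowID, tab, space-delimited list of tags
+            f.write(post_id + "\t" + " ".join(tags) + "\n")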
 
 #<a name="unified-recommender">4. Creating a Unified Recommender</a>
 
-Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrnce and content based recommender that can be used in both or either mode depending on indicators available and the history available at runtime for a user.
+Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified
+cooccurrence and content-based recommender that can be used in either or both modes, depending on
+the indicators available and the user history available at runtime.
 
 ##Requirements
 
@@ -303,6 +310,26 @@ Using the output of *spark-itemsimilarit
 3. Spark, the correct version for your version of Mahout and Hadoop
 4. A search engine like Solr or Elasticsearch
 
+##Indicators
+
+Indicators come in 3 types:
+
+1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
+2. **Content**: calculated from item metadata or content using *spark-rowsimilarity*
+3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item.
+
+The query for recommendations will be a mix of values, each meant to match one of your indicators.
+The query can be constructed from user history and values derived from context (the category being
+viewed, for instance) or special precalculated data (popularity rank, for instance). This blending
+of indicators allows for creating many flavors of recommendations to fit a very wide variety of
+circumstances. It allows recommendations to be made for items with no usage data, and even allows
+for gracefully degrading recommendations based on how much user history is available.
+
+With the right mix of indicators, developers can construct a single query that works for completely
+new items and new users while working well for items with lots of interactions and users with many
+recorded actions. In other words, adding in content and intrinsic indicators allows developers to
+create a solution to the "cold-start" problem that gracefully improves with more user history and
+as items gain more interactions. It is also possible to create a completely content-based
+recommender that personalizes recommendations.
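+
+As a rough sketch of this blending (the field names and the Elasticsearch-style query shape
+below are illustrative assumptions, not something the jobs above emit directly):
+
+    # Build an OR'd query from whatever user history exists. With no
+    # history at all, only the intrinsic popularity clause remains, so
+    # the query still returns (non-personalized) recommendations.
+    def build_query(purchases, views, tags):
+        should = []
+        if purchases:
+            should.append({"match": {"purchase": " ".join(purchases)}})
+        if views:
+            should.append({"match": {"view": " ".join(views)}})
+        if tags:
+            should.append({"match": {"tags": " ".join(tags)}})
+        should.append({"match": {"popularity": "hot"}})  # intrinsic fallback
+        return {"query": {"bool": {"should": should}}}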
+
 ##Example with 3 Indicators
 
 You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and this is most easily done by using text files as described above. The data that is processed by these jobs is considered the **training data**. You will need some amount of user history in your recs query. It is typical to use the most recent user history but need not be exactly what is in the training set, which may include more historical data. Keeping the user history for query purposes could be done with a database by referencing some history from a users table. In the example above the two collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other descriptive metadata).
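+
+A hypothetical snippet of such comma-delimited training data, following the example above:
+
+    u1,purchase,iphone
+    u1,purchase,ipad
+    u2,purchase,nexus
+    u1,view,iphone
+    u2,view,iphone
+
+The tags indicator would come from a separate file of items and their tags, processed with
+*spark-rowsimilarity* as described in the previous section.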
@@ -349,7 +376,10 @@ We now have three indicators, two collab
 
 ##Unified Recommender Query
 
-The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. For a given user, map their history of an action or content to the correct indicator field and preform and OR'd the query. This will allow matches from any indicator where AND queries require that an item have some similarity to all indicator fields.
+The actual form of the query for recommendations will vary depending on your search engine but the
+intent is the same. For a given user, map their history of an action or content to the correct
+indicator field and perform an OR'd query. This will allow matches from any indicator, whereas AND
+queries would require that an item have some similarity to all indicator fields.
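+
+For example, in Lucene/Solr-style syntax (field and item names here are hypothetical), an OR'd
+query across two indicator fields might look like:
+
+    purchase:(iphone ipad) OR view:(iphone ipad galaxy)
+
+while the AND'd form, purchase:(iphone ipad) AND view:(iphone ipad galaxy), would return only
+items similar to the user's history in both fields.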
 
 We have 3 indicators, these are indexed by the search engine into 3 fields, we'll call them "purchase", "view", and "tags". We take the user's history that corresponds to each indicator and create a query of the form:
 
@@ -360,7 +390,19 @@ We have 3 indicators, these are indexed 
       
 The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to the ones the user has already purchased.
 
-This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator using whatever method you want and index that as a new indicator field and include the corresponding value in a query on the popularity field.
+This is only an example and not necessarily the optimal way to create recs. It illustrates how
+business decisions can be translated into recommendations. This technique can be used to skew
+recommendations towards intrinsic indicators also. For instance, you may want to put personalized
+popular item recs in a special place in the UI. Create a popularity indicator by tagging items with
+some category of popularity (hot, warm, cold for instance), then index that as a new indicator
+field and include the corresponding value in a query on the popularity field. If we use the ecom
+example but tailor the query to get "hot" recommendations, it might look like this:
+
+   Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's-view-history
+      field: popularity; q:"hot"
+
+This will return recommendations favoring ones that have the intrinsic indicator "hot".
 
 ##Notes
 1. Use as much user action history as you can gather. Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-indicators. Using more data in this fashion will almost always produce better recommendations.


