Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 362F0200D61 for ; Tue, 5 Dec 2017 03:54:10 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 3007D160C1D; Tue, 5 Dec 2017 02:54:10 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id D0A91160C05 for ; Tue, 5 Dec 2017 03:54:07 +0100 (CET) Received: (qmail 73368 invoked by uid 500); 5 Dec 2017 02:54:03 -0000 Mailing-List: contact commits-help@mahout.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mahout.apache.org Delivered-To: mailing list commits@mahout.apache.org Received: (qmail 71094 invoked by uid 99); 5 Dec 2017 02:53:59 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 05 Dec 2017 02:53:59 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 46103F6135; Tue, 5 Dec 2017 02:53:57 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit From: rawkintrevo@apache.org To: commits@mahout.apache.org Date: Tue, 05 Dec 2017 02:54:17 -0000 Message-Id: <7e9873c49290441bab4334bd359c939b@git.apache.org> In-Reply-To: References: X-Mailer: ASF-Git Admin Mailer Subject: [21/52] [partial] mahout git commit: MAHOUT-1981 Merged site updates, fixed navbars, Mathjax archived-at: Tue, 05 Dec 2017 02:54:10 -0000 http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md ---------------------------------------------------------------------- diff --git 
a/website-old/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md b/website-old/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md deleted file mode 100644 index 9561843..0000000 --- a/website-old/docs/tutorials/map-reduce/recommender/intro-cooccurrence-spark.md +++ /dev/null @@ -1,437 +0,0 @@ ---- -layout: default -title: (Deprecated) Perceptron and Winnow -theme: - name: retro-mahout ---- - -#Intro to Cooccurrence Recommenders with Spark - -Mahout provides several important building blocks for creating recommendations using Spark. *spark-itemsimilarity* can -be used to create "other people also liked these things" type recommendations and, paired with a search engine, can -personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content-based -recommendations and, when paired with a search engine, can be used to personalize content-based recommendations. - -##References - -1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning) -2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/) -3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/) -and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/) -4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations at O(n) complexity. - -Below are the command line jobs, but the drivers and associated code can also be customized and accessed from the Scala APIs. - -##1. 
spark-itemsimilarity -*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have userID, itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user's. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version, there were several possible similarity measures but these are being deprecated in favor of LLR because in practice it performs the best. - -Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to -Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix. - -*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will -account for multi-modal interactions and create indicator matrices allowing the use of much more data in -creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. -For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's -cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want -to recommend. - - - spark-itemsimilarity Mahout 1.0 - Usage: spark-itemsimilarity [options] - - Input, output options - -i | --input - Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) - -i2 | --input2 - Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty. 
- -o | --output - Path for output, any local or HDFS supported URI (required) - - Algorithm control options: - -mppu | --maxPrefs - Max number of preferences to consider per user (optional). Default: 500 - -m | --maxSimilaritiesPerItem - Limit the number of similarities per item to this number (optional). Default: 100 - - Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. - - Input text file schema options: - -id | --inDelim - Input delimiter character (optional). Default: "[,\t]" - -f1 | --filter1 - String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used - -f2 | --filter2 - String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected - -rc | --rowIDColumn - Column number (0 based Int) containing the row ID string (optional). Default: 0 - -ic | --itemIDColumn - Column number (0 based Int) containing the item ID string (optional). Default: 1 - -fc | --filterColumn - Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter - - Using all defaults the input is expected of the form: "userIDitemId" or "userIDitemIDany-text..." and all rows will be used - - File discovery options: - -r | --recursive - Searched the -i path recursively for files that match --filenamePattern (optional), Default: false - -fp | --filenamePattern - Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory - - Output text file schema options: - -rd | --rowKeyDelim - Separates the rowID key from the vector values list (optional). Default: "\t" - -cd | --columnIdStrengthDelim - Separates column IDs from their values in the vector values list (optional). Default: ":" - -td | --elementDelim - Separates vector element values in the values list (optional). 
Default: " " - -os | --omitStrength - Do not write the strength to the output files (optional), Default: false. - This option is used to output indexable data for creating a search engine recommender. - - Default delimiters will produce output of the form: "itemID1itemID2:value2itemID10:value10..." - - Spark config options: - -ma | --master - Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" - -sem | --sparkExecutorMem - Max Java heap available as "executor memory" on each node (optional). Default: 4g - -rs | --randomSeed - - -h | --help - prints this usage text - -This looks daunting but defaults to simple fairly sane values to take exactly the same input as legacy code and is pretty flexible. It allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or filename. The schema for the file is defined by column numbers, which map to the important bits of data including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations. - -See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. - -###Defaults in the _**spark-itemsimilarity**_ CLI - -If all defaults are used the input can be as simple as: - - userID1,itemID1 - userID2,itemID2 - ... - -With the command line: - - - bash$ mahout spark-itemsimilarity --input in-file --output out-dir - - -This will use the "local" Spark context and will output the standard text version of a DRM - - itemID1itemID2:value2itemID10:value10... - -###How To Use Multiple User Actions - -Often we record various actions the user takes for later analytics. These can now be used to make recommendations. -The idea of a recommender is to recommend the action you want the user to make. 
For an ecom app this might be -a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. -For instance a view of an item does not indicate the same intent as a purchase and if you just mixed the two together you -might even make worse recommendations. It is tempting though since there are so many more views than purchases. With *spark-itemsimilarity* -we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases. -We do this by treating the primary action (purchase) as data for the indicator matrix and using the secondary action (view) -to calculate the cross-cooccurrence indicator matrix. - -*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines. For a mixed -action log of the form: - - u1,purchase,iphone - u1,purchase,ipad - u2,purchase,nexus - u2,purchase,galaxy - u3,purchase,surface - u4,purchase,iphone - u4,purchase,galaxy - u1,view,iphone - u1,view,ipad - u1,view,nexus - u1,view,galaxy - u2,view,iphone - u2,view,ipad - u2,view,nexus - u2,view,galaxy - u3,view,surface - u3,view,nexus - u4,view,iphone - u4,view,ipad - u4,view,galaxy - -###Command Line - - -Use the following options: - - bash$ mahout spark-itemsimilarity \ - --input in-file \ # where to look for data - --output out-path \ # root dir for output - --master masterUrl \ # URL of the Spark master server - --filter1 purchase \ # word that flags input for the primary action - --filter2 view \ # word that flags input for the secondary action - --itemIDColumn 2 \ # column that has the item ID - --rowIDColumn 0 \ # column that has the user ID - --filterColumn 1 # column that has the filter word - - - -###Output - -The output of the job will be the standard text version of two Mahout DRMs. 
This is a case where we are calculating -cross-cooccurrence so a primary indicator matrix and cross-cooccurrence indicator matrix will be created - - out-path - |-- similarity-matrix - TDF part files - \-- cross-similarity-matrix - TDF part-files - -The indicator matrix will contain the lines: - - galaxy\tnexus:1.7260924347106847 - ipad\tiphone:1.7260924347106847 - nexus\tgalaxy:1.7260924347106847 - iphone\tipad:1.7260924347106847 - surface - -The cross-cooccurrence indicator matrix will contain: - - iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 - ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 - nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897 - galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847 - surface\tsurface:4.498681156950466 nexus:0.6795961471815897 - -**Note:** You can run this multiple times to use more than two actions or you can use the underlying -SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators. - -###Log File Input - -A common method of storing data is in log files. If they are written using some delimiter they can be consumed directly by spark-itemsimilarity. 
For instance input of the form: - - 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone - 2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad - 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus - 2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy - 2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface - 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone - 2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy - 2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone - 2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad - 2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus - 2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy - 2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone - 2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad - 2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus - 2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy - 2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface - 2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus - 2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone - 2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad - 2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy - -can be parsed with the following CLI and run on the cluster, producing the same output as the above example. - - bash$ mahout spark-itemsimilarity \ - --input in-file \ - --output out-path \ - --master spark://sparkmaster:4044 \ - --filter1 purchase \ - --filter2 view \ - --inDelim "\t" \ - --itemIDColumn 4 \ - --rowIDColumn 1 \ - --filterColumn 2 - -##2. spark-rowsimilarity - -*spark-rowsimilarity* is the companion to *spark-itemsimilarity*. The primary difference is that it takes a text file version of -a matrix of sparse vectors with optional application specific IDs and it finds similar rows rather than items (columns). Its use is -not limited to collaborative filtering. The input is in text-delimited form where there are three delimiters used. 
By -default it reads (rowID<tab>columnID1:strength1<space>columnID2:strength2...). Since this job only supports LLR similarity, - which does not use the input strengths, they may be omitted in the input. It writes -(rowID<tab>rowID1:strength1<space>rowID2:strength2...). -The output is sorted by strength descending. The output can be interpreted as a row ID from the primary input followed -by a list of the most similar rows. - -The command line interface is: - - spark-rowsimilarity Mahout 1.0 - Usage: spark-rowsimilarity [options] - - Input, output options - -i | --input - Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required) - -o | --output - Path for output, any local or HDFS supported URI (required) - - Algorithm control options: - -mo | --maxObservations - Max number of observations to consider per row (optional). Default: 500 - -m | --maxSimilaritiesPerRow - Limit the number of similarities per item to this number (optional). Default: 100 - - Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure. - - Output text file schema options: - -rd | --rowKeyDelim - Separates the rowID key from the vector values list (optional). Default: "\t" - -cd | --columnIdStrengthDelim - Separates column IDs from their values in the vector values list (optional). Default: ":" - -td | --elementDelim - Separates vector element values in the values list (optional). Default: " " - -os | --omitStrength - Do not write the strength to the output files (optional), Default: false. - This option is used to output indexable data for creating a search engine recommender. - - Default delimiters will produce output of the form: "itemID1itemID2:value2itemID10:value10..." 
- - File discovery options: - -r | --recursive - Searched the -i path recursively for files that match --filenamePattern (optional), Default: false - -fp | --filenamePattern - Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory - - Spark config options: - -ma | --master - Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]" - -sem | --sparkExecutorMem - Max Java heap available as "executor memory" on each node (optional). Default: 4g - -rs | --randomSeed - - -h | --help - prints this usage text - -See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. - -#3. Using *spark-rowsimilarity* with Text Data - -Another use case for *spark-rowsimilarity* is in finding similar textual content. For instance given the tags associated with -a blog post, - which other posts have similar tags. In this case the columns are tags and the rows are posts. Since LLR is -the only similarity method supported this is not the optimal way to determine general "bag-of-words" document similarity. -LLR is used more as a quality filter than as a similarity measure. However *spark-rowsimilarity* will produce -lists of similar docs for every doc if input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents. - -#4. Creating a Unified Recommender - -Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a unified cooccurrence and content based - recommender that can be used in both or either mode depending on indicators available and the history available at -runtime for a user. - -##Requirements - -1. Mahout 0.10.0 or later -2. Hadoop -3. 
Spark, the correct version for your version of Mahout and Hadoop -4. A search engine like Solr or Elasticsearch - -##Indicators - -Indicators come in 3 types - -1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions -2. **Content**: calculated from item metadata or content using *spark-rowsimilarity* -3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item. - -The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed -from user history and values derived from context (category being viewed for instance) or special precalculated data -(popularity rank for instance). This blending of indicators allows for creating many flavors or recommendations to fit -a very wide variety of circumstances. - -With the right mix of indicators developers can construct a single query that works for completely new items and new users -while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic -indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history -and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes -recommendations. - -##Example with 3 Indicators - -You will need to decide how you store user action data so they can be processed by the item and row similarity jobs and -this is most easily done by using text files as described above. The data that is processed by these jobs is considered the -training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history -but need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user -history for query purposes could be done with a database by storing it in a users table. 
In the example above the two -collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other -descriptive metadata). - -We will need to create 1 cooccurrence indicator from the primary action (purchase) 1 cross-action cooccurrence indicator -from the secondary action (view) -and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and *spark-rowsimilarity* once. - -We have described how to create the collaborative filtering indicator and cross-cooccurrence indicator for purchase and view (the [How to use Multiple User -Actions](#multiple-actions) section) but tags will be a slightly different process. We want to use the fact that -certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator -but rather a "content" or "metadata" type indicator since you are not using other users' history, only the -individual that you are making recs for. This means that this method will make recommendations for items that have -no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one - has purchased or viewed them yet. In the final query we will mix all 3 indicators. - -##Content Indicator - -To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find -items with the most similar tags. Notice that other users' behavior is not considered--only other item's tags. This defines a -content or metadata indicator. They are used when you want to find items that are similar to other items by using their -content or metadata, not by which users interacted with them. - -For this we need input of the form: - - itemIDlist-of-tags - ... - -The full collection will look like the tags column from a catalog DB. 
For our ecom example it might be: - - 3459860b<tab>men long-sleeve chambray clothing casual - 9446577d<tab>women tops chambray clothing casual - ... - -We'll use *spark-rowsimilarity* because we are looking for similar rows, which encode items in this case. As with the -collaborative filtering indicator and cross-cooccurrence indicator we use the --omitStrength option. The strengths created are -probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling -is finished we no longer need the strengths. We will get an indicator matrix of the form: - - itemID<tab>list-of-item IDs - ... - -This is a content indicator since it has found other items with similar content or metadata. - - 3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a - 9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e - ... - -We now have three indicators, two collaborative filtering type and one content type. - -##Unified Recommender Query - -The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. -For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. - -We have 3 indicators. These are indexed by the search engine into 3 fields, which we'll call "purchase", "view", and "tags". -We take the user's history that corresponds to each indicator and create a query of the form: - - Query: - field: purchase; q:user's-purchase-history - field: view; q:user's view-history - field: tags; q:user's-tags-associated-with-purchases - -The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to -the ones the user has already purchased. - -This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be -translated into recommendations. 
This technique can be used to skew recommendations towards intrinsic indicators also. -For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator -by tagging items with some category of popularity (hot, warm, cold for instance) then -index that as a new indicator field and include the corresponding value in a query -on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this: - - Query: - field: purchase; q:user's-purchase-history - field: view; q:user's view-history - field: popularity; q:"hot" - -This will return recommendations favoring ones that have the intrinsic indicator "hot". - -##Notes -1. Use as much user action history as you can gather. Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations. -2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators. -3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators. -4. In the examples we have used space delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays. 
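The strengths in the indicator matrices above are log-likelihood ratio (LLR) scores. As a rough illustration only, the following Python sketch computes the score for one item pair from a 2x2 table of user counts, using the entropy-based formulation described in the "Surprise and Coincidence" post listed in the references. Python is used here purely for brevity; the function names (`x_log_x`, `entropy`, `llr`, `pair_counts`) are hypothetical and are not Mahout's API. The purchase data is taken from the mixed-action log example.

```python
import math


def x_log_x(x):
    """Return x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users who interacted with both items
    k12: users who interacted with item A only
    k21: users who interacted with item B only
    k22: users who interacted with neither item
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Rounding error can push the difference slightly below zero.
    return max(0.0, 2.0 * (row_entropy + col_entropy - matrix_entropy))


# Purchases from the example log, grouped by user.
purchases = {
    "u1": {"iphone", "ipad"},
    "u2": {"nexus", "galaxy"},
    "u3": {"surface"},
    "u4": {"iphone", "galaxy"},
}


def pair_counts(item_a, item_b, history):
    """Build the 2x2 table of user counts for two items."""
    k11 = sum(1 for items in history.values() if item_a in items and item_b in items)
    k12 = sum(1 for items in history.values() if item_a in items and item_b not in items)
    k21 = sum(1 for items in history.values() if item_a not in items and item_b in items)
    k22 = len(history) - k11 - k12 - k21
    return k11, k12, k21, k22


score = llr(*pair_counts("galaxy", "nexus", purchases))
print(round(score, 6))
```

For the galaxy/nexus pair the counts are (1, 1, 0, 2), and the resulting score is roughly 1.726, in line with the strength shown for that pair in the indicator matrix. The sketch ignores the downsampling controlled by `--maxPrefs` and `--maxSimilaritiesPerItem`.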
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md ---------------------------------------------------------------------- diff --git a/website-old/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md b/website-old/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md deleted file mode 100644 index 9537669..0000000 --- a/website-old/docs/tutorials/map-reduce/recommender/intro-itembased-hadoop.md +++ /dev/null @@ -1,54 +0,0 @@ ---- -layout: default -title: (Deprecated) Perceptron and Winnow -theme: - name: retro-mahout ---- -# Introduction to Item-Based Recommendations with Hadoop - -##Overview - -Mahout’s item based recommender is a flexible and easily implemented algorithm with a diverse range of applications. The minimalism of the primary input file’s structure and availability of ancillary filtering controls can make sourcing required data and shaping a desired output both efficient and straightforward. - -Typical use cases include: - -* Recommend products to customers via an eCommerce platform (think: Amazon, Netflix, Overstock) -* Identify organic sales opportunities -* Segment users/customers based on similar item preferences - -Broadly speaking, Mahout's item-based recommendation algorithm takes as input customer preferences by item and generates an output recommending similar items with a score indicating whether a customer will "like" the recommended item. - -One of the strengths of the item based recommender is its adaptability to your business conditions or research interests. For example, there are many available approaches for providing product preference. One such method is to calculate the total orders for a given product for each customer (i.e. Acme Corp has ordered Widget-A 5,678 times) while others rely on user preference captured via the web (i.e. Jane Doe rated a movie as five stars, or gave a product two thumbs’ up). 
- -Additionally, a variety of methodologies can be implemented to narrow the focus of Mahout's recommendations, such as: - -* Exclude low volume or low profitability products from consideration -* Group customers by segment or market rather than using user/customer level data -* Exclude zero-dollar transactions, returns or other order types -* Map product substitutions into the Mahout input (i.e. if WidgetA is a recommended item replace it with WidgetX) - -The item based recommender output can be easily consumed by downstream applications (i.e. websites, ERP systems or salesforce automation tools) and is configurable so users can determine the number of item recommendations generated by the algorithm. - -##Example - -Testing the item based recommender can be a simple and potentially quite rewarding endeavor. Whereas the typical sample use case for collaborative filtering focuses on utilization of, and integration with, eCommerce platforms we can instead look at a potential use case applicable to most businesses (even those without a web presence). Let’s look at how a company might use Mahout’s item based recommender to identify new sales opportunities for an existing customer base. First, you’ll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we’re ready to run a quick example. - -**Step 1: Gather some test data** - -Mahout’s item based recommender relies on three key pieces of data: *userID*, *itemID* and *preference*. The “users” could be website visitors or simply customers that purchase products from your business. Similarly, items could be products, product groups or even pages on your website – really anything you would want to recommend to a group of users or customers. For our example let’s use customer orders as a proxy for preference. 
A simple count of distinct orders by customer, by product will work for this example. You’ll find as you explore ways to manipulate the item based recommender the preference value can be many things (page clicks, explicit ratings, order counts, etc.). Once your test data is gathered put it in a *.txt* file separated by commas with no column headers included. - -**Step 2: Pick a similarity measure** - -Choosing a similarity measure for use in a production environment is something that requires careful testing, evaluation and research. For our example purposes, we’ll just go with a Mahout similarity classname called *SIMILARITY_LOGLIKELIHOOD*. - -**Step 3: Configure the Mahout command** - -Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly we’re ready to configure our syntax. Enter the following command: - - $ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25 - -Running the command will execute a series of jobs the final product of which will be an output file deposited to the directory specified in the command syntax. The output file will contain two columns: the *userID* and an array of *itemIDs* and scores. - -**Step 4: Making use of the output and doing more with Mahout** - -The output file generated in our simple example can be transformed using your tool of choice and consumed by downstream applications. There exist a variety of configuration options for Mahout’s item based recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Our user community is accessible via our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html) and the book *Mahout In Action* is a fantastic (but slightly outdated) starting point. 
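Step 1 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order records and the helper names (`order_count_preferences`, `to_csv`) are made up, and integer IDs are used because the legacy MapReduce jobs expect numeric user and item IDs. It counts distinct orders per customer and product and renders the result as comma-separated text with no header row.

```python
import csv
import io
from collections import defaultdict

# Hypothetical order records: (order ID, customer ID, product ID).
orders = [
    (1001, 1, 10),
    (1001, 1, 11),
    (1002, 1, 10),
    (1003, 2, 11),
    (1004, 2, 12),
    (1004, 2, 12),  # the same product twice on one order counts once
]


def order_count_preferences(records):
    """Count distinct orders per (customer, product) pair."""
    seen = defaultdict(set)
    for order_id, user_id, item_id in records:
        seen[(user_id, item_id)].add(order_id)
    return sorted((user, item, len(ids)) for (user, item), ids in seen.items())


def to_csv(rows):
    """Render rows as comma-separated text with no header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


preferences = order_count_preferences(orders)
print(to_csv(preferences), end="")
```

Writing the returned text to a *.txt* file gives an input in the *userID*, *itemID*, *preference* layout described in Step 1.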
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/matrix-factorization.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/matrix-factorization.md b/website-old/docs/tutorials/map-reduce/recommender/matrix-factorization.md
deleted file mode 100644
index 4cff138..0000000
--- a/website-old/docs/tutorials/map-reduce/recommender/matrix-factorization.md
+++ /dev/null
@@ -1,187 +0,0 @@
----
-layout: default
-title: (Deprecated) Perceptron and Winnow
-theme:
-    name: retro-mahout
----
-
-# Introduction to Matrix Factorization for Recommendation Mining
-
-In the mathematical discipline of linear algebra, a matrix decomposition
-or matrix factorization is a dimensionality reduction technique that factorizes a matrix into a product of matrices, usually two.
-There are many different matrix decompositions; each finds use among a particular class of problems.
-
-In Mahout, the SVDRecommender provides an interface for building a recommender based on matrix factorization.
-The idea is to project the users and items onto a feature space and to optimize U and M so that U \* (M^t) is as close to R as possible:
-
-     U is n * p user feature matrix,
-     M is m * p item feature matrix, M^t is the transpose of M,
-     R is n * m rating matrix,
-     n is the number of users,
-     m is the number of items,
-     p is the number of features
-
-We usually use RMSE to represent the deviations between predictions and actual ratings.
-RMSE is defined as the square root of the mean of the squared errors over all known user-item ratings.
-So our matrix factorization target can be mathematically defined as:
-
-     find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5))
-
-     SSE = sum(e(u,i)^2)
-     e(u,i) = r(u, i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, .. p - 1
-     K is the number of known user item ratings.
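
The RMSE target defined above can be illustrated with a small plain-Java sketch. This is not Mahout code: the class name `RmseSketch`, its methods and the toy matrices are assumptions made purely for illustration, and ratings are passed as `{userIndex, itemIndex, rating}` triples.

```java
public class RmseSketch {

    /** Predicted rating: dot product U[u,] * (M[i,]^t) over the p features. */
    static double predict(double[][] U, double[][] M, int u, int i) {
        double sum = 0.0;
        for (int f = 0; f < U[u].length; f++) {
            sum += U[u][f] * M[i][f];
        }
        return sum;
    }

    /** RMSE = sqrt(SSE / K), where each ratings[k] is {user index, item index, rating}. */
    static double rmse(double[][] U, double[][] M, double[][] ratings) {
        double sse = 0.0;
        for (double[] r : ratings) {
            double e = r[2] - predict(U, M, (int) r[0], (int) r[1]);
            sse += e * e;
        }
        return Math.sqrt(sse / ratings.length);
    }

    public static void main(String[] args) {
        // Hypothetical toy factorization: 2 users, 2 items, p = 2 features.
        double[][] U = {{1.0, 0.5}, {0.2, 1.0}};
        double[][] M = {{2.0, 1.0}, {0.5, 3.0}};
        // Only the known ratings (K = 3) contribute to the error.
        double[][] ratings = {{0, 0, 3.0}, {0, 1, 2.0}, {1, 1, 3.0}};
        System.out.println("RMSE = " + rmse(U, M, ratings));
    }
}
```

Only the known ratings enter the sum, which is why the error is divided by K rather than by n * m.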
-
-
-
-Mahout has implemented matrix factorization based on
-
-    (1) SGD (Stochastic Gradient Descent)
-    (2) ALSWR (Alternating-Least-Squares with Weighted-λ-Regularization).
-
-## SGD
-
-Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
-
-       Q(w) = sum(Q_i(w)),
-
-where w is the parameters to be estimated,
-      Q(w) is the objective function that could be expressed as sum of differentiable functions,
-      Q_i(w) is associated with the i-th observation in the data set
-
-In practice, w is estimated iteratively, one sample at a time, until an approximate minimum is obtained,
-
-      w = w - alpha * (d(Q_i(w))/dw),
-where alpha is the learning rate,
-      (d(Q_i(w))/dw) is the first derivative of Q_i(w) with respect to w.
-
-In matrix factorization, the RatingSGDFactorizer class implements the SGD with w = (U, M) and objective function Q(w) = sum(Q(u,i)),
-
-       Q(u,i) =  e(u,i) * e(u,i) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2
-
-where Q(u, i) is the objective function for user u and item i,
-      e(u, i) is the error between predicted rating and actual rating,
-      U[u,] is the feature vector of user u,
-      M[i,] is the feature vector of item i,
-      lambda is the regularization parameter to prevent overfitting.
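
As a worked step, and assuming the per-rating objective is read as Q(u,i) = e(u,i)^2 / 2 + lambda * (||U[u,]||^2 + ||M[i,]||^2) / 2 with e(u,i) = r(u,i) - sum(U[u,f] * M[i,f]), the partial derivatives that drive the SGD step can be written out as follows (the subscript notation is ours, not the original document's):

```latex
\begin{aligned}
\frac{\partial e(u,i)}{\partial U_{u,f}} &= -M_{i,f}, &
\frac{\partial Q(u,i)}{\partial U_{u,f}} &= -e(u,i)\,M_{i,f} + \lambda\,U_{u,f},\\
\frac{\partial e(u,i)}{\partial M_{i,f}} &= -U_{u,f}, &
\frac{\partial Q(u,i)}{\partial M_{i,f}} &= -e(u,i)\,U_{u,f} + \lambda\,M_{i,f}.
\end{aligned}
```

Substituting these into w = w - alpha * (d(Q_i(w))/dw) gives U[u,f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f]) for the user features and the symmetric expression for M[i,f].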
-
-The algorithm is sketched as follows:
-
-    init U and M with randomized value between 0.0 and 1.0 with standard Gaussian distribution
-
-    for(iter = 0; iter < numIterations; iter++)
-    {
-        for(user u and item i with rating R[u,i])
-        {
-            predicted_rating = U[u,] * M[i,]^t //dot product of feature vectors between user u and item i
-            err = R[u, i] - predicted_rating   //this is e(u,i)
-            //adjust U[u,] and M[i,]
-            // p is the number of features
-            for(f = 0; f < p; f++) {
-               NU[u,f] = U[u,f] - alpha * d(Q(u,i))/d(U[u,f]) //optimize U[u,f]
-                       = U[u, f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f])
-            }
-            for(f = 0; f < p; f++) {
-               M[i,f] = M[i,f] - alpha * d(Q(u,i))/d(M[i,f]) //optimize M[i,f]
-                      = M[i,f] + alpha * (e(u,i) * U[u,f] - lambda * M[i,f])
-            }
-            U[u,] = NU[u,]
-        }
-    }
-
-## SVD++
-
-SVD++ is an enhancement of the SGD matrix factorization.
-
-It can be considered an integration of the latent factor model and the neighborhood based model, considering not only how users rate, but also who has rated what.
-
-The complete model is a sum of 3 sub-models, with the complete prediction formula as follows:
-
-    pr(u,i) = b[u,i] + fm + nm   //user u and item i
-
-    pr(u,i) is the predicted rating of user u on item i,
-    b[u,i] = U + b(u) + b(i)
-    fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])),  j is an item in N(u)
-    nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item in N(i;u;k)
-
-The associated regularized squared error function to be minimized is:
-
-    {sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i])) + lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))}
-
-b[u,i] is the baseline estimate of user u's predicted rating on item i. U is users' overall average rating and b(u) and b(i) indicate the observed deviations of user u and item i's ratings from average.
-
-The baseline estimate adjusts for the user and item effects - i.e., systematic tendencies for some users to give higher ratings than others and tendencies
-for some items to receive higher ratings than other items.
-
-fm is the latent factor model, which captures the interactions between user and item via a feature layer. q[i,] is the feature vector of item i, and the rest of the formula represents user u with a user feature vector and a sum of features of items in N(u),
-N(u) is the set of items for which user u has expressed a preference, y[j,] is the feature vector of an item in N(u).
-
-nm is an extension of the classic item-based neighborhood model.
-It captures not only the user's explicit ratings but also the user's implicit preferences. R(i;u;k) is the set of items that have received an explicit rating from user u, restricted to the top k items most similar to i. r[u,j0] is the actual rating of user u on item j0,
-b[u,j0] is the corresponding baseline estimate.
-
-The difference between r[u,j0] and b[u,j0] is weighted by a parameter w[i,j0], which can be thought of as the similarity between item i and j0.
-
-N(i;u;k) is the set of the top k most similar items for which the user has expressed a preference.
-c[i,j1] is the parameter to be estimated.
-
-The values of w[i,j0] and c[i,j1] can be treated as the significance of the
-user's explicit rating and implicit preference respectively.
-
-The parameters b, y, q, w, c are determined by minimizing the associated regularized squared error function through gradient descent. We loop over all known ratings and, for a given training case r[u,i], we apply gradient descent on the error function and modify the parameters by moving in the opposite direction of the gradient.
-
-For a complete analysis of the SVD++ algorithm,
-please refer to the paper [Yehuda Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD 2008](http://research.yahoo.com/files/kdd08koren.pdf).
-
-In Mahout, the SVDPlusPlusFactorizer class is a simplified implementation of the SVD++ algorithm. It mainly uses the latent factor model with item feature vector, user feature vector and user's preference, with pr(u,i) = fm = (q[i,]) \* (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])), and the parameters to be determined are q, p, y.
-
-The update to q, p, y in each gradient descent step is:
-
-    err(u,i) = r[u,i] - pr[u,i]
-    q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lambda * q[i,])
-    p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,])
-    for j that is an item in N(u):
-       y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,])
-
-where alpha is the learning rate of gradient descent and N(u) is the set of items for which user u has expressed a preference.
-
-## Parallel SGD
-
-Mahout has a parallel SGD implementation in the ParallelSGDFactorizer class. It shuffles the user ratings in every iteration and
-generates splits on the shuffled ratings. Each split is handled by a thread to update the user features and item features using
-vanilla SGD.
-
-The implementation can be traced back to a lock-free version of SGD based on the paper
-[Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf).
-
-## ALSWR
-
-ALSWR is an iterative algorithm to solve the low rank factorization of user feature matrix U and item feature matrix M.
-The loss function to be minimized is formulated as the sum of squared errors plus [Tikhonov regularization](http://en.wikipedia.org/wiki/Tikhonov_regularization):
-
-     L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2))
-
-At the beginning of the algorithm, M is initialized with the average item ratings as its first row and random numbers for the remaining rows.
-
-In every iteration, we fix M and solve for U by minimizing the cost function L(R, U, M), then we fix U and solve for M by minimizing
-the cost function similarly. The iteration continues until a stopping criterion is met.
-
-To solve for the matrix U when M is given, each user's feature vector is calculated by solving a regularized linear least squares problem
-using the items the user has rated and their feature vectors:
-
-      1/2 * d(L(R,U,M)) / d(U[u,f]) = 0
-
-Similarly, when M is updated, we solve a regularized linear least squares problem using the users that have rated the
-item and their feature vectors:
-
-      1/2 * d(L(R,U,M)) / d(M[i,f]) = 0
-
-The ALSWRFactorizer class is a non-distributed implementation of ALSWR using multi-threading to dispatch the computation among several threads.
-Mahout also offers a [parallel map-reduce implementation](https://mahout.apache.org/users/recommender/intro-als-hadoop.html).
-
-
-# Reference:
-
-[Stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
-
-[ALSWR](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
-
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/quickstart.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/quickstart.md b/website-old/docs/tutorials/map-reduce/recommender/quickstart.md
deleted file mode 100644
index 545c64a..0000000
--- a/website-old/docs/tutorials/map-reduce/recommender/quickstart.md
+++ /dev/null
@@ -1,32 +0,0 @@
----
-layout: default
-title: (Deprecated) Recommender Quickstart
-theme:
-    name: retro-mahout
----
-
-# Recommender Overview
-
-Recommenders have changed over the years. Mahout contains a long list of them, which you can still use.
But to get the best out of our more modern approach we'll need to think of the Recommender as a "model creation" component, supplied by Mahout's new spark-itemsimilarity job, and a "serving" component, supplied by a modern scalable search engine like Solr.
-
-![image](http://i.imgur.com/fliHMBo.png)
-
-To integrate with your application you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to csv files (user-id, item-id). The DB should be set up to contain the last n user interactions, which will form part of the query for recommendations.
-
-Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in csv form. Think of this as an item collection with one field containing the item-ids of similar items. Index this with your search engine.
-
-When your application needs recommendations for a specific person, get the latest user history of interactions from the DB and query the indicator collection with this history. You will get back an ordered list of item-ids. These are your recommendations. You may wish to filter out any that the user has already seen but that will depend on your use case.
-
-All ids for users and items are preserved as string tokens, so they work as external keys in DBs, as doc ids for search engines, and as tokens for search queries.
-
-##References
-
-1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
-2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
-3.
Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
-and [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
-4. A post describing the loglikelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations O(n) complexity.
-
-##Mahout Model Creation
-
-See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) for more details.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/recommender-documentation.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/recommender-documentation.md b/website-old/docs/tutorials/map-reduce/recommender/recommender-documentation.md
deleted file mode 100644
index c6d9ffb..0000000
--- a/website-old/docs/tutorials/map-reduce/recommender/recommender-documentation.md
+++ /dev/null
@@ -1,277 +0,0 @@
----
-layout: default
-title: (Deprecated) Recommender Documentation
-theme:
-    name: retro-mahout
----
-
-
-## Overview
-
-_This documentation concerns the non-distributed, non-Hadoop-based
-recommender engine / collaborative filtering code inside Mahout. It was
-formerly a separate project called "Taste" and has continued development
-inside Mahout alongside other Hadoop-based code. It may be viewed as a
-somewhat separate, more comprehensive and more mature aspect of this
-code, compared to current development efforts focusing on Hadoop-based
-distributed recommenders.
This remains the best entry point into Mahout
-recommender engines of all kinds._
-
-A Mahout-based collaborative filtering engine takes users' preferences for
-items ("tastes") and returns estimated preferences for other items. For
-example, a site that sells books or CDs could easily use Mahout to figure
-out, from past purchase data, which CDs a customer might be interested in
-listening to.
-
-Mahout provides a rich set of components from which you can construct a
-customized recommender system from a selection of algorithms. Mahout is
-designed to be enterprise-ready; it's designed for performance, scalability
-and flexibility.
-
-Top-level packages define the Mahout interfaces to these key abstractions:
-
-* **DataModel**
-* **UserSimilarity**
-* **ItemSimilarity**
-* **UserNeighborhood**
-* **Recommender**
-
-Subpackages of *org.apache.mahout.cf.taste.impl* hold implementations of
-these interfaces. These are the pieces from which you will build your own
-recommendation engine. That's it!
-
-
-## Architecture
-
-![doc](../../images/taste-architecture.png)
-
-This diagram shows the relationship between various Mahout components in a
-user-based recommender. An item-based recommender system is similar except
-that there are no Neighborhood algorithms involved.
-
-
-### Recommender
-A Recommender is the core abstraction in Mahout. Given a DataModel, it can
-produce recommendations. Applications will most likely use the
-**GenericUserBasedRecommender** or **GenericItemBasedRecommender**,
-possibly decorated by **CachingRecommender**.
-
-
-### DataModel
-A **DataModel** is the interface to information about user preferences. An
-implementation might draw this data from any source, but a database is the
-most likely source. Be sure to wrap this with a **ReloadFromJDBCDataModel** to get good performance! Mahout provides **MySQLJDBCDataModel**, for example, to access preference data from a database via JDBC and MySQL. Another exists for PostgreSQL.
Mahout also provides a **FileDataModel**, which is fine for small applications.
-
-Users and items are identified solely by an ID value in the
-framework. Further, this ID value must be numeric; it is a Java long type
-through the APIs. A **Preference** object or **PreferenceArray** object
-encapsulates the relation between user and preferred items (or items and
-users preferring them).
-
-Finally, Mahout supports, in various ways, a so-called "boolean" data model
-in which users do not express preferences of varying strengths for items,
-but simply express an association or none at all. For example, while users
-might express a preference from 1 to 5 in the context of a movie
-recommender site, there may be no notion of a preference value between
-users and pages in the context of recommending pages on a web site: there
-is only a notion of an association, or none, between a user and pages that
-have been visited.
-
-
-### UserSimilarity
-A **UserSimilarity** defines a notion of similarity between two users. This is
-a crucial part of a recommendation engine. These are attached to a
-**Neighborhood** implementation. **ItemSimilarity** is analogous, but finds
-similarity between items.
-
-
-### UserNeighborhood
-In a user-based recommender, recommendations are produced by finding a
-"neighborhood" of similar users near a given user. A **UserNeighborhood**
-defines a means of determining that neighborhood, for example the
-nearest 10 users. Implementations typically need a **UserSimilarity** to
-operate.
-
-
-## Examples
-
-### User-based Recommender
-User-based recommenders are the "original", conventional style of
-recommender systems. They can produce good recommendations when tweaked
-properly; they are not necessarily the fastest recommender systems and are
-thus suitable for small data sets (roughly, less than ten million ratings).
-We'll start with an example of this.
-
-First, create a **DataModel** of some kind.
Here, we'll use a simple one based
-on data in a file. The file should be in CSV format, with lines of the form
-"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
-
-
-    DataModel model = new FileDataModel(new File("data.txt"));
-
-
-We'll use the **PearsonCorrelationSimilarity** implementation of **UserSimilarity**
-as our user correlation algorithm:
-
-
-    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
-
-
-Now we create a **UserNeighborhood** algorithm. Here we use nearest-3:
-
-
-    UserNeighborhood neighborhood =
-      new NearestNUserNeighborhood(3, userSimilarity, model);
-
-Now we can create our **Recommender**, and add a caching decorator:
-
-
-    Recommender recommender =
-      new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
-    Recommender cachingRecommender = new CachingRecommender(recommender);
-
-
-Now we can get 10 recommendations for user ID "1234". Done!
-
-    List<RecommendedItem> recommendations =
-      cachingRecommender.recommend(1234, 10);
-
-
-### Item-based Recommender
-
-We could have created an item-based recommender instead. Item-based
-recommenders base recommendation not on user similarity, but on item
-similarity. In theory these are about the same approach to the problem,
-just from different angles. However the similarity of two items is
-relatively fixed, more so than the similarity of two users. So, item-based
-recommenders can use pre-computed similarity values in the computations,
-which make them much faster. For large data sets, item-based recommenders
-are more appropriate.
-
-Let's start over, again with a **FileDataModel** to start:
-
-
-    DataModel model = new FileDataModel(new File("data.txt"));
-
-
-We'll also need an **ItemSimilarity**. We could use
-**PearsonCorrelationSimilarity**, which computes item similarity in realtime,
-but this is generally too slow to be useful.
Instead, in a real
-application, you would feed a list of pre-computed correlations to a
-**GenericItemSimilarity**:
-
-
-    // Construct the list of pre-computed correlations
-    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
-      ...;
-    ItemSimilarity itemSimilarity =
-      new GenericItemSimilarity(correlations);
-
-
-
-Then we can finish as before to produce recommendations:
-
-
-    Recommender recommender =
-      new GenericItemBasedRecommender(model, itemSimilarity);
-    Recommender cachingRecommender = new CachingRecommender(recommender);
-    ...
-    List<RecommendedItem> recommendations =
-      cachingRecommender.recommend(1234, 10);
-
-
-
-## Integration with your application
-
-You can create a Recommender, as shown above, wherever you like in your
-Java application, and use it. This includes simple Java applications or GUI
-applications, server applications, and J2EE web applications.
-
-
-## Performance
-
-### Runtime Performance
-The more data you give, the better. Though Mahout is designed for
-performance, you will undoubtedly run into performance issues at some
-point. For best results, consider using the following command-line flags to
-your JVM:
-
-* -server: Enables the server VM, which is generally appropriate for
-long-running, computation-intensive applications.
-* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
-doesn't hurt when dealing with tens of millions of preferences. Mahout
-recommenders will generally use as much memory as you give them for caching,
-which helps performance. Set the initial and max size to the same value to
-avoid wasting time growing the heap, and to avoid having the JVM run minor
-collections to avoid growing the heap, which will clear cached values.
-* -da -dsa: Disable all assertions.
-* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
-of them in this framework
-* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
-Use a GC algorithm designed to take advantage of multiple processors, and
-designed for throughput.
This is a default in J2SE 5.0.
-* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can
-only hurt in the presence of modern GC algorithms; they may force Mahout to
-remove cached data needlessly. This flag isn't needed if you're sure your
-code and third-party code you use doesn't call this method.
-
-Also consider the following tips:
-
-* Use **CachingRecommender** on top of your custom **Recommender** implementation.
-* When using **JDBCDataModel**, make sure you wrap it with the **ReloadFromJDBCDataModel** to load data into memory!
-
-
-### Algorithm Performance: Which One Is Best?
-There is no right answer; it depends on your data, your application,
-environment, and performance needs. Mahout provides the building blocks
-from which you can construct the best Recommender for your application. The
-links below provide research on this topic. You will probably need a bit of
-trial-and-error to find a setup that works best. The code sample above
-provides a good starting point.
-
-Fortunately, Mahout provides a way to evaluate the accuracy of your
-Recommender on your own data, in org.apache.mahout.cf.taste.eval:
-
-
-    DataModel myModel = ...;
-    RecommenderBuilder builder = new RecommenderBuilder() {
-      public Recommender buildRecommender(DataModel model) {
-        // build and return the Recommender to evaluate here
-      }
-    };
-    RecommenderEvaluator evaluator =
-      new AverageAbsoluteDifferenceRecommenderEvaluator();
-    // null DataModelBuilder: use the default; train on 90% of the data, evaluate on all users
-    double evaluation = evaluator.evaluate(builder, null, myModel, 0.9, 1.0);
-
-
-For "boolean" data model situations, where there are no notions of
-preference value, the above evaluation based on estimated preference does
-not make sense. In this case, try a *RecommenderIRStatsEvaluator*, which presents
-traditional information retrieval figures like precision and recall, which
-are more meaningful.
-
-
-
-## Useful Links
-
-
-Here's a handful of research papers that I've read and found particularly
-useful:
-
-J.S. Breese, D. Heckerman and C.
Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166),"
-in Proceedings of the Fourteenth Conference on Uncertainty in
-Artificial Intelligence (UAI 1998), 1998.
-
-B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation algorithms](http://www10.org/cdrom/papers/519/),"
-in Proceedings of the Tenth International Conference on the World Wide
-Web (WWW 10), pp. 285-295, 2001.
-
-P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905),"
-in Proceedings of the 1994 ACM conference on Computer Supported
-Cooperative Work (CSCW 1994), pp. 175-186, 1994.
-
-J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf),"
-in Proceedings of the 22nd annual international ACM SIGIR Conference on
-Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
-1999.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md b/website-old/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
deleted file mode 100644
index cdf425e..0000000
--- a/website-old/docs/tutorials/map-reduce/recommender/recommender-first-timer-faq.md
+++ /dev/null
@@ -1,54 +0,0 @@
----
-layout: default
-title: (Deprecated) Recommender First-Timer FAQ
-theme:
-    name: retro-mahout
----
-
-# Recommender First Timer Dos and Don'ts
-
-Many people with an interest in recommenders arrive at Mahout since they're
-building a first recommender system.
Some starting questions have been
-asked enough times to warrant a FAQ collecting advice and rules-of-thumb for
-newcomers.
-
-For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/).
-
-Don't start with a distributed, Hadoop-based recommender; take on that
-complexity only if necessary. Start with non-distributed recommenders. It
-is simpler, has fewer requirements, and is more flexible.
-
-As a crude rule of thumb, a system with up to 100M user-item associations
-(ratings, preferences) should "fit" onto one modern server machine with 4GB
-of heap available and run acceptably as a real-time recommender. The system
-is invariably memory-bound since keeping data in memory is essential to
-performance.
-
-Beyond this point it gets expensive to deploy a machine with enough RAM,
-so designing for a distributed setup makes sense when nearing this scale.
-However most applications don't "really" have 100M associations to process.
-Data can be sampled; noisy and old data can often be aggressively pruned
-without significant impact on the result.
-
-The next question is whether or not your system has preference values, or
-ratings. Do users and items merely have an association or not, such as the
-existence or lack of a click? Or is behavior translated into some scalar
-value representing the user's degree of preference for the item?
-
-If you have ratings, then a good place to start is a
-GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
-metric. If you don't have ratings, then a good place to start is
-GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
-
-If you want to do content-based item-item similarity, you need to implement
-your own ItemSimilarity.
-
-If your data can be simply exported to a CSV file, use FileDataModel and
-push new files periodically.
-If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
-counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put on
-top a ReloadFromJDBCDataModel.
-
-This should give a reasonable starter system which responds fast. The
-nature of the system is that new data comes in from the file or database
-only periodically -- perhaps on the order of minutes.
\ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
----------------------------------------------------------------------
diff --git a/website-old/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md b/website-old/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
deleted file mode 100644
index 6234275..0000000
--- a/website-old/docs/tutorials/map-reduce/recommender/userbased-5-minutes.md
+++ /dev/null
@@ -1,133 +0,0 @@
----
-layout: default
-title: (Deprecated) User Based Recommender in 5 Minutes
-theme:
-    name: retro-mahout
----
-
-# Creating a User-Based Recommender in 5 minutes
-
-##Prerequisites
-
-Create a Java project in your favorite IDE and make sure Mahout is on the classpath. The easiest way to accomplish this is by importing it via Maven as described on the [Quickstart](/users/basics/quickstart.html) page.
-
-
-## Dataset
-
-Mahout's recommenders expect interactions between users and items as input. The easiest way to supply such data to Mahout is in the form of a text file, where every line has the format *userID,itemID,value*. Here *userID* and *itemID* refer to a particular user and a particular item, and *value* denotes the strength of the interaction (e.g. the rating given to a movie).
-
-In this example, we'll use some made-up data for simplicity. Create a file called "dataset.csv" and copy the following example interactions into the file.
-
-
-1,10,1.0
-1,11,2.0
-1,12,5.0
-1,13,5.0
-1,14,5.0
-1,15,4.0
-1,16,5.0
-1,17,1.0
-1,18,5.0
-2,10,1.0
-2,11,2.0
-2,15,5.0
-2,16,4.5
-2,17,1.0
-2,18,5.0
-3,11,2.5
-3,12,4.5
-3,13,4.0
-3,14,3.0
-3,15,3.5
-3,16,4.5
-3,17,4.0
-3,18,5.0
-4,10,5.0
-4,11,5.0
-4,12,5.0
-4,13,0.0
-4,14,2.0
-4,15,3.0
-4,16,1.0
-4,17,4.0
-4,18,1.0
-
-
-## Creating a user-based recommender
-
-Create a class called *SampleRecommender* with a main method.
-
-The first thing we have to do is load the data from the file. Mahout's recommenders use an interface called *DataModel* to handle interaction data. You can load our made up interactions like this:
-
-
-DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
-
-
-In this example, we want to create a user-based recommender. The idea behind this approach is that when we want to compute recommendations for a particular user, we look for other users with a similar taste and pick the recommendations from their items. For finding similar users, we have to compare their interactions. There are several methods for doing this. One popular method is to compute the [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) between their interactions. In Mahout, you use this method as follows:
-
-
-UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
-
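
To make the correlation coefficient concrete, here is a minimal plain-Java sketch of a textbook Pearson correlation computed over the items two users have both rated. It is an illustration under stated assumptions, not Mahout's implementation: the class `PearsonSketch` and its helper `prefs` are hypothetical names, and Mahout's `PearsonCorrelationSimilarity` may differ in details such as weighting or the handling of edge cases.

```java
import java.util.HashMap;
import java.util.Map;

public class PearsonSketch {

    /** Builds an itemID -> rating map from two parallel arrays. */
    static Map<Long, Double> prefs(long[] items, double[] values) {
        Map<Long, Double> m = new HashMap<>();
        for (int k = 0; k < items.length; k++) {
            m.put(items[k], values[k]);
        }
        return m;
    }

    /** Pearson correlation over co-rated items; NaN when it is undefined. */
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item was not rated by both users
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n == 0) {
            return Double.NaN; // no overlap, no correlation
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denom = Math.sqrt(varA * varB);
        return denom == 0.0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        // Users 1 and 2 from dataset.csv; only items 10, 11, 15, 16, 17, 18 overlap.
        Map<Long, Double> user1 = prefs(
            new long[] {10, 11, 12, 13, 14, 15, 16, 17, 18},
            new double[] {1.0, 2.0, 5.0, 5.0, 5.0, 4.0, 5.0, 1.0, 5.0});
        Map<Long, Double> user2 = prefs(
            new long[] {10, 11, 15, 16, 17, 18},
            new double[] {1.0, 2.0, 5.0, 4.5, 1.0, 5.0});
        System.out.println("similarity(1, 2) = " + pearson(user1, user2));
    }
}
```

Because only co-rated items count, users 1 and 2 come out as strongly correlated even though user 2 has rated fewer items.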
-
-The next thing we have to do is to define which similar users we want to leverage for the recommender. For the sake of simplicity, we'll use all that have a similarity greater than *0.1*. This is implemented via a *ThresholdUserNeighborhood*:
-
-
-UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
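
The threshold rule itself is simple enough to sketch in plain Java. The snippet below is an assumption-laden illustration rather than Mahout code: `ThresholdNeighborhoodSketch`, the array-of-similarities input and the sample values are all hypothetical, and the strict `>` comparison follows the "greater than 0.1" wording above (the exact boundary handling in `ThresholdUserNeighborhood` is not confirmed here).

```java
import java.util.ArrayList;
import java.util.List;

public class ThresholdNeighborhoodSketch {

    /** Indices of all other users whose similarity to the target exceeds the threshold. */
    static List<Integer> neighbors(double[] similarityToTarget, int target, double threshold) {
        List<Integer> result = new ArrayList<>();
        for (int u = 0; u < similarityToTarget.length; u++) {
            double s = similarityToTarget[u];
            // Skip the target user and users without a defined similarity (NaN).
            if (u != target && !Double.isNaN(s) && s > threshold) {
                result.add(u);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up similarities of users 0..4 to user 0.
        double[] sims = {1.0, 0.97, 0.05, Double.NaN, -0.4};
        System.out.println(neighbors(sims, 0, 0.1));
    }
}
```

Unlike a nearest-N neighborhood, the size of the result here depends on the data: a strict threshold can leave a user with no neighbors at all.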
-
-Now we have all the pieces to create our recommender:
-
-
-UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
-
- -We can easily ask the recommender for recommendations now. If we wanted to get three items recommended for the user with *userID* 2, we would do it like this: - - -
-List<RecommendedItem> recommendations = recommender.recommend(2, 3);
-for (RecommendedItem recommendation : recommendations) {
-  System.out.println(recommendation);
-}
-
- - -Congratulations, you have built your first recommender! - - -## Evaluation - -You might ask yourself how to make sure that your recommender returns good results. Unfortunately, the only way to be really sure about the quality is by doing an A/B test with real users in a live system. - -We can, however, try to get a feel for the quality through statistical offline evaluation. Just keep in mind that this does not replace a test with real users! - -One way to check whether the recommender returns good results is by doing a **hold-out** test. We partition our dataset into two sets: a training set consisting of 90% of the data and a test set consisting of 10%. Then we train our recommender using the training set and look at how well it predicts the unknown interactions in the test set. - -To test our recommender, we create a class called *EvaluateRecommender* with a main method and add an inner class called *MyRecommenderBuilder* that implements the *RecommenderBuilder* interface. We implement the *buildRecommender* method and make it set up our user-based recommender: - -
-UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
-UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel);
-return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
-
- -Now we have to create the code for the test. We'll check how much the recommender misses the real interaction strength on average. We employ an *AverageAbsoluteDifferenceRecommenderEvaluator* for this. The following code shows how to put the pieces together and run a hold-out test: - -
-DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
-RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
-RecommenderBuilder builder = new MyRecommenderBuilder();
-double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
-System.out.println(result);
-
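The number printed by the evaluation is an average absolute difference between predicted and actual interaction strengths on the held-out data. The following plain-Java sketch shows that metric in isolation. It is an illustration, not Mahout's evaluator: `MeanAbsoluteErrorSketch` and `meanAbsoluteError` are hypothetical names, and the sample values are made up.

```java
// Hypothetical illustration of the metric; not Mahout's evaluator class.
final class MeanAbsoluteErrorSketch {

    // Average of |actual - predicted| over all held-out interactions.
    // Returns NaN when there is nothing to evaluate.
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        if (actual.length == 0) {
            return Double.NaN;
        }
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - predicted[i]);
        }
        return total / actual.length;
    }

    public static void main(String[] args) {
        // Made-up held-out values and the estimates a recommender might produce.
        double[] actual = {4.0, 2.0, 5.0};
        double[] predicted = {3.5, 2.0, 4.0};
        System.out.println(meanAbsoluteError(actual, predicted));
    }
}
```

Lower values are better: a result of 0.0 would mean every held-out value was predicted exactly. In the `evaluate` call above, `0.9` is the fraction of data used for training and `1.0` is the fraction of users included in the evaluation.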
- -Note: if you run this test multiple times, you will get different results, because the split into training set and test set is done randomly. - - - - - - - - - - - http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/Foo.scala ---------------------------------------------------------------------- diff --git a/website-old/docs/tutorials/misc/contributing-algos/Foo.scala b/website-old/docs/tutorials/misc/contributing-algos/Foo.scala deleted file mode 100644 index 171993f..0000000 --- a/website-old/docs/tutorials/misc/contributing-algos/Foo.scala +++ /dev/null @@ -1,75 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - - -package org.apache.mahout.math.algorithms.regression - - -import org.apache.mahout.math._ -import org.apache.mahout.math.drm._ -import org.apache.mahout.math.drm.RLikeDrmOps._ -import org.apache.mahout.math.function.VectorFunction -import org.apache.mahout.math.scalabindings._ -import org.apache.mahout.math.scalabindings.RLikeOps._ -import org.apache.mahout.math.{Matrix, Vector} -import org.apache.mahout.math.drm.DrmLike - -class Foo[K] extends RegressorFitter[K] { - - var guessThisNumber: Double = _ - - def setStandardHyperparameters(hyperparameters: Map[Symbol, Any] = Map('foo -> None)): Unit = { - guessThisNumber = hyperparameters.asInstanceOf[Map[Symbol, Double]].getOrElse('guessThisNumber, 1.0) - } - def fit(drmX : DrmLike[K], - drmTarget: DrmLike[K], - hyperparameters: (Symbol, Any)*): FooModel[K] ={ - /** - * Normally one would have a lot more code here. - */ - - var model = new FooModel[K] - - setStandardHyperparameters(hyperparameters.toMap) - model.guessThisNumber = guessThisNumber - model.summary = s"This model will always guess ${model.guessThisNumber}" - model - } -} - -class FooModel[K] extends RegressorModel[K] { - - var guessThisNumber: Double = _ - - def predict(drmPredictors: DrmLike[K]): DrmLike[K] = { - - // This is needed for MapBlock - implicit val ktag = drmPredictors.keyClassTag - // This is needed for broadcasting - implicit val ctx = drmPredictors.context - - val bcGuess = drmBroadcast(dvec(guessThisNumber)) - drmPredictors.mapBlock(1) { - case (keys, block: Matrix) => { - var outputBlock = new DenseMatrix(block.nrow, 1) - keys -> (outputBlock += bcGuess.value.get(0)) - } - } - } -} \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/RegressionSuiteBase.scala ---------------------------------------------------------------------- diff --git a/website-old/docs/tutorials/misc/contributing-algos/RegressionSuiteBase.scala 
b/website-old/docs/tutorials/misc/contributing-algos/RegressionSuiteBase.scala deleted file mode 100644 index 3290ea1..0000000 --- a/website-old/docs/tutorials/misc/contributing-algos/RegressionSuiteBase.scala +++ /dev/null @@ -1,202 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.mahout.math.algorithms - -import org.apache.mahout.math.algorithms.regression._ -import org.apache.mahout.math.drm._ -import org.apache.mahout.math.drm.RLikeDrmOps._ -import org.apache.mahout.math.scalabindings._ -import org.apache.mahout.math.scalabindings.RLikeOps._ -import org.apache.mahout.test.DistributedMahoutSuite -import org.scalatest.{FunSuite, Matchers} - -trait RegressionSuiteBase extends DistributedMahoutSuite with Matchers { - this: FunSuite => - - val epsilon = 1E-6 - - test("ordinary least squares") { - /* - R Prototype: - dataM <- matrix( c(2, 2, 10.5, 10, 29.509541, - 1, 2, 12, 12, 18.042851, - 1, 1, 12, 13, 22.736446, - 2, 1, 11, 13, 32.207582, - 1, 2, 12, 11, 21.871292, - 2, 1, 16, 8, 36.187559, - 6, 2, 17, 1, 50.764999, - 3, 2, 13, 7, 40.400208, - 3, 3, 13, 4, 45.811716), nrow=9, ncol=5, byrow=TRUE) - - - X = dataM[, c(1,2,3,4)] - y = dataM[, c(5)] - - model <- lm(y ~ X ) - summary(model) - - */ - - val drmData = drmParallelize(dense( - (2, 2, 10.5, 10, 29.509541), // Apple Cinnamon Cheerios - (1, 2, 12, 12, 18.042851), // Cap'n'Crunch - (1, 1, 12, 13, 22.736446), // Cocoa Puffs - (2, 1, 11, 13, 32.207582), // Froot Loops - (1, 2, 12, 11, 21.871292), // Honey Graham Ohs - (2, 1, 16, 8, 36.187559), // Wheaties Honey Gold - (6, 2, 17, 1, 50.764999), // Cheerios - (3, 2, 13, 7, 40.400208), // Clusters - (3, 3, 13, 4, 45.811716)), numPartitions = 2) - - - val drmX = drmData(::, 0 until 4) - val drmY = drmData(::, 4 until 5) - - val model = new OrdinaryLeastSquares[Int]().fit(drmX, drmY, 'calcCommonStatistics → false) - - val estimate = model.beta - val Ranswers = dvec(-1.336265, -13.157702, -4.152654, -5.679908, 163.179329) - - val epsilon = 1E-6 - (estimate - Ranswers).sum should be < epsilon - - // TODO add test for S.E / pvalue - } - - test("cochrane-orcutt"){ - /* R Prototype: - library(orcutt) - - df = data.frame(t(data.frame( - c(20.96, 127.3), - c(21.40, 130.0), - c(21.96, 132.7), - c(21.52, 129.4), - c(22.39, 
135.0), - c(22.76, 137.1), - c(23.48, 141.2), - c(23.66, 142.8), - c(24.10, 145.5), - c(24.01, 145.3), - c(24.54, 148.3), - c(24.30, 146.4), - c(25.00, 150.2), - c(25.64, 153.1), - c(26.36, 157.3), - c(26.98, 160.7), - c(27.52, 164.2), - c(27.78, 165.6), - c(28.24, 168.7), - c(28.78, 171.7)))) - - rownames(df) <- NULL - colnames(df) <- c("y", "x") - my_lm = lm(y ~ x, data=df) - coch = cochrane.orcutt(my_lm) - - /////////////////////////////////////// - The R-implementation is kind of...silly. - - The above works- converges at 318 iterations- the transformed DW is 1.72, yet the rho is - .95882. After 318 iterations, this will also report a rho of .95882 (which suggests SEVERE - autocorrelation- nothing close to 1.72). - - At any rate, the real prototype for this is the example from Applied Linear Statistical Models - 5th Edition by Kutner, Nachtsheim, Neter, and Li. They also provide some interesting notes on p 494: - 1) "Cochrane-Orcutt does not always work properly. A major reason is that when the error terms - are positively autocorrelated, the estimate r in (12.22) tends to underestimate the autocorrelation - parameter rho. When this bias is serious, it can significantly reduce the effectiveness of the - Cochrane-Orcutt approach. - 2. There exists an approximate relation between the Durbin Watson test statistic D in (12.14) - and the estimated autocorrelation parameter r in (12.22): - D ~= 2(1-r)" - - They also note on p492: - "... If the process does not terminate after one or two iterations, a different procedure - should be employed." - This differs from the logic found elsewhere, and the method presented in R where, in the simple - example in the prototype, the procedure runs for 318 iterations. This is why the default - maximum iterations are 3, and should be left as such. - - Also, the prototype and 'correct answers' are based on the example presented in Kutner et al. on - p492-4 (including dataset). 
- - */ - - val alsmBlaisdellCo = drmParallelize( dense( - (20.96, 127.3), - (21.40, 130.0), - (21.96, 132.7), - (21.52, 129.4), - (22.39, 135.0), - (22.76, 137.1), - (23.48, 141.2), - (23.66, 142.8), - (24.10, 145.5), - (24.01, 145.3), - (24.54, 148.3), - (24.30, 146.4), - (25.00, 150.2), - (25.64, 153.1), - (26.36, 157.3), - (26.98, 160.7), - (27.52, 164.2), - (27.78, 165.6), - (28.24, 168.7), - (28.78, 171.7) )) - - val drmY = alsmBlaisdellCo(::, 0 until 1) - val drmX = alsmBlaisdellCo(::, 1 until 2) - - var coModel = new CochraneOrcutt[Int]().fit(drmX, drmY , ('iterations -> 2)) - val coResiduals = drmY - coModel.predict(drmX) - - val correctRho = 0.631166 - (coModel.rhos(1) - correctRho) should be < epsilon - - val shortEpsilon = 1E-4 // book rounded off pretty short - val correctBeta = dvec(0.17376, -1.0685) - (coModel.betas(1) - correctBeta).sum.abs < shortEpsilon - - val correctSe = dvec(0.002957, 0.45332) - (coModel.se - correctSe).sum.abs < shortEpsilon - } - - test("foo") { - import org.apache.mahout.math.algorithms.regression.Foo - - val drmA = drmParallelize(dense((1.0, 1.2, 1.3, 1.4), - (1.1, 1.5, 2.5, 1.0), - (6.0, 5.2, -5.2, 5.3), - (7.0,6.0, 5.0, 5.0), - (10.0, 1.0, 20.0, -10.0))) - - val model = new Foo().fit(drmA(::, 0 until 2), drmA(::, 2 until 3), 'guessThisNumber -> 2.0) - - val myAnswer = model.predict(drmA).collect - val correctAnswer = dense( (2.0), - (2.0), - (2.0), - (2.0), - (2.0)) - - - val epsilon = 1E-6 - (myAnswer - correctAnswer).sum should be < epsilon - } -} http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/create-pr.png ---------------------------------------------------------------------- diff --git a/website-old/docs/tutorials/misc/contributing-algos/create-pr.png b/website-old/docs/tutorials/misc/contributing-algos/create-pr.png deleted file mode 100644 index 692be93..0000000 Binary files a/website-old/docs/tutorials/misc/contributing-algos/create-pr.png and /dev/null 
differ http://git-wip-us.apache.org/repos/asf/mahout/blob/9beddd31/website-old/docs/tutorials/misc/contributing-algos/github-branch.png ---------------------------------------------------------------------- diff --git a/website-old/docs/tutorials/misc/contributing-algos/github-branch.png b/website-old/docs/tutorials/misc/contributing-algos/github-branch.png deleted file mode 100644 index a5b2c92..0000000 Binary files a/website-old/docs/tutorials/misc/contributing-algos/github-branch.png and /dev/null differ