mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [8/9] mahout git commit: WEBSITE Triage of Old Site Migration
Date Sat, 29 Apr 2017 23:24:57 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/intro-cooccurrence-spark.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/intro-cooccurrence-spark.md b/website/old_site_migration/completed/intro-cooccurrence-spark.md
new file mode 100644
index 0000000..41506c7
--- /dev/null
+++ b/website/old_site_migration/completed/intro-cooccurrence-spark.md
@@ -0,0 +1,446 @@
+---
+layout: default
+title: Intro to Cooccurrence Recommenders with Spark
+theme:
+    name: retro-mahout
+---
+
+#Intro to Cooccurrence Recommenders with Spark
+
+Mahout provides several important building blocks for creating recommendations using Spark. *spark-itemsimilarity* can 
+be used to create "other people also liked these things" type recommendations and paired with a search engine can 
+personalize recommendations for individual users. *spark-rowsimilarity* can provide non-personalized content based 
+recommendations and when paired with a search engine can be used to personalize content based recommendations.
+
+![image](http://s6.postimg.org/r0m8bpjw1/recommender_architecture.png)
+
+This is a simplified Lambda architecture with Mahout's *spark-itemsimilarity* playing the batch model building role and a search engine playing the realtime serving role.
+
+You will create two collections, one for user history and one for item "indicators". Indicators are user interactions that lead to the interaction you want to encourage. For example, if you want users to purchase something and you collect all users' purchase interactions, *spark-itemsimilarity* will create a purchase indicator from them. But you can also use other user interactions in a cross-cooccurrence calculation to create purchase indicators. 
+
+User history is used as a query on the item collection with its cooccurrence and cross-cooccurrence indicators (there may be several indicators). The primary interaction or action is picked to be the thing you want to recommend; other actions are believed to be correlated but may not indicate exactly the same user intent. For instance, in an ecom recommender a purchase is a very good primary action, but you may also have recorded product detail-views or additions-to-wishlists. These can be considered secondary actions, which may all be used to calculate cross-cooccurrence indicators. The user history that forms the recommendations query will contain recorded primary and secondary actions, all targeted towards the correct indicator fields.
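+
+Conceptually (a hedged sketch in Mahout's Samsara DSL, with hypothetical matrix names and omitting the LLR downsampling the job actually applies), a cooccurrence indicator compares the primary action with itself, while a cross-cooccurrence indicator compares the primary action with a secondary one:
+
+    // drmPurchase and drmView are hypothetical user-by-item interaction DRMs
+    val cooccurrenceIndicator = drmPurchase.t %*% drmPurchase    // purchase with purchase (item x item)
+    val crossCooccurrenceIndicator = drmPurchase.t %*% drmView   // purchase with view (item x item)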
+
+##References
+
+1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: [Creating a Unified Recommender](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and  [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio:  [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html)  LLR is used to reduce noise in the data while keeping the calculations O(n) complexity; a brief sketch of the calculation follows this list.
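+
+As a standalone, hedged illustration of the standard 2x2 contingency-table form of LLR (Mahout has its own implementation; this sketch is only meant to show the idea):
+
+    // k11 = users who did both A and B, k12 = only A, k21 = only B, k22 = neither
+    def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)
+    def entropy(elements: Long*): Double = xLogX(elements.sum) - elements.map(xLogX).sum
+    def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double =
+      2.0 * (entropy(k11 + k12, k21 + k22)     // row entropy
+           + entropy(k11 + k21, k12 + k22)     // column entropy
+           - entropy(k11, k12, k21, k22))      // matrix entropy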
+
+Below are the command line jobs but the drivers and associated code can also be customized and accessed from the Scala APIs.
+
+##1. spark-itemsimilarity
+*spark-itemsimilarity* is the Spark counterpart of the Mahout mapreduce job called *itemsimilarity*. It takes in elements of interactions, which have a userID, an itemID, and optionally a value. It will produce one or more indicator matrices created by comparing every user's interactions with every other user's. The indicator matrix is an item x item matrix where the values are log-likelihood ratio strengths. For the legacy mapreduce version there were several possible similarity measures, but these are being deprecated in favor of LLR because in practice it performs best.
+
+Mahout's mapreduce version of itemsimilarity takes a text file that is expected to have user and item IDs that conform to 
+Mahout's ID requirements--they are non-negative integers that can be viewed as row and column numbers in a matrix.
+
+*spark-itemsimilarity* also extends the notion of cooccurrence to cross-cooccurrence, in other words the Spark version will 
+account for multi-modal interactions and create cross-cooccurrence indicator matrices allowing the use of much more data in 
+creating recommendations or similar item lists. People try to do this by mixing different actions and giving them weights. 
+For instance they might say an item-view is 0.2 of an item purchase. In practice this is often not helpful. Spark-itemsimilarity's
+cross-cooccurrence is a more principled way to handle this case. In effect it scrubs secondary actions with the action you want
+to recommend.   
+
+
+    spark-itemsimilarity Mahout 1.0
+    Usage: spark-itemsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
+      -i2 <value> | --input2 <value>
+            Secondary input path for cross-similarity calculation, same restrictions as "--input" (optional). Default: empty.
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mppu <value> | --maxPrefs <value>
+            Max number of preferences to consider per user (optional). Default: 500
+      -m <value> | --maxSimilaritiesPerItem <value>
+            Limit the number of similarities per item to this number (optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
+    
+    Input text file schema options:
+      -id <value> | --inDelim <value>
+            Input delimiter character (optional). Default: "[,\t]"
+      -f1 <value> | --filter1 <value>
+            String (or regex) whose presence indicates a datum for the primary item set (optional). Default: no filter, all data is used
+      -f2 <value> | --filter2 <value>
+            String (or regex) whose presence indicates a datum for the secondary item set (optional). If not present no secondary dataset is collected
+      -rc <value> | --rowIDColumn <value>
+            Column number (0 based Int) containing the row ID string (optional). Default: 0
+      -ic <value> | --itemIDColumn <value>
+            Column number (0 based Int) containing the item ID string (optional). Default: 1
+      -fc <value> | --filterColumn <value>
+            Column number (0 based Int) containing the filter string (optional). Default: -1 for no filter
+    
+    Using all defaults the input is expected of the form: "userID<tab>itemId" or "userID<tab>itemID<tab>any-text..." and all rows will be used
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list (optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: false.
+    This option is used to output indexable data for creating a search engine recommender.
+    
+    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node (optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+This looks daunting, but the defaults are fairly sane, take exactly the same input as the legacy code, and are pretty flexible. The job allows the user to point to a single text file, a directory full of files, or a tree of directories to be traversed recursively. The files included can be specified with either a regex-style pattern or a filename. The schema for the files is defined by column numbers, which map to the important bits of data, including IDs and values. The files can even contain filters, which allow unneeded rows to be discarded or used for cross-cooccurrence calculations.
+
+See ItemSimilarityDriver.scala in Mahout's spark module if you want to customize the code. 
+
+###Defaults in the _**spark-itemsimilarity**_ CLI
+
+If all defaults are used the input can be as simple as:
+
+    userID1,itemID1
+    userID2,itemID2
+    ...
+
+With the command line:
+
+
+    bash$ mahout spark-itemsimilarity --input in-file --output out-dir
+
+
+This will use the "local" Spark context and will output the standard text version of a DRM
+
+    itemID1<tab>itemID2:value2<space>itemID10:value10...
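+
+As a hedged illustration (this helper is not part of Mahout, just a sketch of reading the default delimiters back in), a line of that text DRM can be split into item/strength pairs like so:
+
+    // parses "itemID1<tab>itemID2:value2 itemID10:value10 ..." written with the default delimiters
+    def parseDrmLine(line: String): (String, Seq[(String, Double)]) = {
+      val parts = line.split("\t", 2)
+      val pairs =
+        if (parts.length < 2 || parts(1).isEmpty) Seq.empty[(String, Double)]
+        else parts(1).split(" ").toSeq.map { elem =>
+          val Array(id, strength) = elem.split(":")
+          id -> strength.toDouble
+        }
+      parts(0) -> pairs
+    }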
+
+###<a name="multiple-actions">How To Use Multiple User Actions</a>
+
+Often we record various actions the user takes for later analytics. These can now be used to make recommendations. 
+The idea of a recommender is to recommend the action you want the user to make. For an ecom app this might be 
+a purchase action. It is usually not a good idea to just treat other actions the same as the action you want to recommend. 
+For instance a view of an item does not indicate the same intent as a purchase and if you just mixed the two together you 
+might even make worse recommendations. It is tempting though since there are so many more views than purchases. With *spark-itemsimilarity*
+we can now use both actions. Mahout will use cross-action cooccurrence analysis to limit the views to ones that do predict purchases.
+We do this by treating the primary action (purchase) as data for the indicator matrix and use the secondary action (view) 
+to calculate the cross-cooccurrence indicator matrix.  
+
+*spark-itemsimilarity* can read separate actions from separate files or from a mixed action log by filtering certain lines. For a mixed 
+action log of the form:
+
+    u1,purchase,iphone
+    u1,purchase,ipad
+    u2,purchase,nexus
+    u2,purchase,galaxy
+    u3,purchase,surface
+    u4,purchase,iphone
+    u4,purchase,galaxy
+    u1,view,iphone
+    u1,view,ipad
+    u1,view,nexus
+    u1,view,galaxy
+    u2,view,iphone
+    u2,view,ipad
+    u2,view,nexus
+    u2,view,galaxy
+    u3,view,surface
+    u3,view,nexus
+    u4,view,iphone
+    u4,view,ipad
+    u4,view,galaxy
+
+###Command Line
+
+
+Use the following options:
+
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \     # where to look for data
+        --output out-path \   # root dir for output
+        --master masterUrl \  # URL of the Spark master server
+        --filter1 purchase \  # word that flags input for the primary action
+        --filter2 view \      # word that flags input for the secondary action
+        --itemIDColumn 2 \    # column that has the item ID
+        --rowIDColumn 0 \     # column that has the user ID
+        --filterColumn 1      # column that has the filter word
+
+
+
+###Output
+
+The output of the job will be the standard text version of two Mahout DRMs. This is a case where we are calculating 
+cross-cooccurrence, so a primary indicator matrix and a cross-cooccurrence indicator matrix will be created:
+
+    out-path
+      |-- similarity-matrix - TDF part files
+      \-- cross-similarity-matrix - TDF part-files
+
+The similarity-matrix will contain the lines:
+
+    galaxy\tnexus:1.7260924347106847
+    ipad\tiphone:1.7260924347106847
+    nexus\tgalaxy:1.7260924347106847
+    iphone\tipad:1.7260924347106847
+    surface
+
+The cross-similarity-matrix will contain:
+
+    iphone\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
+    ipad\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
+    nexus\tnexus:0.6795961471815897 iphone:0.6795961471815897 ipad:0.6795961471815897 galaxy:0.6795961471815897
+    galaxy\tnexus:1.7260924347106847 iphone:1.7260924347106847 ipad:1.7260924347106847 galaxy:1.7260924347106847
+    surface\tsurface:4.498681156950466 nexus:0.6795961471815897
+
+**Note:** You can run this multiple times to use more than two actions or you can use the underlying 
+SimilarityAnalysis.cooccurrence API, which will more efficiently calculate any number of cross-cooccurrence indicators.
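+
+A rough sketch of that call (the exact method name and parameter names vary across Mahout versions, so treat this signature as an assumption and check SimilarityAnalysis in your version's source):
+
+    import org.apache.mahout.math.cf.SimilarityAnalysis
+
+    // drmPurchase is the primary-action DRM and drmView a secondary-action DRM (hypothetical names);
+    // the returned list holds the cooccurrence indicator first, then one cross-cooccurrence
+    // indicator per secondary DRM, all LLR-downsampled
+    val indicatorMatrices = SimilarityAnalysis.cooccurrences(
+      drmPurchase,
+      maxInterestingItemsPerThing = 100,
+      maxNumInteractions = 500,
+      drmBs = Array(drmView))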
+
+###Log File Input
+
+A common method of storing data is in log files. If they are written using some delimiter, they can be consumed directly by *spark-itemsimilarity*. For instance, input of the form:
+
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tpurchase\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tpurchase\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tpurchase\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu1\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu2\tview\trandom text\tgalaxy
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tsurface
+    2014-06-23 14:46:53.115\tu3\tview\trandom text\tnexus
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tiphone
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tipad
+    2014-06-23 14:46:53.115\tu4\tview\trandom text\tgalaxy    
+
+This can be parsed with the following CLI and run on the cluster, producing the same output as the example above:
+
+    bash$ mahout spark-itemsimilarity \
+        --input in-file \
+        --output out-path \
+        --master spark://sparkmaster:4044 \
+        --filter1 purchase \
+        --filter2 view \
+        --inDelim "\t" \
+        --itemIDColumn 4 \
+        --rowIDColumn 1 \
+        --filterColumn 2
+
+##2. spark-rowsimilarity
+
+*spark-rowsimilarity* is the companion to *spark-itemsimilarity*; the primary difference is that it takes a text file version of 
+a matrix of sparse vectors with optional application-specific IDs and finds similar rows rather than items (columns). Its use is 
+not limited to collaborative filtering. The input is text-delimited, using three delimiters. By 
+default it reads (rowID&lt;tab>columnID1:strength1&lt;space>columnID2:strength2...). Since this job only supports LLR similarity,
+ which does not use the input strengths, they may be omitted in the input. It writes 
+(rowID&lt;tab>rowID1:strength1&lt;space>rowID2:strength2...). 
+The output is sorted by strength, descending. The output can be interpreted as a row ID from the primary input followed 
+by a list of the most similar rows.
+
+The command line interface is:
+
+    spark-rowsimilarity Mahout 1.0
+    Usage: spark-rowsimilarity [options]
+    
+    Input, output options
+      -i <value> | --input <value>
+            Input path, may be a filename, directory name, or comma delimited list of HDFS supported URIs (required)
+      -o <value> | --output <value>
+            Path for output, any local or HDFS supported URI (required)
+    
+    Algorithm control options:
+      -mo <value> | --maxObservations <value>
+            Max number of observations to consider per row (optional). Default: 500
+      -m <value> | --maxSimilaritiesPerRow <value>
+            Limit the number of similarities per item to this number (optional). Default: 100
+    
+    Note: Only the Log Likelihood Ratio (LLR) is supported as a similarity measure.
+    
+    Output text file schema options:
+      -rd <value> | --rowKeyDelim <value>
+            Separates the rowID key from the vector values list (optional). Default: "\t"
+      -cd <value> | --columnIdStrengthDelim <value>
+            Separates column IDs from their values in the vector values list (optional). Default: ":"
+      -td <value> | --elementDelim <value>
+            Separates vector element values in the values list (optional). Default: " "
+      -os | --omitStrength
+            Do not write the strength to the output files (optional), Default: false.
+    This option is used to output indexable data for creating a search engine recommender.
+    
+    Default delimiters will produce output of the form: "itemID1<tab>itemID2:value2<space>itemID10:value10..."
+    
+    File discovery options:
+      -r | --recursive
+            Searched the -i path recursively for files that match --filenamePattern (optional), Default: false
+      -fp <value> | --filenamePattern <value>
+            Regex to match in determining input files (optional). Default: filename in the --input option or "^part-.*" if --input is a directory
+    
+    Spark config options:
+      -ma <value> | --master <value>
+            Spark Master URL (optional). Default: "local". Note that you can specify the number of cores to get a performance improvement, for example "local[4]"
+      -sem <value> | --sparkExecutorMem <value>
+            Max Java heap available as "executor memory" on each node (optional). Default: 4g
+      -rs <value> | --randomSeed <value>
+            
+      -h | --help
+            prints this usage text
+
+See RowSimilarityDriver.scala in Mahout's spark module if you want to customize the code. 
+
+#3. Using *spark-rowsimilarity* with Text Data
+
+Another use case for *spark-rowsimilarity* is finding similar textual content. For instance, given the tags associated with 
+a blog post, find other posts with similar tags. In this case the columns are tags and the rows are posts. Since LLR is 
+the only similarity method supported, this is not the optimal way to determine general "bag-of-words" document similarity; 
+LLR is used more as a quality filter than as a similarity measure. However, *spark-rowsimilarity* will produce 
+lists of similar docs for every doc if the input is docs with lists of terms. The Apache [Lucene](http://lucene.apache.org) project provides several methods of [analyzing and tokenizing](http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/analysis/package-summary.html#package_description) documents.
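+
+As a minimal, hedged sketch (a naive stand-in for a real Lucene analyzer; the function is hypothetical and not part of Mahout), one way to turn a document into a *spark-rowsimilarity* input line is:
+
+    // naive lowercase/alphanumeric tokenizer producing "docID<tab>term1 term2 ..."
+    def toRowSimilarityLine(docId: String, text: String): String = {
+      val terms = text.toLowerCase.split("[^a-z0-9]+").filter(_.nonEmpty).distinct
+      docId + "\t" + terms.mkString(" ")
+    }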
+
+#<a name="unified-recommender">4. Creating a Multimodal Recommender</a>
+
+Using the output of *spark-itemsimilarity* and *spark-rowsimilarity* you can build a multimodal cooccurrence and content-based
+ recommender that can be used in both or either mode, depending on the indicators available and the history available at 
+runtime for a user. Some slides describing this method can be found [here](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/).
+
+##Requirements
+
+1. Mahout 1.0-SNAPSHOT or later
+2. Hadoop
+3. Spark, the correct version for your version of Mahout and Hadoop
+4. A search engine like Solr or Elasticsearch
+
+##Indicators
+
+Indicators come in 3 types:
+
+1. **Cooccurrence**: calculated with *spark-itemsimilarity* from user actions
+2. **Content**: calculated from item metadata or content using *spark-rowsimilarity*
+3. **Intrinsic**: assigned to items as metadata. Can be anything that describes the item.
+
+The query for recommendations will be a mix of values meant to match one of your indicators. The query can be constructed 
+from user history and values derived from context (category being viewed for instance) or special precalculated data 
+(popularity rank for instance). This blending of indicators allows for creating many flavors of recommendations to fit 
+a very wide variety of circumstances.
+
+With the right mix of indicators developers can construct a single query that works for completely new items and new users 
+while working well for items with lots of interactions and users with many recorded actions. In other words by adding in content and intrinsic 
+indicators developers can create a solution for the "cold-start" problem that gracefully improves with more user history
+and as items have more interactions. It is also possible to create a completely content-based recommender that personalizes 
+recommendations.
+
+##Example with 3 Indicators
+
+You will need to decide how you store user action data so it can be processed by the item and row similarity jobs; 
+this is most easily done by using text files as described above. The data that is processed by these jobs is considered the 
+training data. You will need some amount of user history in your recs query. It is typical to use the most recent user history, 
+but it need not be exactly what is in the training set, which may include a greater volume of historical data. Keeping the user 
+history for query purposes could be done with a database by storing it in a users table. In the example above the two 
+collaborative filtering actions are "purchase" and "view", but let's also add tags (taken from catalog categories or other 
+descriptive metadata). 
+
+We will need to create 1 cooccurrence indicator from the primary action (purchase), 1 cross-action cooccurrence indicator 
+from the secondary action (view), and 1 content indicator (tags). We'll have to run *spark-itemsimilarity* once and 
+*spark-rowsimilarity* once.
+
+We have described how to create the collaborative filtering indicators for purchase and view (the [How to use Multiple User 
+Actions](#multiple-actions) section), but tags will be a slightly different process. We want to use the fact that 
+certain items have tags similar to the ones associated with a user's purchases. This is not a collaborative filtering indicator 
+but rather a "content" or "metadata" type indicator, since you are not using other users' history, only that of the 
+individual you are making recs for. This means that this method will make recommendations for items that have 
+no collaborative filtering data, as happens with new items in a catalog. New items may have tags assigned but no one
+ has purchased or viewed them yet. In the final query we will mix all 3 indicators.
+
+##Content Indicator
+
+To create a content-indicator we'll make use of the fact that the user has purchased items with certain tags. We want to find 
+items with the most similar tags. Notice that other users' behavior is not considered--only other items' tags. This defines a 
+content or metadata indicator. They are used when you want to find items that are similar to other items by using their 
+content or metadata, not by which users interacted with them.
+
+**Note**: It may be advisable to treat tags as cross-cooccurrence indicators but for the sake of an example they are treated here as content only.
+
+For this we need input of the form:
+
+    itemID<tab>list-of-tags
+    ...
+
+The full collection will look like the tags column from a catalog DB. For our ecom example it might be:
+
+    3459860b<tab>men long-sleeve chambray clothing casual
+    9446577d<tab>women tops chambray clothing casual
+    ...
+
+We'll use *spark-rowsimilarity* because we are looking for similar rows, which encode items in this case. As with the 
+collaborative filtering indicators, we use the --omitStrength option. The strengths created are 
+probabilistic log-likelihood ratios and so are used to filter unimportant similarities. Once the filtering or downsampling 
+is finished we no longer need the strengths. We will get an indicator matrix of the form:
+
+    itemID<tab>list-of-item IDs
+    ...
+
+This is a content indicator since it has found other items with similar content or metadata.
+
+    3459860b<tab>3459860b 3459860b 6749860c 5959860a 3434860a 3477860a
+    9446577d<tab>9446577d 9496577d 0943577d 8346577d 9442277d 9446577e
+    ...  
+    
+We now have three indicators, two collaborative filtering type and one content type.
+
+##Multimodal Recommender Query
+
+The actual form of the query for recommendations will vary depending on your search engine but the intent is the same. For a given user, map their history of an action or content to the correct indicator field and perform an OR'd query. 
+
+We have 3 indicators; these are indexed by the search engine into 3 fields, which we'll call "purchase", "view", and "tags". 
+We take the user's history that corresponds to each indicator and create a query of the form:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's view-history
+      field: tags; q:user's-tags-associated-with-purchases
+      
+The query will result in an ordered list of items recommended for purchase but skewed towards items with similar tags to 
+the ones the user has already purchased. 
+
+This is only an example and not necessarily the optimal way to create recs. It illustrates how business decisions can be 
+translated into recommendations. This technique can be used to skew recommendations towards intrinsic indicators also. 
+For instance you may want to put personalized popular item recs in a special place in the UI. Create a popularity indicator 
+by tagging items with some category of popularity (hot, warm, cold for instance) then
+index that as a new indicator field and include the corresponding value in a query 
+on the popularity field. If we use the ecom example but use the query to get "hot" recommendations it might look like this:
+
+    Query:
+      field: purchase; q:user's-purchase-history
+      field: view; q:user's view-history
+      field: popularity; q:"hot"
+
+This will return recommendations favoring ones that have the intrinsic indicator "hot".
+
+##Notes
+1. Use as much user action history as you can gather. Choose a primary action that is closest to what you want to recommend and the others will be used to create cross-cooccurrence indicators. Using more data in this fashion will almost always produce better recommendations.
+2. Content can be used where there is no recorded user behavior or when items change too quickly to get much interaction history. They can be used alone or mixed with other indicators.
+3. Most search engines support "boost" factors so you can favor one or more indicators. In the example query, if you want tags to only have a small effect you could boost the CF indicators.
+4. In the examples we have used space delimited strings for lists of IDs in indicators and in queries. It may be better to use arrays of strings if your storage system and search engine support them. For instance Solr allows multi-valued fields, which correspond to arrays.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/mailing-lists,-irc-and-archives.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/mailing-lists,-irc-and-archives.md b/website/old_site_migration/completed/mailing-lists,-irc-and-archives.md
new file mode 100644
index 0000000..e3862ca
--- /dev/null
+++ b/website/old_site_migration/completed/mailing-lists,-irc-and-archives.md
@@ -0,0 +1,75 @@
+---
+layout: default
+title: Mailing Lists, IRC and Archives
+theme:
+    name: retro-mahout
+---
+
+# General
+
+Communication at Mahout happens primarily online via mailing lists. We have
+a user as well as a dev list for discussion. In addition there is a commit
+list so we are able to monitor what happens on the wiki and in svn.
+
+<a name="MailingLists,IRCandArchives-Mailinglists"></a>
+# Mailing lists
+
+<a name="MailingLists,IRCandArchives-MahoutUserList"></a>
+## Mahout User List
+
+This list is for users of Mahout to ask questions, share knowledge, and
+discuss issues. Do send mail to this list with usage and configuration
+questions and problems. Also, please send questions to this list to verify
+your problem before filing issues in JIRA. 
+
+* [Subscribe](mailto:user-subscribe@mahout.apache.org)
+* [Unsubscribe](mailto:user-unsubscribe@mahout.apache.org)
+
+<a name="MailingLists,IRCandArchives-MahoutDeveloperList"></a>
+## Mahout Developer List
+
+This is the list where participating developers of the Mahout project meet
+and discuss issues concerning Mahout internals, code changes/additions,
+etc. Do not send mail to this list with usage questions or configuration
+questions and problems. 
+
+Discussion list: 
+
+* [Subscribe](mailto:dev-subscribe@mahout.apache.org)
+ -- Do not send mail to this list with usage questions or configuration
+questions and problems. 
+* [Unsubscribe](mailto:dev-unsubscribe@mahout.apache.org)
+
+Commit notifications: 
+
+* [Subscribe](mailto:commits-subscribe@mahout.apache.org)
+* [Unsubscribe](mailto:commits-unsubscribe@mahout.apache.org)
+
+<a name="MailingLists,IRCandArchives-IRC"></a>
+# IRC
+
+Mahout's IRC channel is **#mahout**.  It is a logged channel.  Please keep in
+mind that it is for discussion purposes only and that (pseudo)decisions
+should be brought back to the dev@ mailing list or JIRA, and other people
+who are not on IRC should be given time to respond before any work is
+committed.
+
+<a name="MailingLists,IRCandArchives-Archives"></a>
+# Archives
+
+<a name="MailingLists,IRCandArchives-OfficialApacheArchive"></a>
+## Official Apache Archive
+
+* [http://mail-archives.apache.org/mod_mbox/mahout-dev/](http://mail-archives.apache.org/mod_mbox/mahout-dev/)
+* [http://mail-archives.apache.org/mod_mbox/mahout-user/](http://mail-archives.apache.org/mod_mbox/mahout-user/)
+
+<a name="MailingLists,IRCandArchives-ExternalArchives"></a>
+## External Archives
+
+* [MarkMail](http://mahout.markmail.org/)
+* [Gmane](http://dir.gmane.org/gmane.comp.apache.mahout.user)
+
+Please note that the inclusion of a link to an archive does not imply an
+endorsement of that company by any of the committers of Mahout, the Lucene
+PMC, or the Apache Software Foundation. Each archive owner is solely
+responsible for the contents and availability of their archive.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/out-of-core-reference.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/out-of-core-reference.md b/website/old_site_migration/completed/out-of-core-reference.md
new file mode 100644
index 0000000..8b2eb75
--- /dev/null
+++ b/website/old_site_migration/completed/out-of-core-reference.md
@@ -0,0 +1,318 @@
+---
+layout: default
+title: Mahout-Samsara's Distributed Linear Algebra DSL Reference
+theme:
+   name: retro-mahout
+---
+
+# Mahout-Samsara's Distributed Linear Algebra DSL Reference
+
+**Note: this page is meant only as a quick reference to Mahout-Samsara's R-Like DSL semantics.  For more information, including information on Mahout-Samsara's Algebraic Optimizer please see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf).**
+
+The subjects of this reference are solely applicable to Mahout-Samsara's **DRM** (distributed row matrix).
+
+In this reference, DRMs will be denoted as e.g. `A`, and in-core matrices as e.g. `inCoreA`.
+
+#### Imports 
+
+The following imports are used to enable seamless in-core and distributed algebraic DSL operations:
+
+    import org.apache.mahout.math._
+    import scalabindings._
+    import RLikeOps._
+    import drm._
+    import RLikeDRMOps._
+    
+If working with mixed scala/java code:
+    
+    import collection._
+    import JavaConversions._
+    
+If you are working with Mahout-Samsara's Spark-specific operations e.g. for context creation:
+
+    import org.apache.mahout.sparkbindings._
+    
+The Mahout shell does all of these imports automatically.
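+
+Outside the shell, a Spark-backed distributed context is typically created first (a hedged sketch; the helper comes from the sparkbindings package above, but check the exact signature in your Mahout version):
+
+    // assumption: the masterUrl/appName parameter names match your Mahout version
+    implicit val mahoutCtx = mahoutSparkContext(
+      masterUrl = "local[2]",
+      appName = "MahoutSamsaraExample")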
+
+
+#### DRM Persistence operators
+
+**Mahout-Samsara's DRM persistence to HDFS is compatible with all Mahout-MapReduce algorithms such as seq2sparse.**
+
+
+Loading a DRM from (HD)FS:
+
+    drmDfsRead(path = hdfsPath)
+     
+Parallelizing from an in-core matrix:
+
+    val inCoreA = dense((1, 2, 3), (3, 4, 5))
+    val A = drmParallelize(inCoreA)
+    
+Creating an empty DRM:
+
+    val A = drmParallelizeEmpty(100, 50)
+    
+Collecting to driver's jvm in-core:
+
+    val inCoreA = A.collect
+    
+**Warning: The collection of distributed matrices happens implicitly whenever conversion to an in-core (o.a.m.math.Matrix) type is required. E.g.:**
+
+    val inCoreA: Matrix = ...
+    val drmB: DrmLike[Int] =...
+    val inCoreC: Matrix = inCoreA %*%: drmB
+    
+**implies (inCoreA %*%: drmB).collect**
+
+Collecting to (HD)FS as a Mahout's DRM formatted file:
+
+    A.dfsWrite(path = hdfsPath)
+    
+#### Logical algebraic operators on DRM matrices:
+
+A logical set of operators are defined for distributed matrices as a subset of those defined for in-core matrices.  In particular, since all distributed matrices are immutable, there are no assignment operators (e.g. **A += B**).
+*Note: please see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) for information on Mahout-Samsara's Algebraic Optimizer, and translation from logical operations to a physical plan for the back end.*
+ 
+    
+Cache a DRM and trigger an optimized physical plan: 
+
+    drmA.checkpoint(CacheHint.MEMORY_AND_DISK)
+   
+Other valid caching Instructions:
+
+    drmA.checkpoint(CacheHint.NONE)
+    drmA.checkpoint(CacheHint.DISK_ONLY)
+    drmA.checkpoint(CacheHint.DISK_ONLY_2)
+    drmA.checkpoint(CacheHint.MEMORY_ONLY)
+    drmA.checkpoint(CacheHint.MEMORY_ONLY_2)
+    drmA.checkpoint(CacheHint.MEMORY_ONLY_SER)
+    drmA.checkpoint(CacheHint.MEMORY_ONLY_SER_2)
+    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_2)
+    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER)
+    drmA.checkpoint(CacheHint.MEMORY_AND_DISK_SER_2)
+
+*Note: Logical DRM operations are lazily computed.  Currently the actual computations and optional caching will be triggered by dfsWrite(...), collect(...) and blockify(...).*
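+
+For example (a minimal sketch using operators described below), declaring an expression only builds a logical plan; the optimizer and the actual computation run when the result is written or collected:
+
+    val drmC = drmA.t %*% drmA       // logical plan only, nothing is computed yet
+    drmC.dfsWrite(path = hdfsPath)   // optimization and computation are triggered here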
+
+
+
+Transposition:
+
+    A.t
+ 
+Elementwise addition *(Matrices of identical geometry and row key types)*:
+  
+    A + B
+
+Elementwise subtraction *(Matrices of identical geometry and row key types)*:
+
+    A - B
+    
+Elementwise multiplication (Hadamard) *(Matrices of identical geometry and row key types)*:
+
+    A * B
+    
+Elementwise division *(Matrices of identical geometry and row key types)*:
+
+    A / B
+    
+**Elementwise operations involving one in-core argument (int-keyed DRMs only)**:
+
+    A + inCoreB
+    A - inCoreB
+    A * inCoreB
+    A / inCoreB
+    A :+ inCoreB
+    A :- inCoreB
+    A :* inCoreB
+    A :/ inCoreB
+    inCoreA +: B
+    inCoreA -: B
+    inCoreA *: B
+    inCoreA /: B
+
+Note the Spark associativity change (e.g. `A *: inCoreB` means `B.leftMultiply(A`), same as when both arguments are in core). Whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.
+    
+**Matrix-matrix multiplication %*%**:
+
+`\(\mathbf{M}=\mathbf{AB}\)`
+
+    A %*% B
+    A %*% inCoreB
+    A %*% inCoreDiagonal
+    A %*%: B
+
+
+*Note: same as above, whenever operator arguments include both in-core and out-of-core arguments, the operator can only be associated with the out-of-core (DRM) argument to support the distributed implementation.*
+ 
+**Matrix-vector multiplication %*%**
+Currently we support the right-multiply product of a DRM and an in-core Vector (`\(\mathbf{Ax}\)`), resulting in a single-column DRM, which can then be collected to the driver (usually the desired outcome):
+
+    val Ax = A %*% x
+    val inCoreX = Ax.collect(::, 0)
+    
+
+**Matrix-scalar +,-,*,/**
+Elementwise operations of every matrix element and a scalar:
+
+    A + 5.0
+    A - 5.0
+    A :- 5.0
+    5.0 -: A
+    A * 5.0
+    A / 5.0
+    5.0 /: A
+    
+Note that `5.0 -: A` means `\(m_{ij} = 5 - a_{ij}\)` and `5.0 /: A` means `\(m_{ij} = \frac{5}{a_{ij}}\)` for all elements of the result.
+    
+    
+#### Slicing
+
+General slice:
+
+    A(100 to 200, 100 to 200)
+    
+Horizontal Block:
+
+    A(::, 100 to 200)
+    
+Vertical Block:
+
+    A(100 to 200, ::)
+    
+*Note: if the row range is not all-range (::) then the DRM must be `Int`-keyed.  General-case row slicing is not supported by DRMs with key types other than `Int`*.
+
+
+#### Stitching
+
+Stitch side by side (cbind R semantics):
+
+    val drmAnextToB = drmA cbind drmB
+    
+Stitch side by side (Scala):
+
+    val drmAnextToB = drmA.cbind(drmB)
+    
+Analogously, vertical concatenation is available via **rbind**
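+
+For example, mirroring the cbind forms above:
+
+    val drmAonTopOfB = drmA rbind drmB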
+
+#### Custom pipelines on blocks
+Internally, Mahout-Samsara's DRM is represented as a distributed set of vertical (Key, Block) tuples.
+
+**drm.mapBlock(...)**:
+
+The DRM operator `mapBlock` provides transformational access to the distributed vertical blockified tuples of a matrix (Row-Keys, Vertical-Matrix-Block).
+
+Using `mapBlock` to add 1.0 to a DRM:
+
+    val inCoreA = dense((1, 2, 3), (2, 3, 4), (3, 4, 5))
+    val drmA = drmParallelize(inCoreA)
+    val drmB = drmA.mapBlock() {
+        case (keys, block) => keys -> (block += 1.0)
+    }
+    
+#### Broadcasting Vectors and matrices to closures
+Generally we can create and use one-way closure attributes to be used on the back end.
+
+Scalar matrix multiplication:
+
+    val factor: Int = 15
+    val drm2 = drm1.mapBlock() {
+        case (keys, block) => block *= factor
+        keys -> block
+    }
+
+**Closure attributes must be java-serializable. Currently Mahout's in-core Vectors and Matrices are not java-serializable, and must be broadcast to the closure using `drmBroadcast(...)`**:
+
+    val v: Vector = ...
+    val bcastV = drmBroadcast(v)
+    val drm2 = drm1.mapBlock() {
+        case (keys, block) =>
+            for(row <- 0 until block.nrow) block(row, ::) -= bcastV
+        keys -> block    
+    }
+
+#### Computations providing ad-hoc summaries
+
+
+Matrix cardinality:
+
+    drmA.nrow
+    drmA.ncol
+
+*Note: depending on the stage of optimization, these may trigger a computational action.  I.e. if one calls `nrow()` n times, then the back end will actually recompute `nrow` n times.*
+    
+Means and sums:
+
+    drmA.colSums
+    drmA.colMeans
+    drmA.rowSums
+    drmA.rowMeans
+    
+ 
+*Note: These will always trigger a computational action.  I.e. if one calls `colSums()` n times, then the back end will actually recompute `colSums` n times.*
+
+#### Distributed Matrix Decompositions
+
+To import the decomposition package:
+    
+    import org.apache.mahout.math._
+    import decompositions._
+    
+Distributed thin QR:
+
+    val (drmQ, incoreR) = dqrThin(drmA)
+    
+Distributed SSVD:
+ 
+    val (drmU, drmV, s) = dssvd(drmA, k = 40, q = 1)
+    
+Distributed SPCA:
+
+    val (drmU, drmV, s) = dspca(drmA, k = 30, q = 1)
+
+Distributed regularized ALS:
+
+    val (drmU, drmV, i) = dals(drmA,
+                            k = 50,
+                            lambda = 0.0,
+                            maxIterations = 10,
+                            convergenceThreshold = 0.10)
+                            
+#### Adjusting parallelism of computations
+
+Set the minimum parallelism to 100 for computations on `drmA`:
+
+    drmA.par(min = 100)
+ 
+Set the exact parallelism to 100 for computations on `drmA`:
+
+    drmA.par(exact = 100)
+
+
+Set the engine specific automatic parallelism adjustment for computations on `drmA`:
+
+    drmA.par(auto = true)
+
+#### Retrieving the engine specific data structure backing the DRM:
+
+**A Spark RDD:**
+
+    val myRDD = drmA.checkpoint().rdd
+    
+**An H2O Frame and Key Vec:**
+
+    val myFrame = drmA.frame
+    val myKeys = drmA.keys
+    
+**A Flink DataSet:**
+
+    val myDataSet = drmA.ds
+    
+For more information including information on Mahout-Samsara's Algebraic Optimizer and in-core Linear algebra bindings see: [Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines](http://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf)
+
+
+
+    
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/privacy-policy.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/privacy-policy.md b/website/old_site_migration/completed/privacy-policy.md
new file mode 100644
index 0000000..bc10929
--- /dev/null
+++ b/website/old_site_migration/completed/privacy-policy.md
@@ -0,0 +1,28 @@
+---
+layout: default
+title: Privacy Policy
+theme:
+    name: retro-mahout
+---
+Information about your use of this website is collected using server access
+logs and a tracking cookie. The collected information consists of the
+following:
+
+* The IP address from which you access the website;
+* The type of browser and operating system you use to access our site;
+* The date and time you access our site;
+* The pages you visit; and
+* The addresses of pages from where you followed a link to our site.
+
+Part of this information is gathered using a tracking cookie set by the
+Google Analytics service and handled by Google as described in their
+privacy policy. See your browser documentation for instructions on how to
+disable the cookie if you prefer not to share this data with Google.
+
+We use the gathered information to help us make our site more useful to
+visitors and to better understand how and when our site is used. We do not
+track or collect personally identifiable information or associate gathered
+data with any personally identifying information from other sources.
+
+By using this website, you consent to the collection of this data in the
+manner and for the purpose described above.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/quickstart.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/quickstart.md b/website/old_site_migration/completed/quickstart.md
new file mode 100644
index 0000000..b84a245
--- /dev/null
+++ b/website/old_site_migration/completed/quickstart.md
@@ -0,0 +1,59 @@
+---
+layout: default
+title: Quickstart
+theme:
+    name: retro-mahout
+---
+
+# Mahout MapReduce Overview
+
+## Getting Mahout
+
+#### Download the latest release
+
+Download the latest release [here](http://www.apache.org/dyn/closer.cgi/mahout/).
+
+Or checkout the latest code from [here](http://mahout.apache.org/developers/version-control.html)
+
+#### Alternatively: Add Mahout 0.10.0 to a maven project
+
+Mahout is also available via a [maven repository](http://mvnrepository.com/artifact/org.apache.mahout) under the group id *org.apache.mahout*.
+If you would like to import the latest release of Mahout into a Java project, add the following dependency to your *pom.xml*:
+
+    <dependency>
+        <groupId>org.apache.mahout</groupId>
+        <artifactId>mahout-mr</artifactId>
+        <version>0.10.0</version>
+    </dependency>
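+
+If your build uses sbt rather than Maven, the same coordinates (a hedged equivalent, assuming the same artifact) can be declared as:
+
+    libraryDependencies += "org.apache.mahout" % "mahout-mr" % "0.10.0"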
+
+
+## Features
+
+For a full list of Mahout's features see our [Features by Engine](http://mahout.apache.org/users/basics/algorithms.html) page.
+
+    
+## Using Mahout
+
+Mahout provides a number of examples and tutorials to help users quickly learn how to use its machine learning algorithms.
+
+#### Recommendations
+
+Check the [Recommender Quickstart](/users/recommender/quickstart.html) or the tutorial on [creating a user-based recommender in 5 minutes](/users/recommender/userbased-5-minutes.html).
+
+If you are building a recommender system for the first time, please also refer to a list of [Dos and Don'ts](/users/recommender/recommender-first-timer-faq.html) that might be helpful.
+
+#### Clustering
+
+Check the [Synthetic data](/users/clustering/clustering-of-synthetic-control-data.html) example.
+
+#### Classification
+
+If you are interested in how to train a **Naive Bayes** model, look at the [20 newsgroups](/users/classification/twenty-newsgroups.html) example.
+
+If you plan to build a **Hidden Markov Model** for speech recognition, the example [here](/users/classification/hidden-markov-models.html) might be instructive. 
+
+Or you could build a **Random Forest** model by following this [quick start page](/users/classification/partial-implementation.html).
+
+#### Working with Text 
+
+If you need to convert raw text into word vectors as input to clustering or classification algorithms, please refer to this page on [how to create vectors from text](/users/basics/creating-vectors-from-text.html).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/release-notes.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/release-notes.md b/website/old_site_migration/completed/release-notes.md
new file mode 100644
index 0000000..bd41ebb
--- /dev/null
+++ b/website/old_site_migration/completed/release-notes.md
@@ -0,0 +1,242 @@
+---
+layout: default
+title: Release Notes
+theme:
+    name: retro-mahout
+---
+
+# Release Notes
+
+#### 11 April 2016 - Apache Mahout 0.12.0 released
+
+This release marks a major milestone for the “Samsara” environment’s goal
+of providing an engine neutral math platform by now supporting Apache Flink.
+While still experimental, the Mahout Flink bindings now offer all of the R-Like
+semantics for linear algebra operations, matrix decompositions, 
+and algorithms of the “Samsara” platform for execution on a Flink back-end.
+
+This release gives users of Apache Flink out of the box access to the following features (and more):
+
+<ol>
+<li>The Mahout Distributed Row Matrix (DRM) API.</li>
+<li>Distributed and local Vector and Matrix algebra routines.</li>
+<li>Distributed and local Stochastic Principal Component Analysis.</li>
+<li>Distributed and local Stochastic Singular Value Decomposition.</li>
+<li>Distributed and local Thin QR Decomposition.</li>
+<li>Collaborative Filtering.</li>
+<li>Naive Bayes Classification.</li>
+<li>Matrix operations (only listing a few here):
+<ol>
+<li>Mahout-native blockified distributed Matrix map and allreduce routines.</li>
+<li>Distributed data point (row) sampling.</li>
+<li>Matrix/Matrix Squared Distance.</li>
+<li>Element-wise log.</li>
+<li>Element-wise roots.</li>
+<li>Element-wise Matrix/Matrix addition, subtraction, division and multiplication.</li>
+<li>Functional Matrix value assignment.</li>
+<li>A familiar Scala-based R-like DSL.</li>
+</ol>
+</li>
+</ol>
+
+#### 11 March 2016 - Apache Mahout 0.11.2 released
+
+This is a minor release over Mahout 0.11.1 meant to introduce major
+performance enhancements with sparse matrix and vector computations, and
+major performance optimizations to the Samsara DSL.  Mahout 0.11.2 includes
+all new features and bug fixes released in Mahout versions 0.11.0 and
+0.11.1.
+
+Highlights include:
+
+* Spark 1.5.2 support
+*   Performance improvements of over 30% on Sparse Vector and Matrix
+   computations leveraging the ‘fastutil’ library -  contribution from
+   Sebastiano Vigna. This speeds up all in-core sparse vector and matrix
+   computations.
+
+
+#### 06 November 2015 - Apache Mahout 0.11.1 released
+
+This is a minor release over Mahout 0.11.0 meant to expand Mahout’s
+compatibility with Spark versions, to introduce some new features and to
+fix some bugs.  Mahout 0.11.1 includes all new features and bug fixes
+released in Mahout versions 0.11.0 and earlier.
+
+Highlights include:
+
+* Spark 1.4+ support
+* 4x Performance improvement in Dot Product over Dense Vectors (https://issues.apache.org/jira/browse/MAHOUT-1781)
+
+
+#### 07 August 2015 - Apache Mahout 0.11.0 released
+
+Mahout 0.11.0 includes all new features and bugfixes released in Mahout versions 0.10.1
+and 0.10.2 along with support for Spark 1.3+.
+
+Highlights include:
+
+* Spark 1.3 support
+* Fixes for a major memory usage bug in co-occurrence analysis used by the driver spark-itemsimilarity. This will now require far less memory in the executor.
+* Some minor fixes to Mahout-Samsara QR Decomposition and matrix ops.
+* All of the Mahout Samsara fixes from 0.10.2 Release
+
+
+#### 06 August 2015 - Apache Mahout 0.10.2 released
+
+Highlights include:
+
+* In-core transpose view rewrites. Modifiable transpose views, e.g. (for (col <- a.t) col := 5).
+* Performance and parallelization improvements for AB', A'B, A'A spark physical operators.
+* Optional structural "flavor" abstraction for in-core matrices. In-core matrices can now be tagged as e.g. sparse or dense.
+* %*% optimization based on matrix flavors.
+* In-core ::= sparse assignment functions.
+* Assign := optimization (do proper traversal based on matrix flavors, similarly to %*%).
+* Adding in-place elementwise functional assignment (e.g. mxA := exp _, mxA ::= exp _).
+* Distributed and in-core version of simple elementwise analogues of scala.math._. for example, for log(x) the convention is dlog(drm), mlog(mx), vlog(vec). Unfortunately we cannot overload these functions over what is done in scala.math, i.e. scala would not allow log(mx) or log(drm) and log(Double) at the same time, mainly because they are being defined in different packages.
+* Distributed and in-core first and second moment routines. R analogs: mean(), colMeans(), rowMeans(), variance(), sd(). By convention, distributed versions are prepended by (d) letter: colMeanVars() colMeanStdevs() dcolMeanVars() dcolMeanStdevs().
+* Distance and squared distance matrix routines. R analog: dist(). Provide both squared and non-squared Euclidean distance matrices. By convention, distributed versions are prepended by (d) letter: dist(x), sqDist(x), dsqDist(x). Also a variation for pair-wise distance matrix of two different inputs x and y: sqDist(x,y), dsqDist(x,y). 
+* DRM row sampling api.
+* Distributed performance bug fixes. This relates mostly to (a) matrix multiplication deficiencies, and (b) handling parallelism.
+* Distributed engine neutral allreduceBlock() operator api for Spark and H2O.
+* Distributed optimizer operators for elementwise functions. Rewrites recognizing e.g. 1+ drmX * dexp(drmX) as a single fused elementwise physical operator: elementwiseFunc(f1(f2(drmX)) where f1 = 1 + x and f2 = exp(x).
+* More cbind, rbind flavors (e.g. 1 cbind mxX, 1 cbind drmX or the other way around) for Spark and H2O.
+* Added +=: and *=: operators on vectors.
+* Closeable API for broadcast tensors.
+* Support for conversion of any type-keyed DRM into ordinally-keyed DRM.
+* Scala logging style. 
+* rowSumsMap() summary for non-int-keyed DRMs.
+* elementwise power operator ^ . 
+* R-like vector concatenation operator. 
+* In-core functional assignments e.g.: mxA := { (x) => x * x}. 
+* Straighten out behavior of Matrix.iterator() and iterateNonEmpty().
+* New mutable transposition view for in-core matrices. The in-core matrix transpose view was rewritten with two main goals in mind: (1) enable mutability, e.g. for (col <- mxA.t) col := k; (2) translate matrix structural flavor for optimizers correctly, i.e. the new SparseRowMatrix.t carries on as a column-major structure.
+* Native support for kryo serialization of tensor types.
+* Deprecation of the MultiLayerPerceptron, ConcatenateVectorsJob and all related classes.
+* Deprecation of SparseColumnMatrix.
+
+#### 31 May 2015 - Apache Mahout 0.10.1 released
+
+Highlights include:
+
+* Major memory use improvements in  cooccurrence analysis including the spark-itemsimilarity driver [MAHOUT-1707](https://issues.apache.org/jira/browse/MAHOUT-1707)
+* Support for Spark version 1.2.2 or less.
+* Some minor fixes to Mahout-Samsara QR Decomposition and matrix ops.
+* Trimmed down package size to < 200MB (MAHOUT-1704 and MAHOUT-1706)
+* Minor testing indicates binary compatibility with Spark 1.3 with the exception of the Mahout Shell.
+
+#### 11 April 2015 - Apache Mahout 0.10.0 released
+
+Mahout 0.10.0 was a major release, which separates out an ML environment (we call Mahout-Samsara) including an 
+extended version of Scala that is largely backend independent but runs fully on Spark. The Hadoop MapReduce versions of 
+Mahout algorithms are still maintained but no new MapReduce contributions are accepted. From this release onwards 
+contributions must be Mahout Samsara based or at least run on Spark. 
+
+Highlights include:
+
+New Mahout Samsara Environment
+
+* Distributed Algebraic optimizer
+* R-Like DSL Scala API
+* Linear algebra operations
+* Ops are extensions to Scala
+* Scala REPL based interactive shell running on Spark
+* Integrates with compatible libraries like MLlib
+* Run on distributed Spark
+* H2O in progress
+
+New Mahout Samsara based Algorithms
+
+* Stochastic Singular Value Decomposition (ssvd, dssvd)
+* Stochastic Principal Component Analysis (spca, dspca)
+* Distributed Cholesky QR (thinQR)
+* Distributed regularized Alternating Least Squares (dals)
+* Collaborative Filtering: Item and Row Similarity
+* Naive Bayes Classification
+* Distributed and in-core
+
+Changes in 0.10.0 are detailed <a href="https://github.com/apache/mahout/blob/mahout-0.10.0/CHANGELOG">here</a>
+
+#### 1 February 2014 - Apache Mahout 0.9 released
+
+  <p>Highlights include:</p>
+
+  <ul>
+   <li>New and improved Mahout website based on Apache CMS - <a href="https://issues.apache.org/jira/browse/MAHOUT-1245">MAHOUT-1245</a></li>
+   <li>Early implementation of a Multi Layer Perceptron (MLP) classifier - <a href="https://issues.apache.org/jira/browse/MAHOUT-1265">MAHOUT-1265</a>.</li>
+   <li>Scala DSL Bindings for Mahout Math Linear Algebra.  See <a href="http://weatheringthrutechdays.blogspot.com/2013/07/scala-dsl-for-mahout-in-core-linear.html">this blogpost</a> - <a href="https://issues.apache.org/jira/browse/MAHOUT-1297">MAHOUT-1297</a></li>
+   <li>Recommenders as a Search.  See <a href="https://github.com/pferrel/solr-recommender">https://github.com/pferrel/solr-recommender</a> - <a href="https://issues.apache.org/jira/browse/MAHOUT-1288">MAHOUT-1288</a></li>
+   <li>Support for easy functional Matrix views and derivatives - <a href="https://issues.apache.org/jira/browse/MAHOUT-1300">MAHOUT-1300</a></li>
+   <li>JSON output format for ClusterDumper - <a href="https://issues.apache.org/jira/browse/MAHOUT-1343">MAHOUT-1343</a></li>
+   <li>Enable randomised testing for all Mahout modules using Carrot RandomizedRunner - <a href="https://issues.apache.org/jira/browse/MAHOUT-1345">MAHOUT-1345</a></li>
+   <li>Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering - <a href="https://issues.apache.org/jira/browse/MAHOUT-1361">MAHOUT-1361</a>.  See this <a href="https://github.com/tdunning/t-digest/blob/master/docs/theory/t-digest-paper/histo.pdf">pdf</a> for the details.
+   <li>Upgrade to Lucene 4.6.1 - <a href="https://issues.apache.org/jira/browse/MAHOUT-1364">MAHOUT-1364</a></li>
+  </ul>
+
+  <p>Changes in 0.9 are detailed <a href="http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?view=markup&pathrev=1563661">here</a>.</p>
+
+#### 25 July 2013 - Apache Mahout 0.8 released
+
+  <p>Highlights include:</p>
+
+  <ul>
+    <li>Numerous performance improvements to Vector and Matrix implementations, API's and their iterators</li>
+		<li>Numerous performance improvements to the recommender implementations</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1088" class="external-link" rel="nofollow">MAHOUT-1088</a>: Support for biased item-based recommender</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1089" class="external-link" rel="nofollow">MAHOUT-1089</a>: SGD matrix factorization for rating prediction with user and item biases</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1106" class="external-link" rel="nofollow">MAHOUT-1106</a>: Support for SVD++</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-944" class="external-link" rel="nofollow">MAHOUT-944</a>:  Support for converting one or more Lucene storage indexes to SequenceFiles as well as an upgrade of the supported Lucene version to Lucene 4.3.1.</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1154" class="external-link" rel="nofollow">MAHOUT-1154</a> and friends: New streaming k-means implementation that offers on-line (and fast) clustering</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-833" class="external-link" rel="nofollow">MAHOUT-833</a>: Make conversion to SequenceFiles Map-Reduce, 'seqdirectory' can now be run as a MapReduce job.</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1052" class="external-link" rel="nofollow">MAHOUT-1052</a>: Add an option to MinHashDriver that specifies the dimension of vector to hash (indexes or values).</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-884" class="external-link" rel="nofollow">MAHOUT-884</a>: Matrix Concat utility, presently only concatenates two matrices.</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-1187" class="external-link" rel="nofollow">MAHOUT-1187</a>: Upgraded to CommonsLang3</li>
+		<li><a href="https://issues.apache.org/jira/browse/MAHOUT-916" class="external-link" rel="nofollow">MAHOUT-916</a>: Speedup the Mahout build by making tests run in parallel.</li>
+
+  </ul>
+
+  <p>Changes in 0.8 are detailed <a href="http://svn.apache.org/viewvc/mahout/trunk/CHANGELOG?revision=1501110&view=markup">here</a>.</p>
+
+#### 16 June 2012 - Apache Mahout 0.7 released
+
+  <p>Highlights include:</p>
+
+  <ul>
+    <li>Outlier removal capability in K-Means, Fuzzy K, Canopy and Dirichlet Clustering</li>
+    <li>New Clustering implementation for K-Means, Fuzzy K, Canopy and Dirichlet using Cluster Classifiers</li>
+    <li>Collections and Math API consolidated</li>
+    <li>(Complementary) Naive Bayes refactored and cleaned</li>
+    <li>Watchmaker and Old Naive Bayes dropped.</li>
+    <li>Many bug fixes, refactorings, and other small improvements</li>
+  </ul>
+
+  <p>Changes in 0.7 are detailed <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&version=12319261">here</a>.</p>
+
+
+
+#### 6 Feb 2012 - Apache Mahout 0.6 released
+
+  <p>Highlights include:</p>
+
+  <ul>
+    <li>Improved Decision Tree performance and added support for regression problems</li>
+    <li>New LDA implementation using Collapsed Variational Bayes 0th Derivative Approximation</li>
+    <li>Reduced runtime of LanczosSolver tests</li>
+    <li>K-Trusses, Top-Down and Bottom-Up clustering, Random Walk with Restarts implementation</li>
+    <li>Reduced runtime of dot product between vectors</li>
+    <li>Added MongoDB and Cassandra DataModel support</li>
+    <li>Increased efficiency of parallel ALS matrix factorization</li>
+    <li>SSVD enhancements</li>
+    <li>Performance improvements in RowSimilarityJob, TransposeJob</li>
+    <li>Added numerous clustering display examples</li>
+    <li>Many bug fixes, refactorings, and other small improvements</li>
+  </ul>
+
+  <p>Changes in 0.6 are detailed <a href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12310751&version=12316364">here</a>.</p>
+
+#### Past Releases
+
+ * [Mahout 0.5](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12315255&styleName=Text&projectId=12310751&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED|20f0d06214912accbd47acf2f0a89231ed00a767|lin)
+ * [Mahout 0.4](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12314281&styleName=Text&projectId=12310751&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED|20f0d06214912accbd47acf2f0a89231ed00a767|lin)
+ * [Mahout 0.3](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12314281&styleName=Text&projectId=12310751&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED|20f0d06214912accbd47acf2f0a89231ed00a767|lin)
+ * [Mahout 0.2](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12313278&styleName=Text&projectId=12310751&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED|20f0d06214912accbd47acf2f0a89231ed00a767|lin) 
+ * [Mahout 0.1](https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12312976&styleName=Html&projectId=12310751&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED%7C48e83cdefb8bca42acf8f129692f8c3a05b360cf%7Clout)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/completed/who-we-are.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/who-we-are.md b/website/old_site_migration/completed/who-we-are.md
new file mode 100644
index 0000000..24f493b
--- /dev/null
+++ b/website/old_site_migration/completed/who-we-are.md
@@ -0,0 +1,62 @@
+---
+layout: default
+title: Who We Are
+theme:
+    name: retro-mahout
+---
+
+<a name="WhoWeAre-Whoweare"></a>
+# Who we are
+
+Apache Mahout is maintained by a team of volunteer developers.
+
+<a name="WhoWeAre-CoreCommitters"></a>
+## Core Committers
+
+(Please keep the list below in alphabetical order by first name.)
+
+Name | Mail | PMC | Comment 
+----|------|-----|--------
+Anand Avati | avati@... | No | Twitter: @anandavati
+Andrew Musselman | akm@... | Yes | Twitter: @akm
+Andrew Palumbo | apalumbo@... | Yes (Chair) | |
+Benson Margulies | bimargulies@... | Yes | |
+Dan Filimon | dfilimon@... | No | |
+Dmitriy Lyubimov | dlyubimov@... | No (Emeritus) | 
+Drew Farris | drew@... | Yes | |
+Ellen Friedman | ellenf@... | No | Twitter: @Ellen_Friedman 
+Frank Scholten | frankscholten@... | No | |
+Gokhan Capan | gcapan@... | No | <a href="http://www.linkedin.com/in/gokhancapan">LinkedIn Profile</a>
+Grant Ingersoll | gsingers@... | Yes | Twitter: @gsingers
+Isabel Drost-Fromm | isabel@... | Yes | Passion for free software (development, but to some extent also the political and economic implications), interested in agile development and project management, lives in Germany. Follow me on Twitter @MaineC
+Jacob Alexander Mannix | jmannix@... | Yes | |
+Jeff Eastman | jeastman@... | No (Emeritus) |
+Paritosh Ranjan | pranjan@... | Yes | Twitter: @paritoshranjan
+Pat Ferrel | pat@... |  Yes | Twitter: @occam 
+Robin Anil | robinanil@... | Yes | |
+Sean Owen | srowen@... | No (Emeritus) |
+Sebastian Schelter | ssc@... | Yes | |
+Shannon Quinn | squinn@... | No | |
+Stevo Slavić | sslavic@... | No | Twitter: @sslavic
+Suneel Marthi | smarthi@... | Yes | Twitter: @suneelmarthi
+Ted Dunning | tdunning@... |  Yes | 
+Tom Pierce | tcp@... | No | |
+
+<a name="WhoWeAre-EmeritusCommitters"></a>
+## Emeritus Committers
+
+* Niranjan Balasubramanian (nbalasub@...)
+* Otis Gospodnetic (otis@...)
+* David Hall (dlwh@...)
+* Erik Hatcher (ehatcher@...)
+* Ozgur Yilmazel (oyilmazel@...)
+* Dawid Weiss (dweiss@...)
+* Karl Wettin (kalle@...)
+* AbdelHakim Deneche (adeneche@...)
+
+Note that the email addresses above end with @apache.org.
+
+<a name="WhoWeAre-Contributors"></a>
+## Contributors
+
+Apache Mahout contributors and their contributions to individual issues can be found at Apache <a href="http://issues.apache.org/jira/browse/MAHOUT">JIRA</a>.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/dont_migrate/collections.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/collections.md b/website/old_site_migration/dont_migrate/collections.md
new file mode 100644
index 0000000..327b079
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/collections.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: Collections
+theme:
+    name: retro-mahout
+---
+
+NOTE: I don't know when this was made, but there are lots of free data set sites now that track this...
+
+TODO: Organize these somehow, add one-line blurbs
+
+Organize by usage? (classification, recommendation etc.)
+
+<a name="Collections-CollectionsofCollections"></a>
+## Collections of Collections
+
+- [ML Data](http://mldata.org/about/)
+ ... repository supported by Pascal 2.
+- [DBPedia](http://wiki.dbpedia.org/Downloads30)
+- [UCI Machine Learning Repo](http://archive.ics.uci.edu/ml/)
+- [http://mloss.org/community/blog/2008/sep/19/data-sources/](http://mloss.org/community/blog/2008/sep/19/data-sources/)
+- [Linked Library Data](http://ckan.net/group/lld)
+ via CKAN
+- [InfoChimps](http://infochimps.com/)
+ Free and purchasable datasets
+- [http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle](http://www.linkedin.com/groupItem?view=&srchtype=discussedNews&gid=3638279&item=35736572&type=member&trk=EML_anet_ac_pst_ttle)
+ LinkedIn discussion of lots of data sets
+
+<a name="Collections-CategorizationData"></a>
+## Categorization Data
+
+- [20Newsgroups](http://people.csail.mit.edu/jrennie/20Newsgroups/)
+- [RCV1 data set](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)
+- [10 years of CLEF Data](http://direct.dei.unipd.it/)
+- [http://ece.ut.ac.ir/DBRG/Hamshahri/](http://ece.ut.ac.ir/DBRG/Hamshahri/)
+ (Approximately 160k categorized docs)
+There is a newer beta version here: [http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/](http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/)
+ (Approximately 320k categorized docs)
+- Lending Club loan data [https://www.lendingclub.com/info/download-data.action](https://www.lendingclub.com/info/download-data.action)
+
+<a name="Collections-RecommendationData"></a>
+## Recommendation Data
+
+- [Book usage and recommendation data from the University of Huddersfield](http://library.hud.ac.uk/data/usagedata/)
+- [Last.fm](http://denoiserthebetter.posterous.com/music-recommendation-datasets)
+ \- Non-commercial use only
+- [Amazon Product Review Data via Jindal and Liu](http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
+ -- Scroll down
+- [GroupLens/MovieLens Movie Review Dataset](http://www.grouplens.org/node/73)
+
+<a name="Collections-MultilingualData"></a>
+## Multilingual Data
+
+- [http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php](http://urd.let.rug.nl/tiedeman/OPUS/OpenSubtitles.php)
+ \- 308,000 subtitle files covering about 18,900 movies in 59 languages
+(July 2006 numbers). This is a curated collection of subtitles from an
+aggregation site, [http://www.openSubTitles.org](http://www.openSubTitles.org).
+The original site, OpenSubtitles.org, is up to 1.6m subtitle files.
+- [Statistical Machine Translation](http://www.statmt.org/)
+ \- devoted to all things language translation. Includes multilingual
+corpuses of European and Canadian legal tomes.
+
+<a name="Collections-Geospatial"></a>
+## Geospatial
+
+- [Natural Earth Data](http://www.naturalearthdata.com/)
+- [Open Street Maps](http://wiki.openstreetmap.org/wiki/Main_Page)
+And other crowd-sourced mapping data sites.
+
+<a name="Collections-Airline"></a>
+## Airline
+
+- [Open Flights](http://openflights.org/)
+ \- Crowd-sourced database of airlines, flights, airports, times, etc.
+- [Airline on-time information - 1987-2008](http://stat-computing.org/dataexpo/2009/)
+ \- 120m CSV records, 12G uncompressed
+
+<a name="Collections-GeneralResources"></a>
+## General Resources
+
+- [theinfo](http://theinfo.org/)
+- [WordNet](http://wordnet.princeton.edu/obtain)
+- [Common Crawl](http://www.commoncrawl.org/)
+ \- freely available web crawl on EC2
+
+<a name="Collections-Stuff"></a>
+## Stuff
+
+- [http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html](http://www.cs.technion.ac.il/~gabr/resources/data/ne_datasets.html)
+- [4 Universities Data Set](http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/)
+- [Large crawl of Twitter](http://an.kaist.ac.kr/traces/WWW2010.html)
+- [UniProt](http://beta.uniprot.org/)
+- [http://www.icwsm.org/2009/data/](http://www.icwsm.org/2009/data/)
+- [http://data.gov](http://data.gov)
+- [http://www.ckan.net/](http://www.ckan.net/)
+- [http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world](http://www.guardian.co.uk/news/datablog/2010/jan/07/government-data-world)
+- [http://data.gov.uk/](http://data.gov.uk/)
+- [51,000 US Congressional Bills tagged](http://www.ark.cs.cmu.edu/bills/)

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/dont_migrate/glossary.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/glossary.md b/website/old_site_migration/dont_migrate/glossary.md
new file mode 100644
index 0000000..5ff1015
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/glossary.md
@@ -0,0 +1,15 @@
+---
+layout: default
+title: Glossary
+theme:
+    name: retro-mahout
+---
+
+NOTE: do not migrate - empty file. Good idea though in general.
+
+
+This is a list of common glossary terms used both on the mailing lists and
+around the site. Where possible I have tried to provide a link to more
+in-depth explanations from the web.
+
+{children:excerpt=true|style=h4}

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/dont_migrate/mahout-benchmarks.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/mahout-benchmarks.md b/website/old_site_migration/dont_migrate/mahout-benchmarks.md
new file mode 100644
index 0000000..1502b90
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/mahout-benchmarks.md
@@ -0,0 +1,156 @@
+---
+layout: default
+title: Mahout Benchmarks
+theme:
+    name: retro-mahout
+---
+
+NOTE: These are all pretty old.  I think it would be cool to have a page
+like this but it may need a total reboot...
+
+<a name="MahoutBenchmarks-Introduction"></a>
+# Introduction
+
+Depending on your hardware configuration and the exact distribution of
+ratings over users and items, your mileage may vary!
+
+<a name="MahoutBenchmarks-Recommenders"></a>
+# Recommenders
+
+<a name="MahoutBenchmarks-ARuleofThumb"></a>
+## A Rule of Thumb
+
+100M preferences is roughly the data set size at which non-distributed
+recommenders will outgrow a normal-sized machine (32-bit, <= 4GB RAM). Your
+mileage will vary significantly with the nature of the data.
+
+<a name="MahoutBenchmarks-Distributedrecommendervs.Wikipedialinks(May272010)"></a>
+## Distributed recommender vs. Wikipedia links (May 27 2010)
+
+From the mailing list:
+
+I just finished running a set of recommendations based on the Wikipedia
+link graph, for book purposes (yeah, it's unconventional). I ran on my
+laptop, but it ought to be crudely representative of how it runs in a real
+cluster.
+
+The input is 1058MB as a text file and contains 130M article-article
+associations, from 5.7M articles to 3.8M distinct articles ("users" and
+"items", respectively). I estimate cost based on Amazon's North
+American small Linux-based instance pricing of $0.085/hour. I ran on a
+dual-core laptop with plenty of RAM, allowing 1GB per worker, so the cost
+estimate should be roughly valid.
+
+In this run, I run recommendations for all 5.7M "users". You can certainly
+run for any subset of all users of course.
+
+Phase 1 (Item ID to item index mapping)
+29 minutes CPU time
+$0.05
+60MB output
+
+Phase 2 (Create user vectors)
+88 minutes CPU time
+$0.13
+Output: 1159MB
+
+Phase 3 (Count co-occurrence)
+77 hours CPU time
+$6.54
+Output: 23.6GB
+
+Phase 4 (Partial multiply prep)
+10.5 hours CPU time
+$0.90
+Output: 24.6GB
+
+Phase 5 (Aggregate and recommend)
+about 600 hours
+about $51.00
+about 10GB
+(I estimated these rather than let it run at home for days!)
+
+
+Note that phases 1 and 3 may be run less frequently, and need not be run
+every time. But the cost is dominated by the last step, which is most of
+the work. I've ignored storage costs.
+
+This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000 user
+recommendations. That's not bad if, say, you want to update recs for your
+site's 100,000 daily active users for a dollar.
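+
+As a rough back-of-the-envelope check (assuming the five per-phase costs
+above are the whole bill):
+
+    $0.05 + $0.13 + $6.54 + $0.90 + $51.00     ≈ $58.6 total for 5.7M users
+    $58.6 / 5,700 thousand users               ≈ $0.01 per 1,000 users
+    $58.6 / $0.085 per hour / 5,700 * 60       ≈ 7-8 instance-minutes per 1,000 users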
+
+There are several levers one could pull internally to sacrifice accuracy
+for speed, but it's currently set to pretty normal values. So this is just
+one possibility.
+
+Now that's not terrible, but it is about 8x more computing than would be
+needed by a non-distributed implementation *if* you could fit the whole
+data set into a very large instance's memory, which is still possible at
+this scale but needs a pretty big instance. That's a very apples-to-oranges
+comparison of course; different algorithms, entirely different
+environments. This is about the amount of overhead I'd expect from
+distributing -- interesting to note how non-trivial it is.
+
+<a name="MahoutBenchmarks-Non-distributedrecommendervs.KDDCupdataset(March2011)"></a>
+## Non-distributed recommender vs. KDD Cup data set (March 2011)
+
+(From the user@mahout.apache.org mailing list)
+
+I've been test-driving a simple application of Mahout recommenders (the
+non-distributed kind) on Amazon EC2 on the new Yahoo KDD Cup data set
+(kddcup.yahoo.com).
+
+In the spirit of open source, like I mentioned, I'm committing the extra
+code to mahout-examples that can be used to run a Recommender on the input
+and produce output in the right format. And I'd like to publish the rough
+timings too. Find all the source in org.apache.mahout.cf.taste.example.kddcup
+
+<a name="MahoutBenchmarks-Track1"></a>
+### Track 1
+
+* m2.2xlarge instance, 34.2GB RAM / 4 cores
+* Steady state memory consumption: ~19GB
+* Computation time: 30 hours (wall clock-time)
+* CPU time per user: ~0.43 sec
+* Cost on EC2: $34.20 (!)
+
+(Helpful hint on cost I realized after the fact: you can almost surely get
+spot instances for cheaper. The maximum price this sort of instance has
+gone for as a spot instance is about $0.60/hour, vs "retail price" of
+$1.14/hour.)
+
+Resulted in an RMSE of 29.5618 (the rating scale is 0-100), which is only
+good enough for 29th place at the moment. Not terrible for "out of the box"
+performance -- it's just using an item-based recommender with uncentered
+cosine similarity. But not really good in absolute terms. A winning
+solution is going to try to factor in time, and apply more sophisticated
+techniques. The best RMSE so far is about 23.
+
+<a name="MahoutBenchmarks-Track2"></a>
+### Track 2
+
+* c1.xlarge instance: 7GB RAM / 8 cores
+* Steady state memory consumption: ~3.8GB
+* Computation time: 4.1 hours (wall clock-time)
+* CPU time per user: ~1.1 sec
+* Cost on EC2: $3.20
+
+For this I bothered to write a simplistic item-item similarity metric to
+take into account the additional info that is available: track, artist,
+album, genre. The result was comparatively better: 17.92% error rate, good
+enough for 4th place at the moment.
+
+Of course, the next task is to put this through the actual distributed
+processing -- that's really the appropriate solution.
+
+This shows you can still tackle fairly impressive scale with a
+non-distributed solution. These results suggest that the largest instances
+available from EC2 would accommodate almost 1 billion ratings in memory.
+However, at that scale, running a user's full recommendations would easily
+be measured in seconds, not milliseconds.
+
+<a name="MahoutBenchmarks-Clustering"></a>
+# Clustering
+
+See [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/dont_migrate/mahoutintegration.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/mahoutintegration.md b/website/old_site_migration/dont_migrate/mahoutintegration.md
new file mode 100644
index 0000000..e2d01ea
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/mahoutintegration.md
@@ -0,0 +1,6 @@
+---
+layout: default
+title: MahoutIntegration
+theme:
+    name: retro-mahout
+---

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/dont_migrate/recommender-overview.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/dont_migrate/recommender-overview.md b/website/old_site_migration/dont_migrate/recommender-overview.md
new file mode 100644
index 0000000..a48d47f
--- /dev/null
+++ b/website/old_site_migration/dont_migrate/recommender-overview.md
@@ -0,0 +1,34 @@
+---
+layout: default
+title: Recommender Quickstart
+theme:
+    name: retro-mahout
+---
+
+Not migrating because this appears to be the same content as intro-cooccurrence-spark.md
+
+# Recommender Overview
+
+Recommenders have changed over the years. Mahout contains a long list of them, which you can still use. But to get the best out of our more modern approach we'll need to think of the recommender as two components: a "model creation" component, supplied by Mahout's new spark-itemsimilarity job, and a "serving" component, supplied by a modern scalable search engine such as Solr.
+
+![image](http://i.imgur.com/fliHMBo.png)
+
+To integrate with your application you will collect user interactions, storing them in a DB and also in a form usable by Mahout. The simplest way to do this is to log user interactions to csv files (user-id, item-id). The DB should be set up to contain the last n user interactions, which will form part of the query for recommendations.
+
+Mahout's spark-itemsimilarity will create a table of (item-id, list-of-similar-items) in csv form. Think of this as an item collection with one field containing the item-ids of similar items. Index this with your search engine. 
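+
+A minimal sketch of the driver invocation is shown below; the `interactions.csv` and `indicators/` paths are hypothetical placeholders, and you should check `mahout spark-itemsimilarity --help` on your Mahout version for the full set of options (Spark master, filters, column positions, and so on):
+
+    mahout spark-itemsimilarity \
+        --input interactions.csv \
+        --output indicators/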
+
+When your application needs recommendations for a specific person, get the latest user history of interactions from the DB and query the indicator collection with this history. You will get back an ordered list of item-ids. These are your recommendations. You may wish to filter out any that the user has already seen but that will depend on your use case.
+
+All ids for users and items are preserved as string tokens and so work as external keys in DBs or as doc ids for search engines; they also work as tokens for search queries.
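+
+As a purely hypothetical illustration of the formats described above (all ids below are made up), the logged interactions and the resulting indicator data might look like:
+
+    # interaction log (user-id,item-id), one line per interaction
+    u1,ipad
+    u1,iphone
+    u2,nexus
+
+    # spark-itemsimilarity output (item-id, list-of-similar-items)
+    ipad,iphone galaxy
+    nexus,galaxy
+
+A recommendation query for user u1 is then that user's recent history, e.g. "ipad iphone", run against the indicator field of the item collection; the exact query syntax depends on which search engine you use.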
+
+##References
+
+1. A free ebook, which talks about the general idea: [Practical Machine Learning](https://www.mapr.com/practical-machine-learning)
+2. A slide deck, which talks about mixing actions or other indicators: [Creating a Multimodal Recommender with Mahout and a Search Engine](http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/)
+3. Two blog posts: [What's New in Recommenders: part #1](http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/)
+and  [What's New in Recommenders: part #2](http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/)
+4. A post describing the log-likelihood ratio: [Surprise and Coincidence](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). LLR is used to reduce noise in the data while keeping the calculations O(n) complexity.
+
+##Mahout Model Creation
+
+See the page describing [*spark-itemsimilarity*](http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html) for more details.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/algorithms.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/algorithms.md b/website/old_site_migration/needs_work_convenience/algorithms.md
new file mode 100644
index 0000000..657efde
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/algorithms.md
@@ -0,0 +1,58 @@
+---
+layout: default
+title: Algorithms
+theme:
+    name: retro-mahout
+---
+
+
+---
+*Mahout 0.12.0 Features by Engine*   
+---
+
+| | **Single Machine** | [**MapReduce**](http://hadoop.apache.org/)| [**Spark**](https://spark.apache.org/) | [**H2O**](http://0xdata.com/) | [**Flink**](https://flink.apache.org/) |
+---------------------------------------------|:----------------:|:-----------:|:------:|:---:|:----:|
+**Mahout Math-Scala Core Library and Scala DSL**| 
+|   [Mahout Distributed BLAS. Distributed Row Matrix API with R and Matlab like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis](http://mahout.apache.org/users/sparkbindings/home.html).    | |  | [x](https://mahout.apache.org/users/sparkbindings/ScalaSparkBindings.pdf) | [x](https://github.com/apache/mahout/tree/master/h2o) |[x](https://github.com/apache/mahout/tree/flink-binding/flink)
+||
+**Mahout Interactive Shell**| 
+|   [Interactive REPL shell for Spark optimized Mahout DSL](http://mahout.apache.org/users/sparkbindings/play-with-shell.html) | | | x |
+||
+**Collaborative Filtering** *with CLI drivers*|
+    User-Based Collaborative Filtering           | *deprecated* | *deprecated*|[x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html)
+    Item-Based Collaborative Filtering           | x | [x](https://mahout.apache.org/users/recommender/intro-itembased-hadoop.html) | [x](https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html) |
+    Matrix Factorization with ALS | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
+    Matrix Factorization with ALS on Implicit Feedback | x | [x](https://mahout.apache.org/users/recommender/intro-als-hadoop.html) |  |
+    Weighted Matrix Factorization, SVD++  | x | | 
+||
+**Classification** *with CLI drivers*| | |
+    Logistic Regression - trained via SGD   | [*deprecated*](http://mahout.apache.org/users/classification/logistic-regression.html) |
+    Naive Bayes / Complementary Naive Bayes  | | [*deprecated*](https://mahout.apache.org/users/classification/bayesian.html) | [x](https://mahout.apache.org/users/algorithms/spark-naive-bayes.html) | 
+    Hidden Markov Models   | [*deprecated*](https://mahout.apache.org/users/classification/hidden-markov-models.html) |
+||
+**Clustering** *with CLI drivers*||
+    Canopy Clustering  | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/canopy-clustering.html)| 
+    k-Means Clustering   | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) | [*deprecated*](https://mahout.apache.org/users/clustering/k-means-clustering.html) |  
+    Fuzzy k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/fuzzy-k-means.html)|  
+    Streaming k-Means   | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) | [*deprecated*](https://mahout.apache.org/users/clustering/streaming-k-means.html) |  
+    Spectral Clustering   |  | [*deprecated*](https://mahout.apache.org/users/clustering/spectral-clustering.html) |  
+||
+**Dimensionality Reduction** *note: most scala-based dimensionality reduction algorithms are available through the [Mahout Math-Scala Core Library for all engines](https://mahout.apache.org/users/sparkbindings/home.html)*||
+    Singular Value Decomposition | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html) |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
+    Lanczos Algorithm  | *deprecated* | *deprecated* | 
+    Stochastic SVD  | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [*deprecated*](https://mahout.apache.org/users/dim-reduction/ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html) | [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)|    [x](http://mahout.apache.org/users/algorithms/d-ssvd.html)
+    PCA (via Stochastic SVD) | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/sparkbindings/home.html)  |[x](http://mahout.apache.org/users/environment/h2o-internals.html) |   [x](http://mahout.apache.org/users/flinkbindings/flink-internals.html)
+    QR Decomposition         | *deprecated* | *deprecated* | [x](http://mahout.apache.org/users/algorithms/d-qr.html) |[x](http://mahout.apache.org/users/algorithms/d-qr.html) |   [x](http://mahout.apache.org/users/algorithms/d-qr.html)
+||
+**Topic Models**||
+    Latent Dirichlet Allocation  | *deprecated* | *deprecated* |
+||
+**Miscellaneous**||
+    RowSimilarityJob   |  | *deprecated* | [x](https://github.com/apache/mahout/blob/master/spark/src/test/scala/org/apache/mahout/drivers/RowSimilarityDriverSuite.scala) |
+    Collocations  |  | [*deprecated*](https://mahout.apache.org/users/basics/collocations.html) |  
+    Sparse TF-IDF Vectors from Text |  | [*deprecated*](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) |
+    XML Parsing|  | [*deprecated*](https://issues.apache.org/jira/browse/MAHOUT-1479?jql=text%20~%20%22wikipedia%20mahout%22) |
+    Email Archive Parsing |  | [*deprecated*](https://github.com/apache/mahout/tree/master/integration/src/main/java/org/apache/mahout/text) | 
+    Evolutionary Processes | [x](https://github.com/apache/mahout/tree/master/mr/src/main/java/org/apache/mahout/ep) |
+    
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/bayesian-commandline.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/bayesian-commandline.md b/website/old_site_migration/needs_work_convenience/bayesian-commandline.md
new file mode 100644
index 0000000..d81d3ef
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/bayesian-commandline.md
@@ -0,0 +1,64 @@
+---
+layout: default
+title: bayesian-commandline
+theme:
+    name: retro-mahout
+---
+
+# Naive Bayes commandline documentation
+
+<a name="bayesian-commandline-Introduction"></a>
+## Introduction
+
+This quick start page describes how to run the naive Bayes and
+complementary naive Bayes classification algorithms on a Hadoop cluster.
+
+<a name="bayesian-commandline-Steps"></a>
+## Steps
+
+<a name="bayesian-commandline-Testingitononesinglemachinew/ocluster"></a>
+### Testing it on a single machine without a cluster
+
+In the examples directory type:
+
+    mvn -q exec:java
+        -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.bayes.<JOB>"
+        -Dexec.args="<OPTIONS>"
+
+    mvn -q exec:java
+        -Dexec.mainClass="org.apache.mahout.classifier.bayes.mapreduce.cbayes.<JOB>"
+        -Dexec.args="<OPTIONS>"
+
+
+<a name="bayesian-commandline-Runningitonthecluster"></a>
+### Running it on the cluster
+
+* In $MAHOUT_HOME/, build the jar containing the job (mvn install). The job
+will be generated in $MAHOUT_HOME/core/target/ and its name will contain
+the Mahout version number. For example, when using the Mahout 0.1 release,
+the job will be mahout-core-0.1.jar
+
+* (Optional) Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
+
+* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
+
+* Run the job:
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-<MAHOUT VERSION>.job
+        org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver <OPTIONS>
+
+* Get the data out of HDFS and have a look. Use bin/hadoop fs -lsr output
+to view all outputs.
+
+<a name="bayesian-commandline-Commandlineoptions"></a>
+## Command line options
+
+    BayesDriver, BayesThetaNormalizerDriver, CBayesNormalizedWeightDriver, CBayesDriver, CBayesThetaDriver, CBayesThetaNormalizerDriver, BayesWeightSummerDriver, BayesFeatureDriver, BayesTfIdfDriver Usage:
+        [--input <input> --output <output> --help]
+      
+    Options
+    
+      --input (-i) input	  The Path for input Vectors. Must be a SequenceFile of Writable, Vector.
+      --output (-o) output	  The directory pathname for output points.
+      --help (-h)		  Print out help.
+
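+For example, here is an illustrative end-to-end invocation of the Bayes driver,
+assuming the Mahout 0.1 job file and the testdata/output HDFS paths used in the
+steps above (substitute your own paths, driver class, and options as needed):
+
+    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core-0.1.job \
+        org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver \
+        --input testdata --output output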

