mahout-commits mailing list archives

From rawkintr...@apache.org
Subject [4/9] mahout git commit: WEBSITE Triage of Old Site Migration
Date Sat, 29 Apr 2017 23:24:53 GMT
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md b/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
new file mode 100644
index 0000000..57378ba
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
@@ -0,0 +1,222 @@
+---
+layout: default
+title: Using Mahout with Python via JPype
+theme:
+    name: retro-mahout
+---
+
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# Mahout with Python via JPype - some examples
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython (the standard Python
+implementation) as opposed to Jython. Jython wasn't an option for me,
+because (to the best of my knowledge) it doesn't work with the Python
+extensions NumPy, matplotlib, or h5py, which I rely on heavily.
+
+The instructions below explain how to set up a Python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for Python.
+
+The first step in setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your Python script, create a global variable with the path to this library.
+
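+A minimal sketch (the path below is only a hypothetical illustration; use the
+actual library path from your own JVM installation, or let JPype locate it):
+
+    from jpype import getDefaultJVMPath
+
+    # path to the JVM dynamic library -- adjust to your own installation
+    jvmlib = "/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so"
+    # alternatively, if your JPype version provides it:
+    # jvmlib = getDefaultJVMPath()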
+
+
+Next we need to figure out how to set the classpath for Mahout. The
+easiest way to do this is to edit the "bin/mahout" script to print out
+the classpath: add the line "echo $CLASSPATH" somewhere after
+the comment "run it" (this is line 195 or so), then execute the script to print
+out the classpath. Copy this output and paste it into a variable in your
+Python script. The result for me looks like the following.
+
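+A purely hypothetical illustration of what such a variable might look like
+(the jar names and paths will differ; paste your own "echo $CLASSPATH" output
+rather than this one):
+
+    # hypothetical example only -- use the output of your own bin/mahout script
+    classpath = ("/home/me/mahout/conf:"
+                 "/home/me/mahout/mahout-core-0.7-job.jar:"
+                 "/home/me/mahout/mahout-math-0.7.jar:"
+                 "/home/me/mahout/lib/hadoop-core-1.0.3.jar")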
+
+
+
+Now we can create a function to start the JVM from Python using JPype:
+
+    from jpype import startJVM, JPackage, JArray, JDouble, JObject
+
+    jvm = None
+
+    def start_jpype():
+        global jvm
+        if jvm is None:
+            # jvmlib and classpath are the global variables defined above
+            cpopt = "-Djava.class.path={cp}".format(cp=classpath)
+            startJVM(jvmlib, "-ea", cpopt)
+            jvm = "started"
+
+
+
+<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files containing vectors to
+be used by Mahout for k-means. The example below is a function which creates
+vectors drawn from two Gaussian distributions with unit variance.
+
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed
+    	ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1] ,[-1,-1]],np.int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(np.str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((np.str,4))+"_"+pts[row,1].astype((np.str,4))
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly, we can use JPype to easily read the clustered points output by
+Mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file containing the clustered points
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):	
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[1]=="NamedVector"):  
+       print "cluster={key} Name={name} x={x}y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need tomodify/test the code to handle this case.")
+
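+Since the stated goal of this tutorial is analyzing and plotting Mahout data
+from Python, a natural next step is to plot the clustered points. The sketch
+below assumes you have adapted the loop above to collect (cluster key, x, y)
+tuples into a list instead of printing them:
+
+    import numpy as np
+    import matplotlib.pyplot as plt
+
+    def plot_clustered_pts(points):
+        """Scatter-plot clustered points.
+           points - list of (cluster_key, x, y) tuples collected while reading
+        """
+        for k in sorted(set(key for key, _, _ in points)):
+            xy = np.array([[x, y] for key, x, y in points if key == k])
+            plt.scatter(xy[:, 0], xy[:, 1], label=k)
+        plt.legend()
+        plt.show()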
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally, we can create a function to print out the actual cluster centers
+found by Mahout.
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters outputted by kmenas
+    	   ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file containing the cluster centroids
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):	
+      center=vecwritable.getCenter()
+      
+      print "id={cid}center={center}".format(cid=vecwritable.getId(),center=center.values)
+      pass
+
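+A hypothetical end-to-end use of the functions above might look like the
+following (the paths are only illustrative; the actual k-means output paths
+depend on how you ran the clustering job):
+
+    # write input vectors, run Mahout k-means on them externally, then
+    # inspect the clustered points and the final centroids
+    create_inputs("testdata/points/input.seq")
+    read_clustered_pts("output/clusteredPoints/part-m-00000")
+    getClusters("output/clusters-10-final/part-r-00000")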

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
new file mode 100644
index 0000000..2acacd0
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: Introduction to ALS Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+
+# Introduction to ALS Recommendations with Hadoop
+
+##Overview
+
+Mahout’s ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the item-to-feature matrix *M*, and it runs the ALS algorithm in a parallel fashion. The algorithm is described in detail in the following papers: 
+
+* [Large-scale Parallel Collaborative Filtering for
+the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433) 
+
+This recommendation algorithm can be used in an eCommerce platform to recommend products to customers. Unlike the user- or item-based recommenders, which compute the similarity of users or items to make recommendations, the ALS algorithm uncovers the latent factors that explain the observed user-to-item ratings and tries to find optimal factor weights to minimize the least-squares error between predicted and actual ratings.
+
+Mahout's ALS recommendation algorithm takes user preferences by item as input and generates recommended items for each user as output. The input preferences can be either explicit user ratings or implicit feedback such as a user's clicks on a web page.
+
+One of the strengths of the ALS-based recommender, compared to the user- or item-based recommenders, is its ability to handle large sparse data sets and its better prediction performance. It can also give an intuitive rationale for the factors that influence recommendations.
+
+##Implementation
+At present Mahout has a map-reduce implementation of ALS, which is composed of two jobs: a parallel matrix factorization job and a recommendation job.
+The matrix factorization job computes the user-to-feature matrix and the item-to-feature matrix given the user-to-item ratings. Its input includes: 
+<pre>
+    --input: directory containing files of explicit user to item rating or implicit feedback;
+    --output: output path of the user-feature matrix and feature-item matrix;
+    --lambda: regularization parameter to avoid overfitting;
+    --alpha: confidence parameter, only used with implicit feedback;
+    --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback;
+    --numFeatures: dimensions of feature space;
+    --numThreadsPerSolver: number of threads per solver mapper for concurrent execution;
+    --numIterations: number of iterations;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated
+</pre>
+and it outputs the matrices in sequence file format. 
+
+The recommendation job uses the user feature matrix and item feature matrix calculated from the factorization job to compute the top-N recommendations per user. Its input includes:
+<pre>
+    --input: directory containing files of user ids;
+    --output: output path of the recommended items for each input user id;
+    --userFeatures: path to the user feature matrix;
+    --itemFeatures: path to the item feature matrix;
+    --numRecommendations: maximum number of recommendations per user, default is 10;
+    --maxRating: maximum rating available;
+    --numThreads: number of threads per mapper;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated;
+    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
+    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true) 
+</pre>
+and it outputs a list of recommended item ids for each user. The predicted rating between user and item is a dot product of the user's feature vector and the item's feature vector.  
+
+##Example
+
+Let’s look at a simple example of how we could use Mahout’s ALS recommender to recommend items for users. First, you’ll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we’re ready to run the example.
+
+**Step 1: Prepare test data**
+
+Similar to Mahout's item-based recommender, the ALS recommender relies on user-to-item preference data: *userID*, *itemID* and *preference*. The preference can be an explicit numeric rating or a count of actions such as clicks (implicit feedback). Each line of the test data file is a tab-delimited string: the first field is the user ID, which must be numeric; the second field is the item ID, which must be numeric; and the third field is the preference, which should also be a number.
+
+**Note:** You must create IDs that are ordinal positive integers for all user and item IDs. Often this will require you to keep a dictionary
+to map into and out of Mahout IDs. For instance, if the first user has ID "xyz" in your application, it would get a Mahout ID of the integer 1, and so on. The same
+applies to item IDs. After recommendations are calculated you will have to translate the Mahout user and item IDs back into your application IDs; a minimal sketch of such a mapping is shown below.
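+
+The sketch below is not part of Mahout; it only illustrates the bookkeeping
+your application would do to maintain the two-way ID mapping:
+
+    # map arbitrary application IDs to ordinal positive integers and back
+    to_mahout = {}      # application ID -> Mahout ID
+    from_mahout = {}    # Mahout ID -> application ID
+
+    def mahout_id(app_id):
+        if app_id not in to_mahout:
+            new_id = len(to_mahout) + 1   # Mahout IDs start at 1
+            to_mahout[app_id] = new_id
+            from_mahout[new_id] = app_id
+        return to_mahout[app_id]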
+
+To quickly start, you could specify a text file like following as the input:
+<pre>
+1	100	1
+1	200	5
+1	400	1
+2	200	2
+2	300	1
+</pre>
+
+**Step 2: Determine parameters**
+
+In addition, you need to choose the dimension of the feature space and the number of iterations to run the alternating least squares algorithm. Using 10 features and 15 iterations is a reasonable default to try first. Optionally, a confidence parameter can be set if the input preference is implicit user feedback.  
+
+**Step 3: Run ALS**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly we’re ready to configure our syntax. Enter the following command:
+
+    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5  --numThreadsPerSolver 1 --tempDir tmp 
+
+Running the command will execute a series of jobs, the final product of which will be an output file deposited in the output directory specified in the command syntax. The output directory contains three sub-directories: *M* stores the item-to-feature matrix, *U* stores the user-to-feature matrix and *userRatings* stores the users' ratings on the items. The *tempDir* parameter specifies the directory to store the intermediate output of the job, such as the matrix output in each iteration and each item's average rating. Using *tempDir* will help with debugging.
+
+**Step 4: Make Recommendations**
+
+Based on the output feature matrices from step 3, we could make recommendations for users. Enter the following command:
+
+     $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1
+
+The input user file is a sequence file; each record's key is a user ID and its value is the list of item IDs the user has already rated, which will be excluded from the recommendations. The output file generated in our simple example will be a text file giving the recommended item IDs for each user. 
+Remember to translate the Mahout IDs back into your application-specific IDs. 
+
+There are a variety of parameters for Mahout’s ALS recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Feel free to ask such questions on the [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
new file mode 100644
index 0000000..ee2c3e8
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Introduction to Item-Based Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+# Introduction to Item-Based Recommendations with Hadoop
+
+##Overview
+
+Mahout’s item based recommender is a flexible and easily implemented algorithm with a diverse range of applications. The minimalism of the primary input file’s structure and availability of ancillary filtering controls can make sourcing required data and shaping a desired output both efficient and straightforward.
+
+Typical use cases include:
+
+* Recommend products to customers via an eCommerce platform (think: Amazon, Netflix, Overstock)
+* Identify organic sales opportunities
+* Segment users/customers based on similar item preferences
+
+Broadly speaking, Mahout's item-based recommendation algorithm takes as input customer preferences by item and generates an output recommending similar items with a score indicating whether a customer will "like" the recommended item.
+
+One of the strengths of the item based recommender is its adaptability to your business conditions or research interests. For example, there are many available approaches for providing product preference. One such method is to calculate the total orders for a given product for each customer (i.e. Acme Corp has ordered Widget-A 5,678 times) while others rely on user preference captured via the web (i.e. Jane Doe rated a movie as five stars, or gave a product two thumbs up).
+
+Additionally, a variety of methodologies can be implemented to narrow the focus of Mahout's recommendations, such as:
+
+* Exclude low volume or low profitability products from consideration
+* Group customers by segment or market rather than using user/customer level data
+* Exclude zero-dollar transactions, returns or other order types
+* Map product substitutions into the Mahout input (i.e. if WidgetA is a recommended item replace it with WidgetX)
+
+The item based recommender output can be easily consumed by downstream applications (i.e. websites, ERP systems or salesforce automation tools) and is configurable so users can determine the number of item recommendations generated by the algorithm.
+
+##Example
+
+Testing the item based recommender can be a simple and potentially quite rewarding endeavor. Whereas the typical sample use case for collaborative filtering focuses on utilization of, and integration with, eCommerce platforms we can instead look at a potential use case applicable to most businesses (even those without a web presence). Let’s look at how a company might use Mahout’s item based recommender to identify new sales opportunities for an existing customer base. First, you’ll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we’re ready to run a quick example.
+
+**Step 1: Gather some test data**
+
+Mahout’s item based recommender relies on three key pieces of data: *userID*, *itemID* and *preference*. The “users” could be website visitors or simply customers that purchase products from your business. Similarly, items could be products, product groups or even pages on your website – really anything you would want to recommend to a group of users or customers. For our example let’s use customer orders as a proxy for preference. A simple count of distinct orders by customer, by product will work for this example; a sketch of that aggregation is shown below. You’ll find as you explore ways to manipulate the item based recommender that the preference value can be many things (page clicks, explicit ratings, order counts, etc.). Once your test data is gathered, put it in a *.txt* file separated by commas with no column headers included.
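+
+A hypothetical sketch of the aggregation step (it assumes a raw orders file
+with one numeric customerID,productID pair per order line; the file names and
+format are illustrative only):
+
+    import csv
+    from collections import Counter
+
+    # count distinct orders per (customer, product) pair and write the
+    # userID,itemID,preference triples the recommender expects
+    counts = Counter()
+    with open("orders.csv") as orders:
+        for customer_id, product_id in csv.reader(orders):
+            counts[(customer_id, product_id)] += 1
+
+    with open("mahout_input.txt", "w") as out:
+        for (customer_id, product_id), n in counts.items():
+            out.write("{0},{1},{2}\n".format(customer_id, product_id, n))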
+
+**Step 2: Pick a similarity measure**
+
+Choosing a similarity measure for use in a production environment is something that requires careful testing, evaluation and research. For our example purposes, we’ll just go with a Mahout similarity classname called *SIMILARITY_LOGLIKELIHOOD*.
+
+**Step 3: Configure the Mahout command**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly we’re ready to configure our syntax. Enter the following command:
+
+    $ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25
+
+Running the command will execute a series of jobs, the final product of which will be an output file deposited in the directory specified in the command syntax. The output file will contain two columns: the *userID* and an array of *itemIDs* and scores.
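+
+If you want to post-process that output programmatically, a small parsing
+sketch might look like the following (it assumes each output line has the form
+`userID<TAB>[itemID1:score1,itemID2:score2,...]`; verify the exact format
+against your own output before relying on it):
+
+    def parse_recommendations(line):
+        """Parse one output line into (userID, [(itemID, score), ...])."""
+        user_id, rest = line.rstrip().split("\t", 1)
+        items = []
+        pairs = rest.strip("[]")
+        if pairs:
+            for pair in pairs.split(","):
+                item_id, score = pair.split(":")
+                items.append((item_id, float(score)))
+        return user_id, items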
+
+**Step 4: Making use of the output and doing more with Mahout**
+
+The output file generated in our simple example can be transformed using your tool of choice and consumed by downstream applications. There are a variety of configuration options for Mahout’s item based recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Our user community is accessible via our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html) and the book *Mahout In Action* is a fantastic (but slightly outdated) starting point. 

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
new file mode 100644
index 0000000..63de4fd
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
@@ -0,0 +1,187 @@
+---
+layout: default
+title: Introduction to Matrix Factorization for Recommendation Mining
+theme:
+    name: retro-mahout
+---
+<a name="MatrixFactorization-Intro"></a>
+# Introduction to Matrix Factorization for Recommendation Mining
+
+In the mathematical discipline of linear algebra, a matrix decomposition 
+or matrix factorization is a dimensionality reduction technique that factorizes a matrix into a product of matrices, usually two. 
+There are many different matrix decompositions, each finds use among a particular class of problems.
+
+In Mahout, the SVDRecommender provides an interface to build recommenders based on matrix factorization.
+The idea behind it is to project the users and items onto a feature space and to optimize U and M so that U \* (M^t) is as close to R as possible:
+
+     U is n * p user feature matrix, 
+     M is m * p item feature matrix, M^t is the conjugate transpose of M,
+     R is n * m rating matrix,
+     n is the number of users,
+     m is the number of items,
+     p is the number of features
+
+We usually use RMSE to represent the deviation between predictions and actual ratings.
+RMSE is defined as the square root of the mean of the squared errors over all known user-item ratings.
+So our matrix factorization target can be mathematically defined as:
+
+     find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5))
+     
+     SSE = sum(e(u,i)^2)
+     e(u,i) = r(u, i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, .. p - 1
+     K is the number of known user item ratings.
+
+<a name="MatrixFactorization-Factorizers"></a>
+
+Mahout has implemented matrix factorization based on 
+
+    (1) SGD(Stochastic Gradient Descent)
+    (2) ALSWR(Alternating-Least-Squares with Weighted-λ-Regularization).
+
+## SGD
+
+Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
+
+       Q(w) = sum(Q_i(w)), 
+
+where w is the parameter vector to be estimated,
+      Q(w) is the objective function, expressed as a sum of differentiable functions,
+      Q_i(w) is the term associated with the i-th observation in the data set.
+
+In practice, w is estimated using an iterative method, updating at each single sample until an approximate minimum is obtained,
+
+      w = w - alpha * (d(Q_i(w))/dw),
+where alpha is the learning rate,
+      (d(Q_i(w))/dw) is the first derivative of Q_i(w) with respect to w.
+
+In matrix factorization, the RatingSGDFactorizer class implements the SGD with w = (U, M) and objective function Q(w) = sum(Q(u,i)),
+
+       Q(u,i) =  sum(e(u,i) * e(u,i)) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2
+
+where Q(u, i) is the objective function for user u and item i,
+      e(u, i) is the error between predicted rating and actual rating,
+      U[u,] is the feature vector of user u,
+      M[i,] is the feature vector of item i,
+      lambda is the regularization parameter to prevent overfitting.
+
+The algorithm is sketched as follows:
+  
+      init U and M with small random values drawn from a Gaussian distribution   
+      
+      for(iter = 0; iter < numIterations; iter++)
+      {
+          for(user u and item i with rating R[u,i])
+          {
+              predicted_rating = U[u,] *  M[i,]^t //dot product of feature vectors between user u and item i
+              err = R[u, i] - predicted_rating
+              //adjust U[u,] and M[i,]
+              // p is the number of features
+              for(f = 0; f < p; f++) {
+                 NU[u,f] = U[u,f] - alpha * d(Q(u,i))/d(U[u,f]) //optimize U[u,f]
+                         = U[u, f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f]) 
+              }
+              for(f = 0; f < p; f++) {
+                 M[i,f] = M[i,f] - alpha * d(Q(u,i))/d(M[i,f])  //optimize M[i,f] 
+                        = M[i,f] + alpha * (e(u,i) * U[u,f] - lambda * M[i,f]) 
+              }
+              U[u,] = NU[u,]
+          }
+      }
+
+## SVD++
+
+SVD++ is an enhancement of the SGD matrix factorization. 
+
+It could be considered as an integration of latent factor model and neighborhood based model, considering not only how users rate, but also who has rated what. 
+
+The complete model is a sum of 3 sub-models with complete prediction formula as follows: 
+    
+    pr(u,i) = b[u,i] + fm + nm   //user u and item i
+    
+    pr(u,i) is the predicted rating of user u on item i,
+    b[u,i] = U + b(u) + b(i)
+    fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])),  j is an item in N(u)
+    nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item in N(i;u;k)
+
+The associated regularized squared error function to be minimized is:
+
+    {sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i]))  + lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))}
+
+b[u,i] is the baseline estimate of user u's predicted rating on item i. U is users' overall average rating and b(u) and b(i) indicate the observed deviations of user u and item i's ratings from average. 
+
+The baseline estimate is to adjust for the user and item effects - i.e, systematic tendencies for some users to give higher ratings than others and tendencies
+for some items to receive higher ratings than other items.
+
+fm is the latent factor model to capture the interactions between user and item via a feature layer. q[i,] is the feature vector of item i, and the rest of the formula represents user u with a user feature vector plus a sum of the features of items in N(u);
+N(u) is the set of items for which user u has expressed a preference, and y[j,] is the feature vector of an item in N(u).
+
+nm is an extension of the classic item-based neighborhood model. 
+It captures not only the user's explicit ratings but also the user's implicit preferences. R(i;u;k) is the set of items rated by user u, restricted to the k items most similar to item i. r[u,j0] is the actual rating of user u on item j0, 
+b[u,j0] is the corresponding baseline estimate.
+
+The difference between r[u,j0] and b[u,j0] is weighted by a parameter w[i,j0], which can be thought of as the similarity between items i and j0. 
+
+N(i;u;k) is the set of items, among the k most similar to item i, for which user u has expressed an implicit preference.
+c[i,j1] is the parameter to be estimated. 
+
+The value of w[i,j0] and c[i,j1] could be treated as the significance of the 
+user's explicit rating and implicit preference respectively.
+
+The parameters b, y, q, w, c are to be determined by minimizing the associated regularized squared error function through gradient descent. We loop over all known ratings and, for a given training case r[u,i], we apply gradient descent on the error function and modify the parameters by moving in the opposite direction of the gradient.
+
+For a complete analysis of the SVD++ algorithm,
+please refer to the paper [Yehuda Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD 2008](http://research.yahoo.com/files/kdd08koren.pdf).
+ 
+In Mahout, the SVDPlusPlusFactorizer class is a simplified implementation of the SVD++ algorithm. It mainly uses the latent factor model with an item feature vector, a user feature vector and the user's preferences, with pr(u,i) = fm = (q[i,]) \* (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])), and the parameters to be determined are q, p, y. 
+
+The update to q, p, y in each gradient descent step is:
+
+      err(u,i) = r[u,i] - pr[u,i]
+      q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lambda * q[i,]) 
+      p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,])
+      for j that is an item in N(u):
+         y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,])
+
+where alpha is the learning rate of gradient descent and N(u) is the set of items for which user u has expressed a preference.
+
+## Parallel SGD
+
+Mahout has a parallel SGD implementation in ParallelSGDFactorizer class. It shuffles the user ratings in every iteration and 
+generates splits on the shuffled ratings. Each split is handled by a thread to update the user features and item features using 
+vanilla SGD. 
+
+The implementation could be traced back to a lock-free version of SGD based on paper 
+[Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf).
+
+## ALSWR
+
+ALSWR is an iterative algorithm to solve the low rank factorization of user feature matrix U and item feature matrix M.  
+The loss function to be minimized is formulated as the sum of squared errors plus [Tikhonov regularization](http://en.wikipedia.org/wiki/Tikhonov_regularization):
+
+     L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2))
+ 
+At the beginning of the algorithm, M is initialized with the average item ratings as its first row and random numbers for the remaining rows.  
+
+In every iteration, we fix M and solve for U by minimizing the cost function L(R, U, M), then we fix U and solve for M by minimizing 
+the cost function similarly. The iterations continue until a certain stopping criterion is met.
+
+To solve for the matrix U when M is given, each user's feature vector is calculated by solving a regularized linear least-squares problem 
+using the items the user has rated and their feature vectors:
+
+      1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 
+
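+Setting this derivative to zero yields a small regularized linear system per user. A minimal NumPy sketch of that per-user solve (illustrative only; Mahout's ALSWRFactorizer is implemented in Java):
+
+      import numpy as np
+
+      def solve_user(M_rated, ratings, lam):
+          # M_rated: (n_u x p) matrix of feature vectors of the items this user rated
+          # ratings: length-n_u vector of the user's ratings for those items
+          n_u, p = M_rated.shape
+          A = M_rated.T.dot(M_rated) + lam * n_u * np.eye(p)
+          b = M_rated.T.dot(ratings)
+          return np.linalg.solve(A, b)   # the user's new feature vector U[u,]
+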
+Similarly, when M is updated, each item's feature vector is calculated by solving a regularized linear least-squares problem using the ratings and 
+feature vectors of the users that have rated the item:
+
+      1/2 * d(L(R,U,M)) / d(M[i,f]) = 0
+
+The ALSWRFactorizer class is a non-distributed implementation of ALSWR using multi-threading to dispatch the computation among several threads.
+Mahout also offers a [parallel map-reduce implementation](https://mahout.apache.org/users/recommender/intro-als-hadoop.html).
+
+<a name="MatrixFactorization-Reference"></a>
+# Reference:
+
+[Stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
+    
+[ALSWR](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
new file mode 100644
index 0000000..8ba5b28
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
@@ -0,0 +1,277 @@
+---
+layout: default
+title: Recommender Documentation
+theme:
+    name: retro-mahout
+---
+
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system using a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+
+* **DataModel**
+* **UserSimilarity**
+* **ItemSimilarity**
+* **UserNeighborhood**
+* **Recommender**
+
+Subpackages of *org.apache.mahout.cf.taste.impl* hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! 
+
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+
+![doc](../../images/taste-architecture.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+**GenericUserBasedRecommender** or **GenericItemBasedRecommender**,
+possibly decorated by **CachingRecommender**.
+
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A **DataModel** is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Be sure to wrap this with a **ReloadFromJDBCDataModel** to get good performance! Mahout provides **MySQLJDBCDataModel**, for example, to access preference data from a database via JDBC and MySQL. Another exists for PostgreSQL. Mahout also provides a **FileDataModel**, which is fine for small applications.
+
+Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A **Preference** object or **PreferenceArray** object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A **UserSimilarity** defines a notion of similarity between two users. This is
+a crucial part of a recommendation engine. These are attached to a
+**Neighborhood** implementation. **ItemSimilarity** is analogous, but finds the
+similarity between items.
+
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A **UserNeighborhood**
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a **UserSimilarity** to
+operate.
+
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender systems. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus suitable for small data sets (roughly, less than ten million ratings).
+We'll start with an example of this.
+
+First, create a **DataModel** of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the **PearsonCorrelationSimilarity** implementation of **UserSimilarity**
+as our user correlation algorithm:
+
+
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+
+
+Now we create a **UserNeighborhood** algorithm. Here we use nearest-3:
+
+
+    UserNeighborhood neighborhood =
+    	  new NearestNUserNeighborhood(3, userSimilarity, model);
+    
+Now we can create our **Recommender**, and add a caching decorator:
+    
+
+    Recommender recommender =
+	  new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+    
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+
+    List<RecommendedItem> recommendations =
+	  cachingRecommender.recommend(1234, 10);
+
+    
+## Item-based Recommender
+    
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendation not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which makes them much faster. For large data sets, item-based recommenders
+are more appropriate.
+    
+Let's start over, again with a **FileDataModel** to start:
+    
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+    
+We'll also need an **ItemSimilarity**. We could use
+**PearsonCorrelationSimilarity**, which computes item similarity in realtime,
+but, this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+**GenericItemSimilarity**: 
+    
+
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
+	  ...;
+    ItemSimilarity itemSimilarity =
+	  new GenericItemSimilarity(correlations);
+
+
+    
+Then we can finish as before to produce recommendations:
+    
+
+    Recommender recommender =
+	  new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    ...
+    List<RecommendedItem> recommendations =
+	  cachingRecommender.recommend(1234, 10);
+
+
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:-DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
+Also consider the following tips:
+
+* Use **CachingRecommender** on top of your custom **Recommender** implementation.
+* When using **JDBCDataModel**, make sure you wrap it with the **ReloadFromJDBCDataModel** to load data into memory! 
+
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in org.apache.mahout.cf.taste.eval
+
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+    	  new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try a *RecommenderIRStatsEvaluator*, which presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful.
+
+
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive Algorithms for Collaborative Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainity in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an open architecture for collaborative filtering of netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic framework for performing collaborative filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
new file mode 100644
index 0000000..2b090e6
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Recommender First-Timer FAQ
+theme:
+    name: retro-mahout
+---
+
+# Recommender First Timer Dos and Don'ts
+
+Many people with an interest in recommenders arrive at Mahout because they're
+building a first recommender system. Some starting questions have been
+asked enough times to warrant a FAQ collecting advice and rules of thumb for
+newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in Action](http://manning.com/owen/).
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. They
+are simpler, have fewer requirements, and are more flexible. 
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed solution makes sense when nearing this scale.
+However most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
+
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put on
+top a ReloadFromJDBCDataModel.
+
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically -- perhaps on the order of minutes. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
new file mode 100644
index 0000000..da17b38
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
@@ -0,0 +1,133 @@
+---
+layout: default
+title: User Based Recommender in 5 Minutes
+theme:
+    name: retro-mahout
+---
+
+# Creating a User-Based Recommender in 5 minutes
+
+##Prerequisites
+
+Create a Java project in your favorite IDE and make sure Mahout is on the classpath. The easiest way to accomplish this is by importing it via Maven as described on the [Quickstart](/users/basics/quickstart.html) page.
+
+
+## Dataset
+
+Mahout's recommenders expect interactions between users and items as input. The easiest way to supply such data to Mahout is in the form of a textfile, where every line has the format *userID,itemID,value*. Here *userID* and *itemID* refer to a particular user and a particular item, and *value* denotes the strength of the interaction (e.g. the rating given to a movie).
+
+In this example, we'll use some made up data for simplicity. Create a file called "dataset.csv" and copy the following example interactions into the file. 
+
+<pre>
+1,10,1.0
+1,11,2.0
+1,12,5.0
+1,13,5.0
+1,14,5.0
+1,15,4.0
+1,16,5.0
+1,17,1.0
+1,18,5.0
+2,10,1.0
+2,11,2.0
+2,15,5.0
+2,16,4.5
+2,17,1.0
+2,18,5.0
+3,11,2.5
+3,12,4.5
+3,13,4.0
+3,14,3.0
+3,15,3.5
+3,16,4.5
+3,17,4.0
+3,18,5.0
+4,10,5.0
+4,11,5.0
+4,12,5.0
+4,13,0.0
+4,14,2.0
+4,15,3.0
+4,16,1.0
+4,17,4.0
+4,18,1.0
+</pre>
+
+## Creating a user-based recommender
+
+Create a class called *SampleRecommender* with a main method.
+
+The first thing we have to do is load the data from the file. Mahout's recommenders use an interface called *DataModel* to handle interaction data. You can load our made up interactions like this:
+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+</pre>
+
+In this example, we want to create a user-based recommender. The idea behind this approach is that, when we want to compute recommendations for a particular user, we look for other users with a similar taste and pick the recommendations from their items. For finding similar users, we have to compare their interactions. There are several methods for doing this. One popular method is to compute the [correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) between their interactions. In Mahout, you use this method as follows:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
+</pre>
+
+The next thing we have to do is to define which similar users we want to leverage for the recommender. For the sake of simplicity, we'll use all that have a similarity greater than *0.1*. This is implemented via a *ThresholdUserNeighborhood*:
+
+<pre>UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);</pre>
+
+Now we have all the pieces to create our recommender:
+
+<pre>
+UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
+</pre>
+        
+We can easily ask the recommender for recommendations now. If we wanted to get three items recommended for the user with *userID* 2, we would do it like this:
+	
+
+<pre>
+List<RecommendedItem> recommendations = recommender.recommend(2, 3);
+for (RecommendedItem recommendation : recommendations) {
+  System.out.println(recommendation);
+}
+</pre>
+
+
+Congratulations, you have built your first recommender!
+
+
+## Evaluation
+
+You might ask yourself how to make sure that your recommender returns good results. Unfortunately, the only way to be really sure about the quality is by doing an A/B test with real users in a live system.
+
+We can, however, try to get a feel for the quality by doing a statistical offline evaluation. Just keep in mind that this does not replace a test with real users!
+
+One way to check whether the recommender returns good results is by doing a **hold-out** test. We partition our dataset into two sets: a training set consisting of 90% of the data and a test set consisting of 10%. Then we train our recommender using the training set and see how well it predicts the unknown interactions in the test set.
+
+To test our recommender, we create a class called *EvaluateRecommender* with a main method and add an inner class called *MyRecommenderBuilder* that implements the *RecommenderBuilder* interface. We implement the *buildRecommender* method and make it setup our user-based recommender:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
+UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, dataModel);
+return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
+</pre>
+
+Now we have to create the code for the test. We'll check how much the recommender misses the real interaction strength on average. We employ an *AverageAbsoluteDifferenceRecommenderEvaluator* for this. The following code shows how to put the pieces together and run a hold-out test: 
+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
+RecommenderBuilder builder = new MyRecommenderBuilder();
+double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
+System.out.println(result);
+</pre>
+
+Note: if you run this test multiple times, you will get different results, because the splitting into training set and test set is done randomly. 
+
+
+
+
+
+
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/powered-by-mahout.md b/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
new file mode 100644
index 0000000..cb7c039
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
@@ -0,0 +1,129 @@
+---
+layout: default
+title: Powered By Mahout
+theme:
+    name: retro-mahout
+---
+
+# Powered by Mahout
+
+Are you using Mahout to do Machine Learning? <a href="https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html">Care to share</a>? Developers of the project are always happy to learn about new happy users with interesting use cases.
+
+*Links here do NOT imply
+endorsement by Mahout, its committers or the Apache Software Foundation and
+are for informational purposes only.*
+
+<a name="PoweredByMahout-CommercialUse"></a>
+## Commercial Use
+
+* <a href="http://nosql.mypopescu.com/post/2082712431/hbase-and-hadoop-at-adobe">Adobe AMP</a> uses Mahout's clustering algorithms to increase video
+consumption by better user targeting. 
+* Accenture uses Mahout as a typical example in their [Hadoop Deployment Comparison Study](http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf)
+* [AOL](http://www.aol.com)
+ uses Mahout for shopping recommendations. See [slide deck](http://www.slideshare.net/kryton/the-data-layer)
+* [Booz Allen Hamilton](http://www.boozallen.com/)
+ uses Mahout's clustering algorithms. See [slide deck](http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010)
+* [Buzzlogic](http://www.buzzlogic.com)
+ uses Mahout's clustering algorithms to improve ad targeting
+* [Cull.tv](http://cull.tv/)
+ uses modified Mahout algorithms for content recommendations
+* ![DatamineLab](http://cdn.dataminelab.com/favicon.ico) [DataMine Lab](http://dataminelab.com)
+ uses Mahout's recommendation and clustering algorithms to improve our
+clients' ad targeting.
+* [Drupal](http://drupal.org/project/recommender)
+ uses Mahout to provide open source content recommendation solutions.
+* [Evolv ](http://www.evolvondemand.com)
+ uses Mahout for its Workforce Predictive Analytics platform.
+* [Foursquare](http://www.foursquare.com)
+ uses Mahout for its [recommendation engine](http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/).
+* [Idealo](http://www.idealo.de)
+ uses Mahout's recommendation engine.
+* [InfoGlutton](http://www.infoglutton.com)
+ uses Mahout's clustering and classification for various consulting
+projects.
+* [Intel](http://mark.chmarny.com/2013/07/thinking-big-about-data-at-intel.html)
+ ships Mahout as part of their Distribution for Apache Hadoop Software.
+* [Intela](http://www.intela.com/)
+ has implementations of Mahout's recommendation algorithms to select new
+offers to send to customers, as well as to recommend potential customers to
+current offers. We are also working on enhancing our offer categories by
+using the clustering algorithms.
+* ![iOffer](http://ioffer.com/favicon.ico) [iOffer](http://www.ioffer.com)
+ uses Mahout's Frequent Pattern Mining and Collaborative Filtering to
+recommend items to users.
+* ![kau.li](http://kau.li/favicon.ico) [Kauli](http://kau.li/en)
+, a Japanese ad network, uses Mahout's clustering to handle clickstream
+data for predicting audiences' interests and intents.
+* [Linked.In](http://linkedin.com)
+ Historically, we have used R for model training. We have recently started
+experimenting with Mahout for model training and are excited about it - also see
+ <a href="https://www.quora.com/LinkedIn-Recommendations/How-does-LinkedIns-recommendation-system-work?srid=XoeG&share=1">Hadoop World slides</a>
+.
+* [LucidWorks Big Data](http://www.lucidworks.com/products/lucidworks-big-data)
+ uses Mahout for clustering, duplicate document detection, phrase
+extraction and classification.
+* ![Mendeley](http://mendeley.com/favicon.ico) [Mendeley](http://mendeley.com)
+ uses Mahout to power Mendeley Suggest, a research article recommendation
+service.
+* ![Mippin](http://mippin.com/web/favicon.ico) [Mippin](http://mippin.com)
+ uses Mahout's collaborative filtering engine to recommend news feeds
+* [Mobage](http://www.slideshare.net/hamadakoichi/mobage-prmu-2011-mahout-hadoop)
+ uses Mahout in their analysis pipeline
+* ![Myrrix](http://myrrix.com/wp-content/uploads/2012/03/favicon.ico) [Myrrix](http://myrrix.com)
+ is a recommender system product built on Mahout.
+* ![Newscred](http://www.newscred.com/static/img/website/favicon.ico) [NewsCred](http://platform.newscred.com)
+ uses Mahout to generate clusters of news articles and to surface the
+important stories of the day
+* [Next Glass](http://nextglass.co/)
+ uses Mahout
+* [Predixion Software](http://predixionsoftware.com/)
+ uses Mahout’s algorithms to build predictive models on big data
+* <img src="http://www.radoop.eu/wp-content/uploads/favicon.png" width=15> [Radoop](http://radoop.eu)
+ provides a drag-n-drop interface for big data analytics, including Mahout
+clustering and classification algorithms
+* ![Researchgate](https://www.researchgate.net/favicon.ico) [ResearchGate](http://www.researchgate.net/), the professional network for scientists and researchers, uses Mahout's
+recommendation algorithms.
+* [Sematext](http://www.sematext.com/)
+ uses Mahout for its recommendation engine
+* [SpeedDate.com](http://www.speeddate.com)
+ uses Mahout's collaborative filtering engine to recommend member profiles
+* [Twitter](http://twitter.com)
+ uses Mahout's LDA implementation for user interest modeling
+* [Yahoo!](http://www.yahoo.com)
+ Mail uses Mahout's Frequent Pattern Set Mining.  See [slides](http://www.slideshare.net/hadoopusergroup/mail-antispam)
+* [365Media ](http://365media.com/)
+ uses *Mahout's* Classification and Collaborative Filtering algorithms in
+its Real-time system named [UPTIME](http://uptime.365media.com/)
+ and 365Media/Social
+
+<a name="PoweredByMahout-AcademicUse"></a>
+## Academic Use
+
+* [Dicode](https://www.dicode-project.eu/)
+ project uses Mahout's clustering and classification algorithms on top of
+HBase.
+* The course [Large Scale Data Analysis and Data Mining](http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-3/)
+ at TU Berlin uses Mahout to teach students about the parallelization of data
+mining problems with Hadoop and Map/Reduce
+* Mahout is used at Carnegie Mellon University, as a comparable platform to [GraphLab](http://www.graphlab.ml.cmu.edu/)
+
+* The [ROBUST project](http://www.robust-project.eu/)
+, co-funded by the European Commission, employs Mahout in the large scale
+analysis of online community data.
+* Mahout is used for research and data processing at [Nagoya Institute of Technology](http://www.nitech.ac.jp/eng/schools/grad/cse.html)
+, in the context of a large-scale citizen participation platform project,
+funded by the Ministry of Interior of Japan.
+* Several researchers within the [Digital Enterprise Research Institute](http://www.deri.ie)
+ at [NUI Galway](http://www.nuigalway.ie)
+ use Mahout for, e.g., topic mining and modelling of large corpora.
+* Mahout is used in the NoTube EU project.
+
+<a name="PoweredByMahout-PoweredByLogos"></a>
+## Powered By Logos
+
+Feel free to use our **Powered By** logos on your site:
+
+![powered by logo](https://mahout.apache.org/images/mahout-logo-poweredby-55.png)
+
+
+![powered by logo](https://mahout.apache.org/images/mahout-logo-poweredby-100.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
new file mode 100644
index 0000000..14dd276
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
@@ -0,0 +1,291 @@
+---
+layout: default
+title: Creating Vectors from Text
+theme:
+    name: retro-mahout
+---
+
+
+# Creating vectors from text
+<a name="CreatingVectorsfromText-Introduction"></a>
+# Introduction
+
+For clustering and classifying documents it is usually necessary to convert the raw text
+into vectors that can then be consumed by the clustering [Algorithms](algorithms.html).  The approaches for doing this are described below.
+
+<a name="CreatingVectorsfromText-FromLucene"></a>
+# From Lucene
+
+*NOTE: Your Lucene index must be created with the same version of Lucene
+used in Mahout.  As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't match you will likely get "Exception in thread "main"
+org.apache.lucene.index.CorruptIndexException: Unknown format version: -11"
+as an error.*
+
+Mahout has utilities that allow one to easily produce Mahout Vector
+representations from a Lucene (and Solr, since they share the same index format) index.
+
+For this, we assume you know how to build a Lucene/Solr index.	For those
+who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr)
+ as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
+index.	For those wanting to use just Lucene, see the [Lucene website](http://lucene.apache.org/core)
+ or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
+McCandless.
+
+To get started, get a fresh copy of Mahout from [GitHub](http://mahout.apache.org/developers/buildingmahout.html)
+ and make sure you are comfortable building it. Mahout defines interfaces and
+implementations for efficiently iterating over a data source (currently only
+Lucene is supported, but this should be extensible to databases, Solr, etc.)
+and producing a Mahout Vector file and term dictionary which can then be used
+for clustering.  The main code for driving this is the driver program located
+in the org.apache.mahout.utils.vectors package.  The driver program offers
+several input options, which can be displayed by specifying the --help
+option.  Examples of running the driver are included below:
+
+<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
+#### Generating an output file from a Lucene Index
+
+
+    $MAHOUT_HOME/bin/mahout lucene.vector 
+        --dir (-d) dir                     The Lucene directory      
+        --idField idField                  The field in the index    
+                                               containing the id.  If 
+                                               null, then the Lucene     
+                                               internal doc id is used   
+                                               which is prone to error   
+                                               if the underlying index   
+                                               changes                   
+        --output (-o) output               The output file           
+        --delimiter (-l) delimiter         The delimiter for         
+                                               outputting the dictionary 
+        --help (-h)                        Print out help            
+        --field (-f) field                 The field in the index    
+        --max (-m) max                         The maximum number of     
+                                               vectors to output.  If    
+                                               not specified, then it    
+                                               will loop over all docs   
+        --dictOut (-t) dictOut             The output of the         
+                                               dictionary                
+        --seqDictOut (-st) seqDictOut      The output of the         
+                                               dictionary as sequence    
+                                               file                      
+        --norm (-n) norm                   The norm to use,          
+                                               expressed as either a     
+                                               double or "INF" if you    
+                                               want to use the Infinite  
+                                               norm.  Must be greater or 
+                                               equal to 0.  The default  
+                                               is not to normalize       
+        --maxDFPercent (-x) maxDFPercent   The max percentage of     
+                                               docs for the DF.  Can be  
+                                               used to remove really     
+                                               high frequency terms.     
+                                               Expressed as an integer   
+                                               between 0 and 100.        
+                                               Default is 99.            
+        --weight (-w) weight               The kind of weight to     
+                                               use. Currently TF or      
+                                               TFIDF                     
+        --minDF (-md) minDF                The minimum document      
+                                               frequency.  Default is 1  
+        --maxPercentErrorDocs (-err) mErr  The max percentage of     
+                                               docs that can have a null 
+                                               term vector. These are    
+                                               noise document and can    
+                                               occur if the analyzer     
+                                               used strips out all terms 
+                                               in the target field. This 
+                                               percentage is expressed   
+                                               as a value between 0 and  
+                                               1. The default is 0.  
+  
+#### Example: Create 50 Vectors from an Index 
+
+    $MAHOUT_HOME/bin/mahout lucene.vector
+        --dir $WORK_DIR/wikipedia/solr/data/index 
+        --field body 
+        --dictOut $WORK_DIR/solr/wikipedia/dict.txt
+        --output $WORK_DIR/solr/wikipedia/out.txt 
+        --max 50
+
+
+This uses the index specified by --dir, takes the body field from it, and writes
+the vectors to the output file and the dictionary to dict.txt.  It only
+outputs 50 vectors.  If you don't specify --max, then all the documents in
+the index are output.
+
+<a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a>
+#### Example: Creating 50 Normalized Vectors from a Lucene Index using the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
+
+    $MAHOUT_HOME/bin/mahout lucene.vector 
+        --dir $WORK_DIR/wikipedia/solr/data/index 
+        --field body 
+        --dictOut $WORK_DIR/solr/wikipedia/dict.txt
+        --output $WORK_DIR/solr/wikipedia/out.txt 
+        --max 50 
+        --norm 2
+
+
+<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
+## From A Directory of Text documents
+Mahout has utilities to generate Vectors from a directory of text
+documents. Before creating the vectors, you need to convert the documents
+to SequenceFile format. SequenceFile is a Hadoop class which allows us to
+write arbitrary (key, value) pairs into it. The DocumentVectorizer requires the
+key to be a Text with a unique document id, and the value to be the Text
+content in UTF-8 format.
+
+You may find [Tika](http://tika.apache.org/) helpful in converting
+binary documents to text.
+
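+If you would rather build this SequenceFile yourself instead of using the
+*seqdirectory* utility described below, a minimal, purely illustrative sketch
+using the plain Hadoop API could look like the following (the document ids,
+contents and output path are placeholders):
+
+    import java.io.IOException;
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.FileSystem;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.SequenceFile;
+    import org.apache.hadoop.io.Text;
+
+    public class WriteDocsToSequenceFile {
+      public static void main(String[] args) throws IOException {
+        Configuration conf = new Configuration();
+        FileSystem fs = FileSystem.get(conf);
+        Path output = new Path("docs-seqfile");  // placeholder output path
+
+        // key = unique document id, value = document text in UTF-8
+        SequenceFile.Writer writer =
+            SequenceFile.createWriter(fs, conf, output, Text.class, Text.class);
+        try {
+          writer.append(new Text("/doc1.txt"), new Text("text of the first document"));
+          writer.append(new Text("/doc2.txt"), new Text("text of the second document"));
+        } finally {
+          writer.close();
+        }
+      }
+    }
+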
+<a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a>
+#### Converting directory of documents to SequenceFile format
+Mahout has a nifty utility which reads a directory path including its
+sub-directories and creates the SequenceFile in a chunked manner for us.
+
+    $MAHOUT_HOME/bin/mahout seqdirectory 
+        --input (-i) input                       Path to job input directory.   
+        --output (-o) output                     The directory pathname for     
+                                                     output.                        
+        --overwrite (-ow)                        If present, overwrite the      
+                                                     output directory before        
+                                                     running job                    
+        --method (-xm) method                    The execution method to use:   
+                                                     sequential or mapreduce.       
+                                                     Default is mapreduce           
+        --chunkSize (-chunk) chunkSize           The chunkSize in MegaBytes.    
+                                                     Defaults to 64                 
+        --fileFilterClass (-filter) fFilterClass The name of the class to use   
+                                                     for file parsing. Default:     
+                                                     org.apache.mahout.text.PrefixAdditionFilter                   
+        --keyPrefix (-prefix) keyPrefix          The prefix to be prepended to  
+                                                     the key                        
+        --charset (-c) charset                   The name of the character      
+                                                     encoding of the input files.   
+                                                     Default to UTF-8 {accepts: cp1252|ascii...}             
+        --help (-h)                              Print out help                 
+        --tempDir tempDir                        Intermediate output directory  
+        --startPhase startPhase                  First phase to run             
+        --endPhase endPhase                      Last phase to run  
+
+The output of seqdirectory will be a Sequence file < Text, Text > of all documents (/sub-directory-path/documentFileName, documentText).
+
+<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
+#### Creating Vectors from SequenceFile
+
+From the sequence file generated in the above step, run the following to
+generate vectors.
+
+
+    $MAHOUT_HOME/bin/mahout seq2sparse
+        --minSupport (-s) minSupport      (Optional) Minimum Support. Default       
+                                              Value: 2                                  
+        --analyzerName (-a) analyzerName  The class name of the analyzer            
+        --chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. Default       
+                                              Value: 100MB                              
+        --output (-o) output              The directory pathname for output.        
+        --input (-i) input                Path to job input directory.              
+        --minDF (-md) minDF               The minimum document frequency.  Default  
+                                              is 1                                      
+        --maxDFSigma (-xs) maxDFSigma     What portion of the tf (tf-idf) vectors   
+                                              to be used, expressed in times the        
+                                              standard deviation (sigma) of the         
+                                              document frequencies of these vectors.    
+                                              Can be used to remove really high         
+                                              frequency terms. Expressed as a double    
+                                              value. Good value to be specified is 3.0. 
+                                              In case the value is less than 0 no       
+                                              vectors will be filtered out. Default is  
+                                              -1.0.  Overrides maxDFPercent             
+        --maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF.    
+                                              Can be used to remove really high         
+                                              frequency terms. Expressed as an integer  
+                                              between 0 and 100. Default is 99.  If     
+                                              maxDFSigma is also set, it will override  
+                                              this value.                               
+        --weight (-wt) weight             The kind of weight to use. Currently TF   
+                                              or TFIDF. Default: TFIDF                  
+        --norm (-n) norm                  The norm to use, expressed as either a    
+                                              float or "INF" if you want to use the     
+                                              Infinite norm.  Must be greater or equal  
+                                              to 0.  The default is not to normalize    
+        --minLLR (-ml) minLLR             (Optional)The minimum Log Likelihood      
+                                              Ratio(Float)  Default is 1.0              
+        --numReducers (-nr) numReducers   (Optional) Number of reduce tasks.        
+                                              Default Value: 1                          
+        --maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams to  
+                                              create (2 = bigrams, 3 = trigrams, etc)   
+                                              Default Value:1                           
+        --overwrite (-ow)                 If set, overwrite the output directory    
+        --help (-h)                           Print out help                            
+        --sequentialAccessVector (-seq)   (Optional) Whether output vectors should  
+                                              be SequentialAccessVectors. Default is false;
+                                              true required for running some algorithms
+                                              (LDA,Lanczos)                                
+        --namedVector (-nv)               (Optional) If set, the output vectors      
+                                              are NamedVectors; default is false        
+        --logNormalize (-lnorm)           (Optional) If set, the output vectors      
+                                              are log-normalized; default is false
+
+
+
+This will create SequenceFiles of tokenized documents < Text, StringTuple >  (docID, tokenizedDoc) and vectorized documents < Text, VectorWritable > (docID, TF-IDF Vector).  
+
+In addition, seq2sparse will create SequenceFiles in the output directory for: a dictionary (wordIndex, word), a word frequency count (wordIndex, count) and a document frequency count (wordIndex, DFCount). 
+
+The --minSupport option is the minimum frequency a word needs to be considered as a feature; --minDF is the minimum number of documents a word needs to appear in; --maxDFPercent is the maximum value of (document frequency of a word / total number of documents), expressed as a percentage, for the word to be kept as a feature. These options are helpful for removing high-frequency features like stop words.
+
+The vectorized documents can then be used as input to many of Mahout's classification and clustering algorithms.
+
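+If you want to consume these vectors programmatically rather than through the
+command-line algorithms, you can iterate over the tfidf-vectors SequenceFile
+directly. A rough sketch (the part-file path below is a placeholder for
+whatever seq2sparse produced in your output directory):
+
+    import org.apache.hadoop.conf.Configuration;
+    import org.apache.hadoop.fs.FileSystem;
+    import org.apache.hadoop.fs.Path;
+    import org.apache.hadoop.io.SequenceFile;
+    import org.apache.hadoop.io.Text;
+    import org.apache.mahout.math.Vector;
+    import org.apache.mahout.math.VectorWritable;
+
+    public class PrintTfIdfVectors {
+      public static void main(String[] args) throws Exception {
+        Configuration conf = new Configuration();
+        FileSystem fs = FileSystem.get(conf);
+        // placeholder: one of the part files under <output>/tfidf-vectors
+        Path vectors = new Path("output/tfidf-vectors/part-r-00000");
+
+        SequenceFile.Reader reader = new SequenceFile.Reader(fs, vectors, conf);
+        Text docId = new Text();
+        VectorWritable vectorWritable = new VectorWritable();
+        while (reader.next(docId, vectorWritable)) {
+          Vector vector = vectorWritable.get();
+          System.out.println(docId + ": " + vector.getNumNondefaultElements() + " non-zero terms");
+        }
+        reader.close();
+      }
+    }
+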
+#### Example: Creating Normalized [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) Vectors from a directory of text documents using [trigrams](http://en.wikipedia.org/wiki/N-gram) and the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
+Create sequence files from the directory of text documents:
+    
+    $MAHOUT_HOME/bin/mahout seqdirectory 
+        -i $WORK_DIR/reuters 
+        -o $WORK_DIR/reuters-out-seqdir 
+        -c UTF-8
+        -chunk 64
+        -xm sequential
+
+Vectorize the documents using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%.
+
+    $MAHOUT_HOME/bin/mahout seq2sparse 
+        -i $WORK_DIR/reuters-out-seqdir/ 
+        -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans 
+        --namedVector
+        -wt tfidf
+        -ng 3
+        -n 2
+        --maxDFPercent 85 
+
+The sequence file in the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be used as input to the Mahout [k-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html) clustering algorithm.
+
+<a name="CreatingVectorsfromText-Background"></a>
+## Background
+
+* [Discussion on centroid calculations with sparse vectors](http://markmail.org/thread/l5zi3yk446goll3o)
+
+<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
+## Converting existing vectors to Mahout's format
+
+If you are in the happy position of already owning a processing pipeline for
+your documents (texts, images, or whatever items you wish to treat), the
+question arises of how to convert its vectors into the Mahout vector
+format. Probably the easiest way is to implement your own
+Iterable<Vector> (called VectorIterable in the example below) and then
+reuse the existing VectorWriter classes:
+
+
+    // SequenceFile.createWriter returns a SequenceFile.Writer, so wrap it in a
+    // SequenceFileVectorWriter (from org.apache.mahout.utils.vectors.io) to get a
+    // VectorWriter. Vectors are written as VectorWritable values.
+    VectorWriter vectorWriter = new SequenceFileVectorWriter(
+        SequenceFile.createWriter(filesystem,
+                                  configuration,
+                                  outfile,
+                                  LongWritable.class,
+                                  VectorWritable.class));
+
+    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+    vectorWriter.close();
+
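+The VectorIterable used above stands for an adapter class that you provide
+yourself. A minimal, purely illustrative sketch over an in-memory list might
+look like this; a real implementation would stream the vectors out of your own
+pipeline instead:
+
+    import java.util.Arrays;
+    import java.util.Iterator;
+    import java.util.List;
+
+    import org.apache.mahout.math.DenseVector;
+    import org.apache.mahout.math.Vector;
+
+    // Illustrative adapter exposing your pipeline's vectors as Iterable<Vector>.
+    public class VectorIterable implements Iterable<Vector> {
+
+      private final List<Vector> vectors = Arrays.<Vector>asList(
+          new DenseVector(new double[] {1.0, 2.0, 3.0}),
+          new DenseVector(new double[] {4.0, 5.0, 6.0}));
+
+      public Iterator<Vector> iterator() {
+        return vectors.iterator();
+      }
+    }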

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/creating-vectors.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/creating-vectors.md b/website/old_site_migration/needs_work_priority/creating-vectors.md
new file mode 100644
index 0000000..10cbd8e
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/creating-vectors.md
@@ -0,0 +1,16 @@
+---
+layout: default
+title: Creating Vectors
+theme:
+    name: retro-mahout
+---
+
+
+<a name="CreatingVectors-UtilitiesforCreatingVectors"></a>
+# Utilities for Creating Vectors
+
+1. [Text](creating-vectors-from-text.html) ... utilities to turn plain text into Mahout vectors.
+
+1. Mahout also has rudimentary support for the arff file format. See [arff junit doc](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/arff/package-summary.html).
+
+1. There is also support for reading vectors from [csv files](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/csv/package-summary.html).

