mahout-user mailing list archives

From Danny Bickson <danny.bick...@gmail.com>
Subject Re: How to input a matrix to use SVD in mahout
Date Fri, 23 Sep 2011 08:41:07 GMT
Hi!
You can find detailed Java code to convert your example to Mahout SVD format
on my blog here:
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html

Since I know some Chinese users are blocked from Google websites, here is the
content:

Best,

Danny Bickson

Friday, February 4, 2011
Mahout - SVD matrix factorization - formatting input matrix

Converting Input Format into Mahout's SVD Distributed Matrix Factorization
Solver

Purpose
The code below converts a matrix from CSV format, one non-zero entry per line:
<from row>,<to col>,<value>\n
into Mahout's SVD solver format.


For example, the 3x3 matrix:
0    1.0 2.1
3.0  4.0 5.0
-5.0 6.2 0


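will be given as input in a CSV file as follows (indices are shown 0-based
here; note that the code below subtracts 1 from each index, so it expects
1-based indices in the actual file):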
1,0,3.0
2,0,-5.0
0,1,1.0
1,1,4.0
2,1,6.2
0,2,2.1
1,2,5.0
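
To make the listing above concrete, here is a minimal sketch that produces such column-sorted triplets from a small dense matrix. The class name DenseToTriplets and the indexBase parameter are assumptions introduced for illustration; they are not part of Mahout or of the original post. The listing above uses 0-based indices, while the converter code subtracts 1 from every index, so indexBase lets you emit either convention.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Mahout): dense matrix -> column-sorted CSV triplets. */
final class DenseToTriplets {

    /**
     * Emits one "row,col,value" line per non-zero entry, column by column,
     * so the output is already sorted by the second field.
     * indexBase is added to every index: 0 for 0-based, 1 for 1-based output.
     */
    static List<String> toColumnSortedCsv(double[][] matrix, int indexBase) {
        List<String> lines = new ArrayList<>();
        if (matrix.length == 0) {
            return lines;
        }
        int cols = matrix[0].length;
        for (int col = 0; col < cols; col++) {            // outer loop over columns
            for (int row = 0; row < matrix.length; row++) {
                double v = matrix[row][col];
                if (v != 0.0) {                            // zero entries are omitted
                    lines.add((row + indexBase) + "," + (col + indexBase) + "," + v);
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        double[][] m = {
            {0.0, 1.0, 2.1},
            {3.0, 4.0, 5.0},
            {-5.0, 6.2, 0.0}
        };
        // Pass 1 instead of 0 to get the 1-based form the converter reads.
        for (String line : toColumnSortedCsv(m, 0)) {
            System.out.println(line);
        }
    }
}
```

With indexBase 0 this reproduces the seven lines shown above; with indexBase 1 every index is shifted up by one.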

NOTE: I ASSUME THE MATRIX IS SORTED BY COLUMN ORDER (NAMELY BY THE SECOND
FIELD OF THE CSV FILE)
This code is based on code by Danny Leshem, ContextIn.
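
If your file is not already sorted by column, something like the following sketch can check and fix that for small inputs. The class and method names (SortTripletsByColumn, sortByColumn, isSortedByColumn) are hypothetical and the sort is done in memory, which is an assumption that will not hold for very large files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: checks or enforces the "sorted by column" assumption. */
final class SortTripletsByColumn {

    /** Parses the i-th comma-separated field of a triplet line as an int. */
    static int field(String line, int i) {
        return Integer.parseInt(line.split(",")[i].trim());
    }

    /** Returns a copy of the lines sorted by column (field 1), then by row (field 0). */
    static List<String> sortByColumn(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator
                .comparingInt((String l) -> field(l, 1))
                .thenComparingInt(l -> field(l, 0)));
        return sorted;
    }

    /** True if the column index never decreases from one line to the next. */
    static boolean isSortedByColumn(List<String> lines) {
        int last = Integer.MIN_VALUE;
        for (String l : lines) {
            int col = field(l, 1);
            if (col < last) {
                return false;
            }
            last = col;
        }
        return true;
    }
}
```

For files too large to hold in memory, an external sort such as the Unix command `sort -t, -k2,2n -k1,1n` is likely a better fit.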

Command line arguments:
args[0] - path to the csv input file
args[1] - cardinality of the matrix (number of columns)
args[2] - path to the resulting Mahout SVD input file
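
The converter reads these three arguments without checking them. As a rough sketch of how they could be validated first, the hypothetical ConvertArgs class below (not part of the original code) rejects a wrong argument count or a non-positive cardinality before any file is opened.

```java
/** Hypothetical holder for the three command line arguments described above. */
final class ConvertArgs {
    final String csvPath;     // args[0]
    final int cardinality;    // args[1]
    final String outputPath;  // args[2]

    private ConvertArgs(String csvPath, int cardinality, String outputPath) {
        this.csvPath = csvPath;
        this.cardinality = cardinality;
        this.outputPath = outputPath;
    }

    /** Parses and validates the arguments, throwing IllegalArgumentException on bad input. */
    static ConvertArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                    "usage: Convert2SVD <input.csv> <cardinality> <output file>");
        }
        int cardinality;
        try {
            cardinality = Integer.parseInt(args[1]);
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException("cardinality must be an integer: " + args[1], ex);
        }
        if (cardinality <= 0) {
            throw new IllegalArgumentException("cardinality must be positive: " + cardinality);
        }
        return new ConvertArgs(args[0], cardinality, args[2]);
    }
}
```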

Method:
The code below goes over the csv file and, for each matrix column, creates
a SequentialAccessSparseVector which contains all the non-zero row entries
for this column.
It then appends the column vector to the output file.
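
The grouping step described above can be sketched without any Mahout or Hadoop dependency. In this illustration a TreeMap of row index to value stands in for SequentialAccessSparseVector and a LinkedHashMap stands in for the output file. The class name GroupByColumnSketch is hypothetical, and indices are kept exactly as they appear in the input (no 1-based to 0-based shift).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of the per-column grouping; plain maps replace Mahout vectors. */
final class GroupByColumnSketch {

    /** Returns column index -> (row index -> value) for column-sorted "row,col,value" lines. */
    static Map<Integer, TreeMap<Integer, Double>> group(List<String> lines) {
        Map<Integer, TreeMap<Integer, Double>> columns = new LinkedHashMap<>();
        TreeMap<Integer, Double> current = null;
        int lastCol = -1;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;                                   // skip blank lines
            }
            String[] f = line.split(",");
            int row = Integer.parseInt(f[0].trim());
            int col = Integer.parseInt(f[1].trim());
            double val = Double.parseDouble(f[2].trim());
            if (current == null || col != lastCol) {        // a new column starts here
                if (columns.containsKey(col)) {
                    throw new IllegalArgumentException("input is not sorted by column: " + line);
                }
                current = new TreeMap<>();
                columns.put(col, current);
                lastCol = col;
            }
            current.put(row, val);                          // record the non-zero row entry
        }
        return columns;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
                "1,0,3.0", "2,0,-5.0", "0,1,1.0", "1,1,4.0", "2,1,6.2", "0,2,2.1", "1,2,5.0");
        System.out.println(group(csv));
    }
}
```

Because a column is closed as soon as a different column index appears, unsorted input would split one column into several vectors; the sketch rejects that case instead of silently producing duplicates.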

Compilation:
Copy the Java code below into a Java file named Convert2SVD.java.
Add both the Mahout and Hadoop jars to your IDE project path. Alternatively, a
command line option for compilation is given below.



import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

/**
 * Code for converting CSV format to Mahout's SVD format.
 * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE SECOND FIELD).
 *
 * @author Danny Bickson, CMU
 */
public class Convert2SVD {

    public static int Cardinality;

    /**
     * @param args[0] - input csv file
     * @param args[1] - cardinality (length of vector)
     * @param args[2] - output file for svd
     */
    public static void main(String[] args) {
        try {
            Cardinality = Integer.parseInt(args[1]);
            final Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                    new Path(args[2]), IntWritable.class, VectorWritable.class,
                    CompressionType.BLOCK);

            final IntWritable key = new IntWritable();
            final VectorWritable value = new VectorWritable();

            String thisLine;
            BufferedReader br = new BufferedReader(new FileReader(args[0]));
            Vector vector = null;
            int from = -1, to = -1;
            int last_to = -1;
            float val = 0;
            int total = 0;    // number of columns opened so far
            int nnz = 0;      // number of non-zero entries read
            int e = 0;        // number of non-zero entries written
            int max_to = 0;
            int max_from = 0;

            while ((thisLine = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(thisLine, ",");
                while (st.hasMoreTokens()) {
                    from = Integer.parseInt(st.nextToken()) - 1; // convert from 1-based to zero-based
                    to = Integer.parseInt(st.nextToken()) - 1;   // convert from 1-based to zero-based
                    val = Float.parseFloat(st.nextToken());
                    if (max_from < from) max_from = from;
                    if (max_to < to) max_to = to;
                    if (from < 0 || to < 0 || to > Cardinality || val == 0.0)
                        throw new NumberFormatException("wrong data from: " + from
                                + " to: " + to + " val: " + val);
                }

                // a new column starts, so write out the vector of the previous column
                if (last_to != to && last_to != -1) {
                    value.set(vector);
                    writer.append(key, value);
                    e += vector.getNumNondefaultElements();
                }
                // a new column is observed, open a new vector for it
                if (last_to != to) {
                    vector = new SequentialAccessSparseVector(Cardinality);
                    key.set(to);
                    total++;
                }

                // set the non-zero row entry in the current column
                vector.set(from, val);
                nnz++;

                if (nnz % 1000000 == 0) {
                    System.out.println("Col" + total + " nnz: " + nnz);
                }
                last_to = to;
            } // end while
            br.close();

            // write the last column; it was already counted in total when it was
            // opened, so total is not incremented again here
            if (vector != null) {
                value.set(vector);
                writer.append(key, value);
                e += vector.getNumNondefaultElements();
            }

            writer.close();
            System.out.println("Wrote a total of " + total + " cols " + " nnz: " + nnz);
            if (e != nnz)
                System.err.println("Bug: missing edges! we only got " + e);

            System.out.println("Highest column: " + max_to + " highest row: " + max_from);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}



A second option to compile this file is to create a Makefile, with the
following in it:

all:
	javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java

Note that you will have to change the location of the jars to point to where
your jars are stored.

Example of running this conversion on the Netflix data:

java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq
Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Row241 nnz: 1000000
Row381 nnz: 2000000
Row571 nnz: 3000000
Row789 nnz: 4000000
Row1046 nnz: 5000000
Row1216 nnz: 6000000
Row1441 nnz: 7000000

...



2011/9/23 悟统 <junwei.wang@alipay.com>

> Hi, all
> I am studying Mahout. I would like to use SVD in Mahout with a matrix.
> The matrix is like this:
> 1 0 0 0 0
> 2 4 1 0.5 2
> 2.1 2 4 0 1
> -1.8 2 1 5 1
> 0 3.4 5.9 3 9
>
> How do I input it into Mahout SVD?
>
> ________________________________
>
> This email (including any attachments) is confidential and may be legally
> privileged. If you received this email in error, please delete it
> immediately and do not copy it or use it for any purpose or disclose its
> contents to any other person. Thank you.
>
>
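> This email (including any attachments) may contain confidential information
> and is protected by law. If you are not the intended recipient, please
> delete this email immediately. Please do not copy this email, use it for
> any other purpose, or disclose its contents. Thank you.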
>
