mahout-user mailing list archives

From myn <...@163.com>
Subject Re:Re: Re: is there some place to study Singular Value Decomposition algorithms
Date Mon, 29 Aug 2011 11:03:59 GMT
The best way is to read the source code.
 
@_@




At 2011-08-29 16:02:57,"Lance Norskog" <goksron@gmail.com> wrote:
>'R' also has an SVD implementation, svd(), directly in the base package.
>
>There are a few answers to your question:
>1) What is SVD? The video lecture above will help. Also, searching for
>'singular value decomposition' on Baidu finds a lot of basic explanations.
>2) Why do you want it? In one pass it creates a few different, unique
>views of what is going on inside your dataset.
>3) Mahout Distributed Matrix code, DistributedLanczos etc. are
>implementations specifically for large-scale problems. There are sub-parts
>of SVD that you may not need for your problem, and these jobs avoid some of
>the work.
>
>Until you have a solid grasp of what SVD can tell you, there is no point in
>trying the distributed Mahout jobs. The SingularValueDecomposition class in
>Mahout has served me well in my research.
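>
>For example, a minimal in-memory sketch (the matrix values here are
>illustrative, and the JAMA-style method names are worth verifying against
>your Mahout version):
>
>import org.apache.mahout.math.DenseMatrix;
>import org.apache.mahout.math.Matrix;
>import org.apache.mahout.math.SingularValueDecomposition;
>
>public class SvdDemo {
>    public static void main(String[] args) {
>        // a small dense matrix; use a sparse type for real data
>        Matrix a = new DenseMatrix(new double[][] {
>            {  0.0, 1.0, 2.1 },
>            {  3.0, 4.0, 5.0 },
>            { -5.0, 6.2, 0.0 }
>        });
>        SingularValueDecomposition svd = new SingularValueDecomposition(a);
>        // singular values are returned in descending order
>        for (double s : svd.getSingularValues()) {
>            System.out.println("sigma = " + s);
>        }
>        System.out.println("rank = " + svd.rank());
>        Matrix u = svd.getU(); // left singular vectors
>        Matrix v = svd.getV(); // right singular vectors
>    }
>}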
>
>Lance
>
>On Mon, Aug 29, 2011 at 12:50 AM, Danny Bickson <danny.bickson@gmail.com>wrote:
>
>>  Mahout - SVD matrix factorization - formatting input matrix
>>  Converting Input Format into Mahout's SVD Distributed Matrix Factorization
>> Solver
>>
>> Purpose
>> The code below converts a matrix from CSV format:
>> <from row>,<to col>,<value>\n
>> into Mahout's SVD solver input format.
>>
>>
>> For example, the 3x3 matrix:
>> 0    1.0 2.1
>> 3.0  4.0 5.0
>> -5.0 6.2 0
>>
>>
>> will be given as input in a CSV file (with 1-based indices, since the code
>> below converts them to 0-based) as:
>> 2,1,3.0
>> 3,1,-5.0
>> 1,2,1.0
>> 2,2,4.0
>> 3,2,6.2
>> 1,3,2.1
>> 2,3,5.0
>>
>> NOTE: I ASSUME THE MATRIX IS SORTED IN COLUMN ORDER
>> This code is based on code by Danny Leshem, ContextIn.
>>
>> Command line arguments:
>> args[0] - path to the CSV input file
>> args[1] - cardinality of the matrix (the length of each column vector)
>> args[2] - path to the resulting Mahout SVD input file
>>
>> Method:
>> The code below goes over the CSV file and, for each matrix column, creates
>> a SequentialAccessSparseVector that holds all the non-zero row entries
>> for that column. It then appends each column vector to the output sequence
>> file.
>>
>> Compilation:
>> Copy the Java code below into a file named Convert2SVD.java.
>> Add both the Mahout and Hadoop jars to your IDE project path. Alternatively,
>> a command-line option for compilation is given below.
>>
>>
>> import java.io.BufferedReader;
>> import java.io.FileReader;
>> import java.util.StringTokenizer;
>>
>> import org.apache.mahout.math.SequentialAccessSparseVector;
>> import org.apache.mahout.math.Vector;
>> import org.apache.mahout.math.VectorWritable;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.hadoop.io.SequenceFile.CompressionType;
>>
>> /**
>>  * Code for converting CSV format to Mahout's SVD format
>>  * @author Danny Bickson, CMU
>>  * Note: I ASSUME THE CSV FILE IS SORTED BY THE COLUMN (NAMELY THE
>> SECOND FIELD).
>>  *
>>  */
>>
>> public class Convert2SVD {
>>
>>
>>        public static int Cardinality;
>>
>>        /**
>>         *
>>         * @param args[0] - input csv file
>>         * @param args[1] - cardinality (length of vector)
>>         * @param args[2] - output file for svd
>>         */
>>        public static void main(String[] args){
>>
>> try {
>>        Cardinality = Integer.parseInt(args[1]);
>>        final Configuration conf = new Configuration();
>>        final FileSystem fs = FileSystem.get(conf);
>>        final SequenceFile.Writer writer =
>> SequenceFile.createWriter(fs, conf, new Path(args[2]),
>> IntWritable.class, VectorWritable.class, CompressionType.BLOCK);
>>
>>          final IntWritable key = new IntWritable();
>>          final VectorWritable value = new VectorWritable();
>>
>>
>>           String thisLine;
>>
>>           BufferedReader br = new BufferedReader(new FileReader(args[0]));
>>           Vector vector = null;
>>           int from = -1, to = -1;
>>           int last_to = -1;
>>           float val = 0;
>>           int total = 0;
>>           int nnz = 0;
>>           int e = 0;
>>           int max_to =0;
>>           int max_from = 0;
>>
>>           while ((thisLine = br.readLine()) != null) { // while loop begins here
>>
>>                 StringTokenizer st = new StringTokenizer(thisLine, ",");
>>                 while(st.hasMoreTokens()) {
>>                     from = Integer.parseInt(st.nextToken())-1;
>> //convert from 1 based to zero based
>>                     to = Integer.parseInt(st.nextToken())-1;
>> //convert from 1 based to zero based
>>                     val = Float.parseFloat(st.nextToken());
>>                     if (max_from < from) max_from = from;
>>                     if (max_to < to) max_to = to;
>>                     // the row index must fit in a vector of length Cardinality
>>                     if (from < 0 || to < 0 || from >= Cardinality || val == 0.0)
>>                         throw new NumberFormatException("wrong data: from: "
>> + from + " to: " + to + " val: " + val);
>>                 }
>>
>>                 //we are working on an existing column, set non-zero rows in it
>>                 if (last_to != to && last_to != -1){
>>                     value.set(vector);
>>
>>                     writer.append(key, value); //write the older vector
>>                     e+= vector.getNumNondefaultElements();
>>                 }
>>                 //a new column is observed, open a new vector for it
>>                 if (last_to != to){
>>                     vector = new SequentialAccessSparseVector(Cardinality);
>>                     key.set(to); // open a new vector
>>                     total++;
>>                 }
>>
>>                 vector.set(from, val);
>>                 nnz++;
>>
>>                 if (nnz % 1000000 == 0){
>>                   System.out.println("Col " + total + " nnz: " + nnz);
>>                 }
>>                 last_to = to;
>>
>>          } // end while
>>           br.close();
>>
>>           if (vector != null) { // flush the last column vector
>>               value.set(vector);
>>               writer.append(key, value);
>>               e += vector.getNumNondefaultElements();
>>           }
>>           // the last column was already counted in total when it was opened
>>
>>           writer.close();
>>           System.out.println("Wrote a total of " + total + " cols " +
>> " nnz: " + nnz);
>>           if (e != nnz)
>>                System.err.println("Bug:missing edges! we only got" + e);
>>
>>           System.out.println("Highest column: " + max_to + " highest
>> row: " + max_from );
>>        } catch(Exception ex){
>>                ex.printStackTrace();
>>        }
>>    }
>> }
>>
>>
>>
>> A second option to compile this file is to create a Makefile with the
>> following in it:
>>
>> all:
>>        javac -cp
>>
>> /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar
>> *.java
>>
>> Note that you will have to change the locations of the jars to point to
>> where your jars are stored.
>>
>> Example of running this conversion on the Netflix data:
>>
>> java -cp
>> .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar
>> Convert2SVD ../../netflixe.csv 17770 netflixe.seq
>> Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
>> WARNING: Unable to load native-hadoop library for your platform...
>> using builtin-java classes where applicable
>> Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
>> INFO: Got brand-new compressor
>> Row241 nnz: 1000000
>> Row381 nnz: 2000000
>> Row571 nnz: 3000000
>> Row789 nnz: 4000000
>> Row1046 nnz: 5000000
>> Row1216 nnz: 6000000
>> Row1441 nnz: 7000000
>>
>> ...
>>
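>> To sanity-check the resulting sequence file, here is a short sketch (this
>> InspectSeq class is my own illustration, not part of the original post)
>> that reads the output back and prints the first few column vectors:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.IntWritable;
>> import org.apache.hadoop.io.SequenceFile;
>> import org.apache.mahout.math.VectorWritable;
>>
>> public class InspectSeq {
>>     public static void main(String[] args) throws Exception {
>>         Configuration conf = new Configuration();
>>         FileSystem fs = FileSystem.get(conf);
>>         // args[0] is the file written by Convert2SVD, e.g. netflixe.seq
>>         SequenceFile.Reader reader =
>>             new SequenceFile.Reader(fs, new Path(args[0]), conf);
>>         IntWritable key = new IntWritable();
>>         VectorWritable value = new VectorWritable();
>>         int count = 0;
>>         while (reader.next(key, value)) {
>>             if (count++ < 5) { // print only the first few columns
>>                 System.out.println("col " + key.get() + " nnz: "
>>                         + value.get().getNumNondefaultElements());
>>             }
>>         }
>>         reader.close();
>>         System.out.println("total vectors: " + count);
>>     }
>> }
>>
>> The key and value types must match what the writer used (IntWritable and
>> VectorWritable); a mismatch shows up as an IOException from reader.next().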
>>
>> NOTE: You may also want to check out GraphLab's collaborative filtering
>> library here: <http://graphlab.org/pmf.html>. GraphLab has an SVD solver
>> that is 100% compatible with Mahout's, with performance gains of up to 50x.
>> I have created Java code to convert Mahout sequence files into GraphLab's
>> format and back. Email me and I will send you the code.
>>
>> 2011/8/29 myn <myn@163.com>
>>
>> > Thanks.
>> > But could you send the content of
>> > http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> > to me? I can't open it in China.
>> >
>> >
>> >
>> >
>> >
>> > At 2011-08-29 15:29:40,"Danny Bickson" <danny.bickson@gmail.com> wrote:
>> > >Command line arguments are found here:
>> > >https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
>> > >I wrote a quick tutorial on how to prepare sparse matrices as input to
>> > >Mahout SVD here:
>> > >
>> > >http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>> > >
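>> > >A typical invocation of the solver described on that wiki page looks
>> > >roughly like this (a sketch; the option names are worth verifying
>> > >against your Mahout build):
>> > >
>> > >$MAHOUT_HOME/bin/mahout svd \
>> > >  --input <path to SequenceFile of VectorWritables> \
>> > >  --output <output working directory> \
>> > >  --numRows <number of rows> \
>> > >  --numCols <number of columns> \
>> > >  --rank <desired decomposition rank>
>> > >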
>> > >Let me know if you have further questions.
>> > >
>> > >2011/8/29 myn <myn@163.com>
>> > >
>> > >> I want to study Singular Value Decomposition algorithms.
>> > >> I also have the book Mahout in Action, but I can't find anything about
>> > >> this algorithm.
>> > >> Is there some place that introduces how to use the method?
>> > >> So far, DistributedLanczosSolver does not look like a MapReduce method:
>> > >> org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver = svd
>> >
>>
>
>
>
>-- 
>Lance Norskog
>goksron@gmail.com
