Hi!
You can find detailed Java code to convert your example to Mahout SVD format
on my blog here:
http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
Since I know some Chinese users are blocked from Google websites, here is the
content:
Best,
Danny Bickson
Friday, February 4, 2011 - Mahout - SVD matrix factorization - formatting input matrix
Converting Input Format into Mahout's SVD Distributed Matrix Factorization
Solver
Purpose
The code below converts a matrix from CSV format:
<row>,<col>,<value>\n
(with 1-based row and column indices) into Mahout's SVD solver format.
For example,
The 3x3 matrix:
0 1.0 2.1
3.0 4.0 5.0
-5.0 6.2 0
Will be given as input in a CSV file (1-based indices; the converter subtracts one) as:
2,1,3.0
3,1,-5.0
1,2,1.0
2,2,4.0
3,2,6.2
1,3,2.1
2,3,5.0
NOTE: I ASSUME THE MATRIX IS SORTED BY COLUMN ORDER
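Because the converter relies on this ordering, it can be worth checking the input before running it. Here is a minimal stand-alone sketch (the class and helper name `isColumnSorted` are mine, not part of the converter), using only the standard library:

```java
import java.util.List;

public class SortCheck {
    /** Returns true if the CSV lines ("row,col,value") are sorted by
     *  non-decreasing column index (the second field). */
    static boolean isColumnSorted(List<String> lines) {
        int lastCol = Integer.MIN_VALUE;
        for (String line : lines) {
            int col = Integer.parseInt(line.split(",")[1]);
            if (col < lastCol) return false;
            lastCol = col;
        }
        return true;
    }

    public static void main(String[] args) {
        // the example above is column-sorted
        System.out.println(isColumnSorted(List.of("2,1,3.0", "3,1,-5.0", "1,2,1.0")));  // true
        // a column index that decreases violates the assumption
        System.out.println(isColumnSorted(List.of("1,2,1.0", "2,1,3.0")));              // false
    }
}
```

For a large file you would stream the lines with a BufferedReader instead of holding them in a list, but the check is the same.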
This code is based on code by Danny Leshem, ContextIn.
Command line arguments:
args[0] - path to the CSV input file
args[1] - cardinality of the matrix (the number of rows, i.e. the length of each column vector)
args[2] - path to the resulting Mahout SVD input file
Method:
The code goes over the CSV file and, for each matrix column, creates a
SequentialAccessSparseVector containing all of the non-zero row entries
for that column. It then appends the column vector to the output file.
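The per-column grouping can be sketched in plain Java with no Mahout or Hadoop dependencies; this is only an illustration of the pass the converter makes (the class name `ColumnGrouping` and the map-based representation are mine), assuming 1-based, column-sorted input as above:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnGrouping {
    /** Groups "row,col,value" lines (1-based indices) into one sparse
     *  map of row -> value per column, keyed by 0-based column index. */
    static Map<Integer, Map<Integer, Double>> groupByColumn(List<String> lines) {
        Map<Integer, Map<Integer, Double>> cols = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(",");
            int row = Integer.parseInt(f[0]) - 1;   // 1-based -> 0-based
            int col = Integer.parseInt(f[1]) - 1;   // 1-based -> 0-based
            double val = Double.parseDouble(f[2]);
            cols.computeIfAbsent(col, k -> new LinkedHashMap<>()).put(row, val);
        }
        return cols;
    }

    public static void main(String[] args) {
        List<String> csv = List.of(
                "2,1,3.0", "3,1,-5.0",
                "1,2,1.0", "2,2,4.0", "3,2,6.2",
                "1,3,2.1", "2,3,5.0");
        Map<Integer, Map<Integer, Double>> cols = groupByColumn(csv);
        System.out.println(cols.size());          // 3 (one entry per column)
        System.out.println(cols.get(0).get(1));   // 3.0 (row 2, col 1 of the example)
    }
}
```

The real converter streams the sorted file and writes each finished column immediately instead of building a map, which is why the sorted-by-column assumption matters: a column's entries must be contiguous.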
Compilation:
Copy the Java code below into a file named Convert2SVD.java.
Add both the Mahout and Hadoop jars to your IDE project path. Alternatively, a
command-line option for compilation is given below.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

/**
 * Code for converting CSV format to Mahout's SVD format.
 * @author Danny Bickson, CMU
 * Note: I assume the CSV file is sorted by the column (namely, the second field).
 */
public class Convert2SVD {

    public static int Cardinality;

    /**
     * @param args[0] - input csv file
     * @param args[1] - cardinality (length of each column vector)
     * @param args[2] - output file for svd
     */
    public static void main(String[] args) {
        try {
            Cardinality = Integer.parseInt(args[1]);
            final Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            final SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
                    new Path(args[2]), IntWritable.class, VectorWritable.class,
                    CompressionType.BLOCK);

            final IntWritable key = new IntWritable();
            final VectorWritable value = new VectorWritable();

            String thisLine;
            BufferedReader br = new BufferedReader(new FileReader(args[0]));
            Vector vector = null;
            int from = -1, to = -1;
            int last_to = -1;
            float val = 0;
            int total = 0;
            int nnz = 0;
            int e = 0;
            int max_to = 0;
            int max_from = 0;

            while ((thisLine = br.readLine()) != null) {
                StringTokenizer st = new StringTokenizer(thisLine, ",");
                while (st.hasMoreTokens()) {
                    from = Integer.parseInt(st.nextToken()) - 1; // convert from 1-based to 0-based
                    to = Integer.parseInt(st.nextToken()) - 1;   // convert from 1-based to 0-based
                    val = Float.parseFloat(st.nextToken());
                    if (max_from < from) max_from = from;
                    if (max_to < to) max_to = to;
                    if (from < 0 || from >= Cardinality || to < 0 || val == 0.0)
                        throw new NumberFormatException("wrong data: from: " + from
                                + " to: " + to + " val: " + val);
                }

                // a column boundary was crossed: flush the previous column vector
                if (last_to != to && last_to != -1) {
                    value.set(vector);
                    writer.append(key, value); // write the older vector
                    e += vector.getNumNondefaultElements();
                }
                // a new column is observed, open a new vector for it
                if (last_to != to) {
                    vector = new SequentialAccessSparseVector(Cardinality);
                    key.set(to); // key the vector by its column index
                    total++;
                }

                vector.set(from, val);
                nnz++;

                if (nnz % 1000000 == 0) {
                    System.out.println("Col " + total + " nnz: " + nnz);
                }
                last_to = to;
            } // end while
            br.close();

            if (vector != null) {
                value.set(vector);
                writer.append(key, value); // write the last column
                e += vector.getNumNondefaultElements();
            }

            writer.close();
            System.out.println("Wrote a total of " + total + " cols, nnz: " + nnz);
            if (e != nnz)
                System.err.println("Bug: missing entries! we only got " + e);

            System.out.println("Highest column: " + max_to + " highest row: " + max_from);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
A second option to compile this file is to create a Makefile with the
following in it:
all:
	javac -cp /mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar *.java
Note that you will have to change the location of the jars to point to where
your jars are stored.
Example of running this conversion on the Netflix data:
java -cp .:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/core-3.1.1.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-core-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/mahout-0.4/taste-web/target/mahout-taste-webapp-0.5-SNAPSHOT/WEB-INF/lib/mahout-math-0.5-SNAPSHOT.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-cli-1.2.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/hadoop-0.20.2-core.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/mnt/bigbrofs/usr7/bickson/hadoop-0.20.2/lib/commons-logging-api-1.0.4.jar Convert2SVD ../../netflixe.csv 17770 netflixe.seq

Aug 23, 2011 1:16:06 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Aug 23, 2011 1:16:06 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Row241 nnz: 1000000
Row381 nnz: 2000000
Row571 nnz: 3000000
Row789 nnz: 4000000
Row1046 nnz: 5000000
Row1216 nnz: 6000000
Row1441 nnz: 7000000

...
2011/9/23 悟统 <junwei.wang@alipay.com>
> Hi all,
> I am studying Mahout. I would like to use SVD in Mahout with a matrix.
> The matrix is like this:
> 1 0 0 0 0
> 2 4 1 0.5 2
> 2.1 2 4 0 1
> -1.8 2 1 5 1
> 0 3.4 5.9 3 9
>
> How do I input it to Mahout SVD?
>