mahout-user mailing list archives

From Stefano Bellasio <stefanobella...@gmail.com>
Subject Re: How to edit dataset for SVD recommendations with DistributedLanczos?
Date Thu, 16 Dec 2010 11:51:10 GMT
Can no one help me? I could open a new thread with my questions, but first of all I want to summarize what I have to do:

1) My goal is to obtain recommendations from the GroupLens data set.
2) I started a series of tests with Mahout and different recommenders, such as SlopeOne, user-based and item-based.
3) The second step was to try Hadoop with Mahout; everything worked with RecommenderJob and the item-based recommender in both pseudo-distributed and fully distributed mode.

4) As the last step, I want to use SVD with my GroupLens data set, but here I'm completely lost and I need some hints to get started. I think I need to "transform" my data set into a matrix and then use DistributedLanczosSolver. All of this seems simple, but it is not, so I'm asking if someone can give me an example or an explanation :) I've also put a rough sketch of what I have in mind right below.
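The idea (just my guess, not tested) is to read the ratings file, build one sparse row vector per user, and write the rows into a SequenceFile<IntWritable, VectorWritable>, which as far as I understand is the matrix format that DistributedRowMatrix and DistributedLanczosSolver expect. The file name, the tab delimiter and the 943x1682 dimensions below are assumptions based on the MovieLens 100k "u.data" file; adjust them for your own set.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class GroupLensToMatrix {

  public static void main(String[] args) throws Exception {
    int numItems = 1682; // MovieLens 100k: 943 users x 1682 items (an assumption, adjust as needed)

    // build one sparse vector of item ratings per user
    Map<Integer, Vector> rows = new HashMap<Integer, Vector>();
    BufferedReader in = new BufferedReader(new FileReader("u.data"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");          // userID, itemID, rating, timestamp
      int user = Integer.parseInt(fields[0]) - 1;  // 0-based row index
      int item = Integer.parseInt(fields[1]) - 1;  // 0-based column index
      double rating = Double.parseDouble(fields[2]);
      Vector row = rows.get(user);
      if (row == null) {
        row = new RandomAccessSparseVector(numItems);
        rows.put(user, row);
      }
      row.set(item, rating);
    }
    in.close();

    // write the rows keyed by user index, one VectorWritable per matrix row
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("grouplens-matrix/matrix");
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output,
        IntWritable.class, VectorWritable.class);
    for (Map.Entry<Integer, Vector> entry : rows.entrySet()) {
      writer.append(new IntWritable(entry.getKey()), new VectorWritable(entry.getValue()));
    }
    writer.close();
  }
}

If that is the right idea, I guess the next step would be something like "bin/mahout svd --input grouplens-matrix --output grouplens-svd --numRows 943 --numCols 1682 --rank 50" on top of that directory, but I still have to check bin/mahout svd --help for the exact option names in my version.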

Thank you, I hope that someone will answer my questions :) Stefano

On 06/12/10 18:34, Derek O'Callaghan wrote:

> Yeah, that should work. You can pass in a different array to getSampleData() instead of DOCS, or change getSampleData() if you want to (i.e. changing the current "for (int i = 0; i < docs2.length; ..." loop body). I think that should be all you need...
> 
> 
> On 06/12/10 17:20, Stefano Bellasio wrote:
>> Thanks :) Found it! Well, I think the part that is useful for me is this one:
>> 
>>   private List<VectorWritable> sampleData;
>> 
>>   private String[] termDictionary;
>> 
>>   @Override
>>   @Before
>>   public void setUp() throws Exception {
>>     super.setUp();
>>     Configuration conf = new Configuration();
>>     FileSystem fs = FileSystem.get(conf);
>>     // Create test data
>>     getSampleData(DOCS);
>>     ClusteringTestUtils.writePointsToFile(sampleData, true, getTestTempFilePath("testdata/file1"), fs, conf);
>>   }
>> 
>>   private void getSampleData(String[] docs2) throws IOException {
>>     sampleData = new ArrayList<VectorWritable>();
>>     RAMDirectory directory = new RAMDirectory();
>>     IndexWriter writer = new IndexWriter(directory,
>>                                          new StandardAnalyzer(Version.LUCENE_30),
>>                                          true,
>>                                          IndexWriter.MaxFieldLength.UNLIMITED);
>>     for (int i = 0; i < docs2.length; i++) {
>>       Document doc = new Document();
>>       Fieldable id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
>>       doc.add(id);
>>       // Store both position and offset information
>>       Fieldable text = new Field("content", docs2[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES);
>>       doc.add(text);
>>       writer.addDocument(doc);
>>     }
>>     writer.close();
>>     IndexReader reader = IndexReader.open(directory, true);
>>     Weight weight = new TFIDF();
>>     TermInfo termInfo = new CachedTermInfo(reader, "content", 1, 100);
>> 
>>     int numTerms = 0;
>>     for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
>>       it.next();
>>       numTerms++;
>>     }
>>     termDictionary = new String[numTerms];
>>     int i = 0;
>>     for (Iterator<TermEntry> it = termInfo.getAllEntries(); it.hasNext();) {
>>       String term = it.next().term;
>>       termDictionary[i] = term;
>>       System.out.println(i + " " + term);
>>       i++;
>>     }
>>     VectorMapper mapper = new TFDFMapper(reader, weight, termInfo);
>>     Iterable<Vector> iterable = new LuceneIterable(reader, "id", "content", mapper);
>> 
>>     i = 0;
>>     for (Vector vector : iterable) {
>>       assertNotNull(vector);
>>       NamedVector namedVector;
>>       if (vector instanceof NamedVector) {
>>         // rename it for testing purposes
>>         namedVector = new NamedVector(((NamedVector) vector).getDelegate(), "P(" + i + ')');
>>       } else {
>>         namedVector = new NamedVector(vector, "P(" + i + ')');
>>       }
>>       System.out.println(AbstractCluster.formatVector(namedVector, termDictionary));
>>       sampleData.add(new VectorWritable(namedVector));
>>       i++;
>>     }
>>   }
>> 
>> 
>> Can I pass my dataset to sampleData and then use something like public void testKmeansSVD()? ...am I right? Thanks
>> On 06/12/10 18:04, Derek O'Callaghan wrote:
>> 
>>> Hi Stefano,
>>> 
>>> The class can be found in mahout-utils/src/test/java.
>>> 
>>> Derek
>>> 
>>> On 06/12/10 16:54, Stefano Bellasio wrote:
>>>> Hi Derek, thanks! I'm looking in my Mahout files and I can't find this class under org.apache.mahout.clustering.TestClusterDumper. Is it there, or in another package?
>>>> On 06/12/10 14:21, Derek O'Callaghan wrote:
>>>> 
>>>>> Hi Stefano,
>>>>> 
>>>>> TestClusterDumper has a few test methods which perform SVD with clustering, e.g. testKmeansSVD(). These methods demonstrate the creation of a matrix for use with SVD, so I think they might help to give you an overview of what's required.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Derek
>>>>> 
>>>>> On 04/12/10 18:04, Stefano Bellasio wrote:
>>>>> 
>>>>>> I think I need to put all the data into a matrix, but how? I used Mahout's SVD command line with seqdirectory and seq2sparse, but without success :) Well, I think I finally need something like this: http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html#a1407801 but for recommendations. Can you help me with some suggestions or tutorials? I see that there is much interest in SVD and DistributedLanczos but really few suggestions and tutorials. Thank you again for your
>>>>>> 
>>>> 
>> 

