mahout-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: http://bixolabs.com/datasets/public-terabyte-dataset-project/
Date Fri, 06 Nov 2009 18:49:52 GMT

On Nov 4, 2009, at 1:16 AM, Robin Anil wrote:

> I had always thought we should be using Hadoop to number these
> features and create the vectors the way the Bayes classifier does it.
> In the Bayes classifier, I don't bother to number the features;
> instead I use a String=>double mapping. I will see if feature
> numbering could be done by a single map/reduce job. If that's the
> case, we can use the TfIdfDriver to generate the tf-idf scores and
> then convert the docs into array(int=>double) vectors. That way it
> would be done in a distributed manner.

Ideally, I think we should have a bunch of different conversion
mechanisms. We should probably move the TfIdfDriver out to the Utils
module and see if it can be made more generic. We could also use Hadoop
M/R jobs for the Lucene extraction code, so that if you have a bunch of
indexes in a large-scale distributed environment, you can run M/R over
them to create the vectors.
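
To make the conversion Robin describes concrete, here is a rough
in-memory sketch of the two steps: number the features, then turn each
doc's String=>double scores into an int=>double vector. It is not
Hadoop code and not what TfIdfDriver does internally; the function
names are made up for illustration, and the tf * log(N/df) weighting is
an assumption. In a real job the dictionary pass and the scoring pass
would each be a map/reduce step.

```python
import math
from collections import Counter


def build_dictionary(docs):
    """Pass 1 (hypothetical): give every distinct term an integer id.

    Sorting makes the numbering deterministic across runs.
    """
    terms = sorted({term for doc in docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def document_frequencies(docs):
    """Count in how many docs each term appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tfidf_vectors(docs):
    """Pass 2 (hypothetical): emit one sparse int=>double vector per doc.

    Assumed weighting: term count * log(N / document frequency).
    """
    dictionary = build_dictionary(docs)
    df = document_frequencies(docs)
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            dictionary[term]: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return dictionary, vectors


if __name__ == "__main__":
    sample = [["hadoop", "vector", "hadoop"], ["vector", "lucene"]]
    dictionary, vectors = tfidf_vectors(sample)
    print(dictionary)
    print(vectors)
```

The only piece that really needs global coordination is the dictionary;
once every term has an id, each doc can be converted independently.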

-Grant
