mahout-dev mailing list archives

From Grant Ingersoll
Subject Re: Mahout on EC2
Date Fri, 19 Sep 2008 21:08:02 GMT

More inline.

On Sep 19, 2008, at 12:42 PM, Julien Nioche wrote:

> Hi,
> I am currently working on the classification of pages according to
> DMOZ :-)
> I have been planning to give Mahout a serious try but never managed to
> do it, so this could be a good opportunity.
> We have downloaded and parsed the latest DMOZ snapshot. Everything is
> currently stored in a DB; we have the following fields for each
> document:
> - URL
> - category (level 1 from DMOZ)
> - content
> - title
> - description (taken from the HTML meta tags)
> - keywords (taken from the HTML meta tags)
> - status (unavailable|fetched)
> We are using our own API to convert the information for each document
> into a vector, with a choice of weighting scheme (tf-idf, frequency,
> etc.). The weighting takes the fields into account, i.e. if using
> tf-idf, the weight of a given term takes into account its frequency in
> that specific field (say, title).
> I could describe the whole process on a Wiki page but that would be
> quite long (especially if we need to go through all the details of
> Nutch),

I think you could just say something like "Go get Nutch and point it
at X." The Nutch getting-started guide isn't too hard.
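
For anyone following along, here is a minimal sketch of the field-aware
weighting described in the quoted text. It is not Julien's API; the
function name field_tfidf, the whitespace tokenizer, the log(N/df)
formula, and the two toy documents are all assumptions made for
illustration. Each attribute is a (field, term) pair, so "java" in the
title and "java" in the content get separate weights.

```python
import math
from collections import Counter

def field_tfidf(docs, fields=("title", "content")):
    """Return one {(field, term): weight} dict per document.

    tf is counted within a single field of one document; idf is computed
    per (field, term) pair over the whole collection (assumed formula).
    """
    n_docs = len(docs)
    # Per-document term counts, keyed by (field, term).
    counts = []
    for doc in docs:
        c = Counter()
        for field in fields:
            for term in doc.get(field, "").lower().split():
                c[(field, term)] += 1
        counts.append(c)
    # Document frequency: number of documents containing each (field, term).
    df = Counter()
    for c in counts:
        df.update(c.keys())
    return [
        {key: tf * math.log(n_docs / df[key]) for key, tf in c.items()}
        for c in counts
    ]

# Toy documents (hypothetical, not taken from the DMOZ snapshot).
docs = [
    {"title": "java news", "content": "java code and java tools"},
    {"title": "travel news", "content": "cheap travel deals"},
]
vectors = field_tfidf(docs)
```

A term that appears in the same field of every document (here "news" in
the title) gets weight zero, which is the usual tf-idf behaviour.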

> maybe I could simply generate a textual representation of the matrix
> and put it in a place where people could download it?

If that's feasible.  I don't think there would be distribution issues,  
right?  You're just putting up a matrix, not the actual content, but  

> That could be the starting point of the use case. There would also be
> a lexicon file containing the mapping between the attribute labels and
> their index.
> There could be all sorts of possible experiments from there, e.g.
> trying to see which attributes are the most discriminant, etc.
> Does that make sense?

I think this would be great.
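
To make the idea concrete, here is a rough sketch of what a textual
matrix plus lexicon could look like. The layout is an assumption, not
something agreed on in this thread: one line per document with the
category followed by sparse index:weight pairs, and a lexicon with one
index<TAB>label line per attribute. The function names, attribute
labels, and weights below are made up for illustration.

```python
def build_lexicon(vectors):
    """Assign each attribute label an integer index (sorted label order)."""
    labels = sorted({label for vec in vectors for label in vec})
    return {label: i for i, label in enumerate(labels)}

def matrix_lines(categories, vectors, lexicon):
    """One line per document: category, then sorted index:weight pairs."""
    lines = []
    for category, vec in zip(categories, vectors):
        # Skip zero weights so the text file stays sparse.
        pairs = sorted((lexicon[label], w) for label, w in vec.items() if w != 0)
        lines.append(" ".join([category] + [f"{i}:{w:.4f}" for i, w in pairs]))
    return lines

def lexicon_lines(lexicon):
    """One 'index<TAB>label' line per attribute, ordered by index."""
    ordered = sorted(lexicon.items(), key=lambda kv: kv[1])
    return [f"{i}\t{label}" for label, i in ordered]

# Hypothetical weighted vectors keyed by attribute label.
vectors = [
    {"title:java": 0.6931, "content:java": 1.3863},
    {"title:travel": 0.6931, "content:deals": 0.6931},
]
categories = ["Computers", "Recreation"]

lexicon = build_lexicon(vectors)
matrix_text = matrix_lines(categories, vectors, lexicon)
lexicon_text = lexicon_lines(lexicon)
```

Keeping the lexicon separate means the matrix file only carries
integers and weights, and anyone who wants to inspect the most
discriminant attributes can map indexes back to labels.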

> Julien
> 2008/9/19 Grant Ingersoll
>> Amazon has generously donated some credits, so I plan on putting
>> Mahout up and doing some testing. I was wondering if people had
>> suggestions on things they would like to see from Mahout. For
>> starters, I'm going to put up a public image containing 0.1 when it's
>> ready, but I'd also like to wiki up some examples, i.e. go here, get
>> this data, put it in this format, and then do X. We have some simple
>> examples, but I think it would be cool to show how to do something a
>> bit more complex, like maybe classify web pages according to DMOZ or
>> to cluster on stuff, or maybe put in a large traveling salesman
>> problem using the GA stuff Deneche did.
>> Thoughts? Anyone else interested in setting up some use cases?
>> -Grant

Grant Ingersoll
