mahout-dev mailing list archives

From "Robin Anil" <>
Subject Re: Mahout on EC2
Date Fri, 19 Sep 2008 16:53:49 GMT
Hi Julien,

It would be great if you could test it on the NB/CNB classifier
implementation in Mahout. Could you create a dump of the files in the
directory format used by the Mahout NB implementation (the docs of each
category reside in that category's own directory)? There is no need for a
separate mapping table between lexicon and features, as the implementation
takes care of features in text format. Maybe with a good train/test split
you can compare it across various weighting techniques.

Robin Anil
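
For illustration, the dump described above could look roughly like the sketch below. It is only a sketch under stated assumptions: the field names (`url`, `category`, `title`, `content`, `status`) mirror the DB fields listed in the quoted message, the helper names are hypothetical, and one plain-text file per document is an assumption, since the exact layout expected by the Mahout NB code is not spelled out in this thread.

```python
import re
from pathlib import Path


def safe_name(text):
    """Turn a category or URL into a filesystem-friendly name."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", text).strip("_") or "unnamed"


def dump_by_category(docs, out_dir):
    """Write each document as a text file under out_dir/<category>/.

    `docs` is an iterable of dicts with 'url', 'category', 'title' and
    'content' keys (hypothetical names mirroring the DB fields).
    Returns the number of files written.
    """
    out = Path(out_dir)
    written = 0
    for i, doc in enumerate(docs):
        if doc.get("status", "fetched") != "fetched":
            continue  # skip pages that could not be downloaded
        cat_dir = out / safe_name(doc["category"])
        cat_dir.mkdir(parents=True, exist_ok=True)
        # Prefix with a running number so file names stay unique.
        path = cat_dir / f"{i:07d}_{safe_name(doc['url'])[:80]}.txt"
        # Assumption: title and content are stored as plain text, one file per doc.
        text = f"{doc.get('title', '')}\n{doc.get('content', '')}"
        path.write_text(text, encoding="utf-8")
        written += 1
    return written
```

Calling `dump_by_category(rows, "dmoz-dump")` on rows read from the DB would then produce one sub-directory per level-1 DMOZ category.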

On Fri, Sep 19, 2008 at 10:12 PM, Julien Nioche <> wrote:

> Hi,
> I am currently working on the classification of pages according to DMOZ :-)
> I have been planning to give Mahout a serious try but never managed to do
> it, so that could be a good opportunity to do that.
> We have downloaded and parsed the latest DMOZ snapshot. Everything is
> currently stored in a DB, we have the following fields for each document:
> - URL
> - category (level 1 from DMOZ)
> - content
> - title
> - description (taken from the HTML meta tags)
> - keywords (taken from the HTML meta tags)
> - status (unavailable|fetched)
> We are using our own API to convert the information for each document into
> a vector with a choice of which weighting scheme to use (tf-idf, frequency,
> etc.). The weighting takes the fields into account, i.e. if using tf-idf,
> the weight of a given term takes into account its frequency in this
> specific field (say title).
> I could describe the whole process on a Wiki page but that would be quite
> long (especially if we need to go through all the details of Nutch), maybe
> I could simply generate a textual representation of the matrix and put it
> in a place where people could download it? That could be the starting point
> of the use case. There would also be a lexicon file containing the mapping
> between the attribute labels and their index.
> There could be all sorts of possible experiments from there e.g. trying to
> see which attributes are the most discriminant etc...
> Does that make sense?
> Julien
> 2008/9/19 Grant Ingersoll <>
> > Amazon has generously donated some credits, so I plan on putting Mahout
> > up and doing some testing.  Was wondering if people had suggestions on
> > things they would like to see from Mahout.  For starters, I'm going to
> > put up a public image containing 0.1 when it's ready, but I'd also like
> > to wiki up some examples.  I.e. go here, get this data, put it in this
> > format and then do X.  We have some simple examples, but I think it would
> > be cool to show how to do something a bit more complex, like maybe
> > classify web pages according to DMOZ or to cluster on stuff, or maybe put
> > in a large traveling salesman problem using the GA stuff Deneche did.
> >
> > Thoughts?  Anyone else interested in setting up some use cases?
> >
> > -Grant
> >
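
As a rough illustration of the field-aware weighting, lexicon file, and textual matrix that Julien describes in the quoted message, here is a minimal sketch. It is not Julien's API: the function names, the `field=term` attribute labels, the plain `tf * log(N / df)` formula, and the `category index:weight` line layout are all assumptions chosen for the example.

```python
import math
from collections import Counter


def field_tokens(doc, fields):
    """Yield attribute labels of the form 'field=term', e.g. 'title=mahout'."""
    for field in fields:
        for term in doc.get(field, "").lower().split():
            yield f"{field}={term}"


def build_vectors(docs, fields=("title", "content")):
    """Return (lexicon, vectors) with tf-idf weights computed per field.

    The same word in 'title' and in 'content' is treated as two
    different attributes, so its weight depends on the field.
    """
    counts = [Counter(field_tokens(d, fields)) for d in docs]
    df = Counter()  # number of documents containing each attribute
    for c in counts:
        df.update(c.keys())
    # Lexicon: mapping between attribute labels and their index.
    lexicon = {label: i for i, label in enumerate(sorted(df))}
    n = len(docs)
    vectors = []
    for c in counts:
        vec = {lexicon[a]: tf * math.log(n / df[a]) for a, tf in c.items()}
        vectors.append(vec)
    return lexicon, vectors


def to_line(category, vec):
    """One sparse text line: 'category index:weight ...' (assumed layout)."""
    cells = " ".join(f"{i}:{w:.4f}" for i, w in sorted(vec.items()))
    return f"{category} {cells}".rstrip()
```

Writing `to_line(...)` for every document to one file, and the `lexicon` pairs to another, would give a downloadable textual matrix plus lexicon of the kind proposed above.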
