mahout-user mailing list archives

From Ted Dunning
Subject Re: LYRL2004/RCV1 input for Classification?
Date Mon, 28 Jun 2010 20:22:43 GMT
I think the chance that Mahout will use the RCV1 data as-is is pretty near
zero.  The issue is that RCV1 uses the TREC convention of separate files for
documents and relevance judgements (largely because a single document can
plausibly be relevant to multiple queries in most of the TREC tasks).

That said, it doesn't take more than a few lines of glue to smash RCV1 into
one of the several formats that we use in Mahout.
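
For illustration, here is a minimal sketch of that kind of glue in Python.  It
assumes the LYRL2004 layout as I understand it: token files made of ".I <id>"
and ".W" markers followed by token lines, and a qrels file with one
"<category> <doc id> 1" line per judgement.  The output (one "label<TAB>text"
line per document/category pair) is only an assumed target, not a confirmed
Mahout input format, and the function names and sample data are made up.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map doc id -> list of category codes from '<category> <doc_id> 1' lines."""
    labels = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            labels[parts[1]].append(parts[0])
    return labels


def read_docs(lines):
    """Yield (doc_id, text) pairs from '.I <id>' / '.W' / token lines."""
    doc_id, tokens = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(".I "):
            if doc_id is not None:  # flush the previous document
                yield doc_id, " ".join(tokens)
            doc_id, tokens = line[3:].strip(), []
        elif line == ".W" or not line:
            continue  # section marker or blank separator
        else:
            tokens.extend(line.split())
    if doc_id is not None:  # flush the last document
        yield doc_id, " ".join(tokens)


def join_labels(doc_lines, qrels_lines):
    """Yield 'label<TAB>text', one line per (document, category) pair."""
    labels = read_qrels(qrels_lines)
    for doc_id, text in read_docs(doc_lines):
        for label in labels.get(doc_id, []):
            yield f"{label}\t{text}"


if __name__ == "__main__":
    # Made-up sample in the assumed layout; real input would come from files.
    docs = [".I 1", ".W", "oil price rise", "", ".I 2", ".W", "bank rate"]
    qrels = ["C15 1 1", "E11 1 1", "M14 2 1"]
    for out_line in join_labels(docs, qrels):
        print(out_line)
```

A document with several categories is emitted once per category here; a
learner that expects a single label per example would need a different choice,
such as keeping only one category per document.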

The real problem is that input formats are not yet consistent across the
different supervised learning programs in Mahout.  Naive Bayes, Random
Forests, SGD and SVM all use inputs that they inherited from their original
applications.  There is a bit of motion afoot to converge these systems, and
you can definitely help there.

On Mon, Jun 28, 2010 at 1:06 PM, Brandon Mensing wrote:

> Has anyone used the LYRL2004 RCV1 data for input to classification? I'm
> trying to determine if it's possible to plug it into the given
> classification training without significant modification to the source or
> the data.
> Thanks
> Brandon Mensing
