mahout-user mailing list archives

From Jake Mannix <>
Subject Re: Sequence file format for Kmeans, LDA, etc.
Date Fri, 13 Nov 2009 21:26:15 GMT
Decomposer (in the process of donating, just gotta choose what linear
primitives to convert to!) has a DistributedMatrix which does this for
data already parsed into SequenceFiles of Writable Vectors, and I really
like this kind of interface.

Doing things like

  DistributedMatrix HdfsInputTextMatrix.extractTfIdfCorpus()

where this method sets up and runs an M/R job on a remote cluster, with the
output also living on HDFS, and the handle you have can now do all the
things which a Matrix impl can do... this kind of thing makes using the code
much less like scripting some procedural Jobs, and more like actual OO
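
A minimal sketch of the "handle" idea discussed in this thread, under loud
assumptions: FileBackedMatrix, get() and times() are hypothetical names, not
Decomposer or Mahout API. A local text file stands in for a SequenceFile on
HDFS, and a plain scaling pass stands in for the M/R job. The point is only
that the caller holds an object wrapping a file name, can pull out single
elements for debugging, and gets back a new handle whose data lives in the
output file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Mahout or Decomposer.
// Each line of the backing file is one row of whitespace-separated doubles.
final class FileBackedMatrix {
    private final Path path;

    FileBackedMatrix(Path path) {
        this.path = path;
    }

    // The handle is just a wrapper around a file name.
    Path path() {
        return path;
    }

    // Element extraction for debugging and diagnostics (reads the whole file).
    double get(int row, int col) {
        try {
            List<String> lines = Files.readAllLines(path);
            String[] cells = lines.get(row).trim().split("\\s+");
            return Double.parseDouble(cells[col]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a method that would set up and run a job: the result is
    // written to 'output' and returned as a new handle, not loaded in memory.
    FileBackedMatrix times(double scalar, Path output) {
        try {
            List<String> scaledRows = new ArrayList<>();
            for (String line : Files.readAllLines(path)) {
                StringBuilder row = new StringBuilder();
                for (String cell : line.trim().split("\\s+")) {
                    if (row.length() > 0) {
                        row.append(' ');
                    }
                    row.append(Double.parseDouble(cell) * scalar);
                }
                scaledRows.add(row.toString());
            }
            Files.write(output, scaledRows);
            return new FileBackedMatrix(output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something shaped like this, calling code reads as
new FileBackedMatrix(in).times(2.0, out).get(0, 1) instead of a script that
wires job inputs and outputs together by hand.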


On Fri, Nov 13, 2009 at 1:15 PM, Ted Dunning <> wrote:

> This talk combined with previous talk about preferred mode of composing
> tools (script writing using java) is beginning to make me think that we
> need something like an HdfsMatrix and LocalFileMatrix which are simply
> wrappers around file names, but which allow extraction of elements (for
> debugging and diagnostics and sequential implementations) or for passing
> to generic driver routines or receiving from generic conversion routines.
>
> Should I open a JIRA?
>
> On Fri, Nov 13, 2009 at 11:54 AM, Grant Ingersoll <> wrote:
>
> > Also, take a look at what the TfIdfDriver does for the classifier stuff.
> > This is an M/R job for converting text for its format.  I think we can
> > abstract that to be more general purpose and then move it under the Utils
> > module.  The only thing that likely needs to change is whether we output
> > the Writable for the classifier or whether we output a Vector.  That is
> > my naive view at this point.
> >
> --
> Ted Dunning, CTO
> DeepDyve
