mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Non-compatible mapper keys between LDADriver and CVB0Driver
Date Tue, 24 Jan 2012 21:48:38 GMT
In general, workflows with matrices in Mahout handle
SequenceFile<IntWritable, VectorWritable>, as this is the on-disk format of
the class DistributedRowMatrix.  The original Mahout LDA pre-dated this
move to standardize closer to that format, and so it didn't have that
requirement.

Now, as you say, it's true that in the new implementation the keys aren't
actually used, so in principle we could just go with WritableComparable<?>
for the keys of CVB0Driver's mappers/reducers.  In fact, it would make
certain integrations a little nicer, at the cost of pushing the
incompatibility somewhere else.  For example, the output
p(document | topic) distributions go into a SequenceFile whose keys are the
same as the input corpus keys (i.e. the doc_id values), and there may be
workflows which take this matrix and transpose it to multiply it by another
matrix, or something of that nature.  If the keys are IntWritable, this all
works just fine.  If not, then transpose will fail horribly, as will matrix
multiplication.
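
To make the transpose point concrete, here is a minimal sketch in plain Java.
It does not use Mahout or Hadoop classes: the nested maps are only a stand-in
for rows of (IntWritable key, VectorWritable value), and the class and method
names are made up for illustration.  The thing to notice is that each row key
ends up as a column index in the transposed rows, and vector indices are
ints, so a Text or LongWritable key has nowhere to go.

```java
import java.util.Map;
import java.util.TreeMap;

public class TransposeSketch {

    // rows: row key -> (column index -> value). The inner map stands in for a
    // sparse vector, whose indices are ints.
    public static Map<Integer, Map<Integer, Double>> transpose(
            Map<Integer, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> out = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
            int rowKey = row.getKey();
            for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
                // Entry (rowKey, col) moves to (col, rowKey): the old row key
                // is now used as a vector index, which is why it must be an int.
                out.computeIfAbsent(cell.getKey(), k -> new TreeMap<>())
                   .put(rowKey, cell.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two documents (keys 0 and 1) with weights over two topics.
        Map<Integer, Map<Integer, Double>> docTopics = new TreeMap<>();
        docTopics.put(0, new TreeMap<>(Map.of(0, 0.9, 1, 0.1)));
        docTopics.put(1, new TreeMap<>(Map.of(1, 1.0)));
        System.out.println(transpose(docTopics));
    }
}
```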

Standardizing on a common fixed format internally avoids some of these
problems, while at the same time being a bit inflexible.

It's possible we could add a command-line option + some internal switches
to allow the user to explicitly force untyped keys, or just warn on
non-integer keys or something...
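
A rough sketch of what such a switch might look like, again in plain Java.
Nothing here is an existing Mahout option: the Policy enum, its values and
checkKey are hypothetical names, and Integer stands in for IntWritable so the
example runs without Hadoop on the classpath.

```java
public class KeyPolicySketch {

    // Hypothetical policy; a real driver would presumably set it from a
    // command-line flag.
    public enum Policy { REQUIRE_INT, WARN_ON_NON_INT }

    // Returns true if the key can serve as a matrix row index.
    // Under REQUIRE_INT a non-integer key is an error; under WARN_ON_NON_INT
    // it is reported on stderr and the caller may carry on.
    public static boolean checkKey(Object key, Policy policy) {
        if (key instanceof Integer) {  // stand-in for "key instanceof IntWritable"
            return true;
        }
        String type = (key == null) ? "null" : key.getClass().getSimpleName();
        String msg = "Non-integer key of type " + type
                + ": transpose and matrix multiply will not work on this output";
        if (policy == Policy.REQUIRE_INT) {
            throw new IllegalArgumentException(msg);
        }
        System.err.println("WARN: " + msg);
        return false;
    }

    public static void main(String[] args) {
        System.out.println(checkKey(42, Policy.REQUIRE_INT));
        System.out.println(checkKey(42L, Policy.WARN_ON_NON_INT));
    }
}
```

Warning keeps Long- or Text-keyed corpora usable for LDA itself, while at
least flagging up front that the output is not a valid DistributedRowMatrix.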

I can just imagine seeing the questions on this very list when someone
takes the output of their Long-keyed corpus and tries to matrix multiply it
by some other matrix...

  -jake

On Tue, Jan 24, 2012 at 1:27 PM, John Conwell <john@iamjohn.me> wrote:

> I wanted to compare the two LDA implementations, and I noticed that for the
> input corpus sequence file (key: doc_id, value: vector), the Key for
> the input file for LDADriver takes any WritableComparable<?> key, but the
> Key for the input file for CVB0Driver requires IntWritable explicitly.  Is
> there some reason these two LDA implementations can't both use
> WritableComparable<?> for the key of the input sequence file?  It would
> make integrating them into application workflows much easier and
> more consistent.
>
> --
>
> Thanks,
> John C
>
