mahout-user mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 18:33:07 GMT
On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <srowen@gmail.com> wrote:

> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> > The ItemSimilarityJob actually uses implementations of the Vector
> > class hierarchy?  I think that's the issue - if the on-disk and in-mapper
> > representations are never Vectors, then they won't interoperate with
> > any of the matrix operations...
>
> Yes they are Vectors.
>

Oh, I guess I missed that. Which step/phase of the ItemSimilarityJob uses
these on trunk currently?  I don't see any mappers which take in
(int, vector) pairs...


> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
> ints to be used as dimensions in a Vector. So in most cases things are
> keyed by these int pseudo-IDs. That's OK too.
>
> A matrix is a bunch of vectors -- at least, that's a nice structure
> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>
> Is that not what other jobs are using?
> What's the better alternative we could think about converging on?
>

Yes, as long as the *on HDFS* representation is a
SequenceFile<IntWritable,VectorWritable>, we can interoperate.  Or
now that you've moved on to VIntWritable, I should migrate the distributed
matrix stuff to do the same.

And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses
are reusable and would reduce replicated work as well...
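
To make the shape concrete, here is a minimal plain-Java sketch of the idea:
external IDs mapped to int pseudo-IDs, a matrix stored as int row ID -> row
vector, and one per-row operation reused across every row. It is only a
stand-in under stated assumptions: IdIndex, RowMatrix, mapRows and squaredNorm
are hypothetical names, not Mahout or Hadoop classes, and in-memory maps take
the place of a SequenceFile<IntWritable,VectorWritable> and of a real Mapper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

/** Plain-Java stand-in; none of these names are Mahout or Hadoop APIs. */
public class RowKeyedMatrixSketch {

    /** Assigns dense int pseudo-IDs to arbitrary external IDs, in first-seen order. */
    static final class IdIndex {
        private final Map<String, Integer> toInt = new HashMap<>();

        int idFor(String externalId) {
            Integer existing = toInt.get(externalId);
            if (existing != null) {
                return existing;
            }
            int next = toInt.size();
            toInt.put(externalId, next);
            return next;
        }
    }

    /** Matrix as rows: int row ID -> sparse row vector (column index -> value). */
    static final class RowMatrix {
        final Map<Integer, Map<Integer, Double>> rows = new TreeMap<>();

        void set(int row, int col, double value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
        }

        /** Applies one per-row function to every (rowId, rowVector) pair, like a map phase. */
        <R> Map<Integer, R> mapRows(BiFunction<Integer, Map<Integer, Double>, R> fn) {
            Map<Integer, R> out = new TreeMap<>();
            for (Map.Entry<Integer, Map<Integer, Double>> e : rows.entrySet()) {
                out.put(e.getKey(), fn.apply(e.getKey(), e.getValue()));
            }
            return out;
        }
    }

    /** A reusable per-row operation: squared Euclidean norm of the row. */
    static double squaredNorm(Map<Integer, Double> row) {
        double sum = 0.0;
        for (double v : row.values()) {
            sum += v * v;
        }
        return sum;
    }

    public static void main(String[] args) {
        IdIndex docs = new IdIndex();   // row IDs (documents)
        IdIndex terms = new IdIndex();  // column IDs (terms)
        RowMatrix matrix = new RowMatrix();

        matrix.set(docs.idFor("docA"), terms.idFor("hadoop"), 3.0);
        matrix.set(docs.idFor("docA"), terms.idFor("mahout"), 4.0);
        matrix.set(docs.idFor("docB"), terms.idFor("mahout"), 2.0);

        // Any job that agrees on the (int, vector) row format can reuse squaredNorm.
        System.out.println(matrix.mapRows((id, row) -> squaredNorm(row)));
        // prints {0=25.0, 1=4.0}
    }
}
```

The point of the sketch is only that once every job agrees on the
(int key, vector value) row format, per-row logic such as squaredNorm can be
written once and shared rather than reimplemented per job.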

  -jake
