mahout-user mailing list archives

From Sebastian Schelter <ssc.o...@googlemail.com>
Subject Re: Generating a Document Similarity Matrix
Date Wed, 09 Jun 2010 18:50:19 GMT
Actually there's no real reason why vectors couldn't be used, except that
the CF data structures use longs as keys and floats as values, as opposed
to ints and doubles on the vector side. But at first glance I think we could
certainly migrate them to use vectors.
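
As a rough sketch of what such a migration involves (this is not the actual
Mahout code: the class name, both method names and the sample data are made
up for illustration, and a plain sorted Map stands in for a Mahout Vector),
the long IDs have to be folded into int indices and the float values widened
to doubles:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from Mahout.
public class PreferenceToVector {

    // Fold a 64-bit ID into a non-negative int usable as a vector dimension.
    // Assumption: a hash-based fold is acceptable. Distinct longs can collide,
    // so a real job would also need an index -> ID lookup to map results back.
    static int idToIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    // Turn parallel arrays of (long ID, float value) into a sparse
    // int -> double map, the shape a vector expects.
    static Map<Integer, Double> toSparseVector(long[] ids, float[] values) {
        if (ids.length != values.length) {
            throw new IllegalArgumentException("ids and values must have the same length");
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (int i = 0; i < ids.length; i++) {
            // float -> double widening is lossless; on an index collision
            // the later entry overwrites the earlier one.
            vector.put(idToIndex(ids[i]), (double) values[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        long[] itemIds = {42L, 7L, 5_000_000_000L}; // made-up sample IDs
        float[] prefs = {4.5f, 3.0f, 1.0f};         // made-up sample values
        System.out.println(toSparseVector(itemIds, prefs));
    }
}
```

IDs that already fit in a non-negative int keep their value as the index;
larger ones get hashed, which is where the collision caveat comes from.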

-sebastian

2010/6/9 Sean Owen <srowen@gmail.com>

> Nope I'm dreaming. These jobs do use custom output formats. I hadn't
> really looked closely either. (Everything else uses vectors.) Now I
> imagine there is some reason but yeah it would be much better to
> operate in terms of vectors if possible.
>
> Sebastian is there a reason Vectors couldn't be used?
>
> On Wed, Jun 9, 2010 at 7:33 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> > On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <srowen@gmail.com> wrote:
> >
> >> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> >> > The ItemSimilarityJob actually uses implementations of the Vector
> >> > class hierarchy?  I think that's the issue - if the on-disk and
> >> > in-mapper representations are never Vectors, then they won't
> >> > interoperate with any of the matrix operations...
> >>
> >> Yes they are Vectors.
> >>
> >
> > Oh, I guess I missed that, which step/phase of the ItemSimilarity job
> > uses these, on trunk currently?  I don't see any mappers which take in
> > int, vector pairs...
> >
> >
> >> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
> >> ints to be used as dimensions in a Vector. So in most cases things are
> >> keyed by these int pseudo-IDs. That's OK too.
> >>
> >> A matrix is a bunch of vectors -- at least, that's a nice structure
> >> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
> >>
> >> Is that not what other jobs are using?
> >> What's the better alternative we could think about converging on?
> >>
> >
> > Yes, as long as the *on HDFS* representation is a
> > SequenceFile<IntWritable,VectorWritable>, we can interoperate.  Or
> > now that you've moved on to VIntWritable, I should migrate the
> > distributed matrix stuff to do the same.
> >
> > And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses
> > are reusable and would reduce replicated work as well...
> >
> >  -jake
> >
>
