mahout-dev mailing list archives

From Sean Owen <sro...@gmail.com>
Subject Re: [jira] [Commented] (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
Date Sun, 22 May 2011 19:35:10 GMT
I think you'll have to push that for 1.0 for now then; 0.20.x doesn't
have map-side joins. Yes, that is a blocker for what you're trying to
do and for what Sebastian is trying to do for recommendations. I've
already reimplemented the recommenders separately with these things,
and it simplifies and speeds up the pipeline.


I'd object more to sticking with 0.20.x, except that there's evidently
already some issue even getting *on* to 0.20.x in the code, which is
more important to address. And the jump to 0.21.x is only a moderate
increase in functionality, yet taking advantage of it still requires
rewriting everything. Maybe we should wait for an even bigger leap
forward before rewriting everything.


Here's a summary of my recipe for dealing with this in 0.20.x.

First, while a job can't have multiple mappers, it can have multiple
input paths. So you can join two different inputs that share the same
keys without trouble, typically with an identity Mapper. Of course,
both inputs have to have the same value class, which is a problem if
you want to join Xs and Ys keyed by the same key.

One solution is to create an "XOrYWritable" which holds either an X or
a Y. Then the jobs that output an X or a Y both output the same value
type, XOrYWritable. See VectorOrPrefWritable for instance.
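
For illustration, here is a minimal sketch of what such a holder might
look like. It is not the actual VectorOrPrefWritable: X is assumed to be
a double[] and Y a long, and the class is plain Java so it can run
without Hadoop on the classpath. In a real job it would implement
org.apache.hadoop.io.Writable, whose write and readFields methods take
the same DataOutput and DataInput arguments used here. The roundTrip
helper is purely hypothetical and exists only to exercise the
serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: holds either an X (assumed double[]) or a Y (assumed long).
// A real version would declare "implements org.apache.hadoop.io.Writable".
class XOrYWritable {
  private boolean isX;
  private double[] x;
  private long y;

  static XOrYWritable ofX(double[] x) {
    XOrYWritable w = new XOrYWritable();
    w.isX = true;
    w.x = x;
    return w;
  }

  static XOrYWritable ofY(long y) {
    XOrYWritable w = new XOrYWritable();
    w.isX = false;
    w.y = y;
    return w;
  }

  boolean isX() { return isX; }
  double[] getX() { return x; }
  long getY() { return y; }

  // Write a one-byte tag first so the reader knows which payload follows.
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(isX);
    if (isX) {
      out.writeInt(x.length);
      for (double d : x) {
        out.writeDouble(d);
      }
    } else {
      out.writeLong(y);
    }
  }

  // Read the tag, then only the payload it announces; clear the other field.
  public void readFields(DataInput in) throws IOException {
    isX = in.readBoolean();
    if (isX) {
      int n = in.readInt();
      x = new double[n];
      for (int i = 0; i < n; i++) {
        x[i] = in.readDouble();
      }
      y = 0L;
    } else {
      x = null;
      y = in.readLong();
    }
  }

  // Hypothetical helper: serialize to bytes and back, as the shuffle would.
  static XOrYWritable roundTrip(XOrYWritable value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      value.write(new DataOutputStream(bytes));
      XOrYWritable copy = new XOrYWritable();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Both upstream jobs would then declare XOrYWritable as their output
value class, one wrapping its results with ofX and the other with ofY.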

The Reducer can then check each value to pick out an X or a Y and get both.
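
As a rough sketch of that check, the method below mimics the body of a
reduce call in plain Java. TaggedValue and JoinReduceSketch are
hypothetical names, the String payloads stand in for real X and Y
values, and emitting every X paired with every Y is just one assumed way
to use the two sides once they are separated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an XOrYWritable: a flag plus a payload.
class TaggedValue {
  final boolean isX;
  final String payload;

  TaggedValue(boolean isX, String payload) {
    this.isX = isX;
    this.payload = payload;
  }
}

class JoinReduceSketch {
  // Mimics reduce(key, values): values for one key arrive in no guaranteed
  // order, so each one is inspected and routed to the X side or the Y side.
  static List<String> joinForKey(long key, Iterable<TaggedValue> values) {
    List<String> xs = new ArrayList<String>();
    List<String> ys = new ArrayList<String>();
    for (TaggedValue v : values) {
      // In a real Reducer, copy the payload out of the Writable before
      // buffering it, since Hadoop reuses the value instance while iterating.
      if (v.isX) {
        xs.add(v.payload);
      } else {
        ys.add(v.payload);
      }
    }
    // Assumed join semantics: pair every X with every Y for this key.
    List<String> joined = new ArrayList<String>();
    for (String x : xs) {
      for (String y : ys) {
        joined.add(key + ":" + x + "|" + y);
      }
    }
    return joined;
  }
}
```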


In some cases you have to know the ordering, that is, whether you'll
get an X or a Y first. In this case you need some cleverness with the
key. Instead of a VarLongWritable for a key, you need something like
"EntityJoinKey" which contains a long value (the ID) but also a
boolean or integer that indicates an ordering. Maybe it adds a boolean
called "before".

It needs to implement WritableComparable and order first by the ID
value, then by the before/after flag.
The job also needs a Partitioner which maps keys to the same
reducer if they have the same ID, regardless of the before/after flag.
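
A minimal sketch of the ordering and partitioning logic, in plain Java
so it runs without Hadoop. The field names and the partitionFor helper
are assumptions for illustration. A real key would implement
WritableComparable (adding write and readFields), and the partitioning
would live in a subclass of org.apache.hadoop.mapreduce.Partitioner,
overriding getPartition(key, value, numPartitions).

```java
// Hypothetical composite key: an entity ID plus a flag that controls ordering.
class EntityJoinKey implements Comparable<EntityJoinKey> {
  final long id;
  final boolean before;

  EntityJoinKey(long id, boolean before) {
    this.id = id;
    this.before = before;
  }

  // Order by ID first; within one ID, "before" keys sort ahead of the rest.
  @Override
  public int compareTo(EntityJoinKey other) {
    int byId = Long.compare(id, other.id);
    if (byId != 0) {
      return byId;
    }
    return Boolean.compare(other.before, before);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof EntityJoinKey)) {
      return false;
    }
    EntityJoinKey k = (EntityJoinKey) o;
    return id == k.id && before == k.before;
  }

  @Override
  public int hashCode() {
    return 31 * Long.hashCode(id) + (before ? 1 : 0);
  }

  // Stand-in for Partitioner.getPartition: uses only the ID, so the "before"
  // and "after" keys for one entity always land on the same reducer.
  static int partitionFor(EntityJoinKey key, int numPartitions) {
    return (Long.hashCode(key.id) & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Sorting a mixed list of such keys puts each ID's "before" key
immediately ahead of its "after" key, which is the order the reducer
would see them in.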

This is fairly convenient because you have a clearer picture of which
values are coming in on "before" keys and then which are coming after.


It's definitely more complex, but it's doable.



On Sun, May 22, 2011 at 8:20 PM, Shannon Quinn <squinn@gatech.edu> wrote:
> What did you have in mind, then, for making matrix multiplication work
> without map-side joins (or at least, in the simple format available in
> 0.18)?
