mahout-user mailing list archives

From Pat Ferrel <...@occamsmachete.com>
Subject Re: cross recommender
Date Sat, 06 Apr 2013 22:26:33 GMT
I guess I don't understand this issue.

In my case both the item IDs and user IDs of the separate DistributedRowMatrix instances will match,
and I know the size of the entire space from a previous step where I create ID maps. I suppose
you are saying that the m/r code would be super simple if a row of B' and a column of A could
be processed together, which I understand to be the optimal implementation.

So calculating [B'A] seems like TransposeJob and MultiplyJob, and it does seem to work. You lose
the ability to substitute different RowSimilarityJob measures. I assume this creates something
like the co-occurrence similarity measure. But oh, well. Maybe I'll look at that later.
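
A minimal sketch of what "substituting a measure" means here, using plain Python lists rather than Mahout's DistributedRowMatrix. The function names (`cross_similarity`, `cosine`) and the toy data are hypothetical and are not Mahout APIs. Entry (i, j) of B'A is the dot product of column i of B with column j of A, which for binary data is a cross-co-occurrence count; swapping the dot product for another function gives a different measure over the same column pairs.

```python
import math

def column(matrix, j):
    """Return column j of a dense row-major matrix (list of lists)."""
    return [row[j] for row in matrix]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def cross_similarity(B, A, measure=dot):
    """Compare column i of B with column j of A for every (i, j).

    With measure=dot the result is exactly B'A.
    """
    n_b, n_a = len(B[0]), len(A[0])
    return [[measure(column(B, i), column(A, j)) for j in range(n_a)]
            for i in range(n_b)]

# Toy data (hypothetical): 3 users; A has 2 items, B has 3 items.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 0, 1],
     [0, 1, 1],
     [0, 0, 1]]

counts = cross_similarity(B, A)           # plain B'A, co-occurrence counts
cosines = cross_similarity(B, A, cosine)  # same pairs, different measure
```

This dense version only illustrates the arithmetic; a distributed job would work on sparse rows and would not materialize columns this way.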

I also see why you say the two matrices A and B don't have to have the same size since [B'A]H_v
= [B'A]A' so the dimensions will work out as long as the users dimension is the same throughout.
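
A small shape check of that statement, as a sketch in plain Python. The helper names and matrix sizes are made up for illustration, and H_v is taken to be A' as in the expression above. A is users x items_A and B is users x items_B, with different item counts but the same number of users.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Dense matrix product; fails loudly if inner dimensions differ."""
    assert len(X[0]) == len(Y), "inner dimensions must match"
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt]
            for row in X]

def shape(M):
    return (len(M), len(M[0]))

# 3 users in both matrices; A has 2 items, B has 4 items.
A = [[1, 0],
     [1, 1],
     [0, 1]]
B = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]

BtA = matmul(transpose(B), A)   # items_B x items_A
R = matmul(BtA, transpose(A))   # items_B x users
```

The only dimension that has to agree in both products is the one being summed over: users in B'A, and A's items in [B'A]A'.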


On Apr 6, 2013, at 7:46 AM, Sebastian Schelter <ssc@apache.org> wrote:

Completely concur with that. MatrixMultiplicationJob is already using a
mapside merge-join AFAIK.


On 05.04.2013 15:04, Ted Dunning wrote:
> This may not quite be true because the RSJ is able to take some liberties.
> 
> The origin of these is that A'A can be viewed as a self join.  Thus as rows of A are
> read, the cooccurrences can be emitted as they are read.
> 
> For B'A, we have to somehow get corresponding rows of A and B at the same time in the
> same place.  If both matrices are stored in sparse row-major form, then a map-side merge join
> would work at the cost of some locality.  You can recover that locality in special cases by
> a few tricks.  For instance, you might actually store A and B as adjoined rows.  That means
> that fetching a row of A inherently also gives a row of B.  Not sure how this could come about.
> 
> A second way to get the locality is to use a system like MapR (conflict of interest alert,
> vendor specific feature alert, yada yada).  In such a system, you can force files to be co-resident.
> In MapR, this is done by setting chunk size to zero and storing A and B in the same volume.
> This makes that volume only be stored in a single container which forces all of the files
> in that volume to have exactly the same replication pattern.  It also makes that volume not
> scale as well.  When this is feasible, it can result in a massive speed improvement.  I know
> of one site that does this and reportedly achieves 10-20x speed up because of the decrease
> in non-local reads.
> 
> A third option is to use a reduce side join.  This would be necessary if A and B were
> ever not stored with rows in sequential order and were also not randomly accessible.  I would
> avoid this option if possible.
> 
> On Apr 3, 2013, at 10:21 AM, Sebastian Schelter wrote:
> 
>> I don't think you need to run RowSimilarityJob on B'A, I think you would
>> need an equivalent of RowSimilarityJob to compute B'A. I guess you could
>> extend the MatrixMultiplicationJob to use the similarity measures from
>> RowSimilarityJob instead of standard dot products.
> 
> 
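
The merge join over row-sorted sparse matrices described in the quoted message can be sketched in a single process as follows. This is an illustration under stated assumptions, not Mahout's MatrixMultiplicationJob: each matrix is assumed to be a list of (user_id, {item_id: value}) pairs sorted by user_id, and the names `merge_join_rows` and `cross_cooccurrence` are hypothetical. Each matched pair of rows contributes the outer product of the B row and the A row to B'A.

```python
from collections import defaultdict

def merge_join_rows(rows_a, rows_b):
    """Yield (row_id, a_row, b_row) for ids present in both sorted inputs."""
    it_a, it_b = iter(rows_a), iter(rows_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1], b[1]
            a, b = next(it_a, None), next(it_b, None)
        elif a[0] < b[0]:
            a = next(it_a, None)   # user missing from B, skip
        else:
            b = next(it_b, None)   # user missing from A, skip

def cross_cooccurrence(rows_a, rows_b):
    """Accumulate B'A as a sparse dict keyed by (item_in_B, item_in_A)."""
    acc = defaultdict(float)
    for _, a_row, b_row in merge_join_rows(rows_a, rows_b):
        for i, b_val in b_row.items():
            for j, a_val in a_row.items():
                acc[(i, j)] += b_val * a_val
    return dict(acc)

# Toy data (hypothetical): 3 users, rows sorted by user id.
rows_a = [(0, {0: 1.0}), (1, {0: 1.0, 1: 1.0}), (2, {1: 1.0})]
rows_b = [(0, {0: 1.0, 2: 1.0}), (1, {1: 1.0, 2: 1.0}), (2, {2: 1.0})]

result = cross_cooccurrence(rows_a, rows_b)
```

In a real map-side join the accumulation would be split between mappers and a combiner or reducer; the in-memory dict here stands in for that step.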


