mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: distributed svd
Date Fri, 23 Jul 2010 23:09:55 GMT
Hi Lazlo,

On Fri, Jul 23, 2010 at 1:14 AM, Laszlo Dosa <laszlo.dosa@fredhopper.com>
wrote:

I tried to calculate SVD based recommendation by:

1)      Calculate A'A

2)      Calculate V_k by using Lanczos to get top k eigenvectors of A'A

 The SVD implementation in Mahout does not require you do these two steps
separately - they happen at the same time if you pass in the A matrix to
the DistributedLanczosSolver directly.

3)      Calculate preferences for each user by Vector usersPreference =
V_k.timesSquared(row of A);

 What you want to do here is actually take the A matrix, represented by a
DistributedRowMatrix, and the V_k matrix (you actually get basically V_k's
transposed from the SVD job), represented by that same class, and then just
compute A . V . S^-1, where S is the diagonal matrix of singular values
(note that these are the sqrts of the eigenvalues you get out of Lancos).

A . V can be computed in one mapreduce job (DistributedRowMatrix.times() ),
so that's the way to do that efficiently.

  -jake


4)      Calculate 20 recommendation for each user by selecting the top 20
preference from usersPreference

The number of users and items is between 100 000 and 300 000.

In my experience Step 3) scales linearly by the number of users.

Each job runs fast, but I have ~100 000 jobs. This slows down the whole
calculation.

Is the approach wrong that I use?

Or can you suggest something to  make this faster?

Regards,

Laszlo

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message