Awesome!
Let me know how it goes. I would love to kibitz but can't code anything.
One thing that I did notice: after computing y = AR, a critical step
is the SVD of y. That is probably best done by decomposing y'y, since
y'y is only a k x k matrix.
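A minimal sketch of that step (my own illustration, not code from this thread): eigendecompose the small Gram matrix G = y'y, take square roots of the eigenvalues as singular values, and recover the left singular vectors as U = yV/sigma. For brevity this uses k = 2 so the symmetric eigenproblem has a closed form; all names here are assumptions.

```python
# Sketch: SVD of tall y (n x k) via the k x k Gram matrix G = y'y.
# Uses k = 2 so G can be eigendecomposed in closed form.
import math, random

n, k = 50, 2
rng = random.Random(42)
y = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]

# G = y'y is only k x k, no matter how tall y is.
G = [[sum(y[r][i] * y[r][j] for r in range(n)) for j in range(k)]
     for i in range(k)]

# Closed-form eigendecomposition of the symmetric 2x2 matrix G.
a, b, c = G[0][0], G[0][1], G[1][1]
tr, det = a + c, a * c - b * b
disc = math.sqrt(tr * tr - 4 * det)
lam = [(tr + disc) / 2, (tr - disc) / 2]   # eigenvalues, descending

# Eigenvectors of G are the right singular vectors of y;
# singular values are sqrt(eigenvalues).
V = []
for l in lam:
    vx, vy = (b, l - a) if abs(b) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(vx, vy)
    V.append([vx / norm, vy / norm])
sigma = [math.sqrt(max(l, 0.0)) for l in lam]

# Left singular vectors: U = y V / sigma, column by column.
U = [[sum(y[r][j] * V[i][j] for j in range(k)) / sigma[i]
      for i in range(k)] for r in range(n)]

# Sanity check: U diag(sigma) V' should reconstruct y.
err = max(abs(sum(U[r][i] * sigma[i] * V[i][j] for i in range(k)) - y[r][j])
          for r in range(n) for j in range(k))
print("max reconstruction error:", err)
```

For general k you would hand G to any symmetric eigensolver; the point is just that the expensive object (y) never has to be decomposed directly.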
So ... if R is defined by a PRNG with well-defined seeds for each row, then
you don't have to store it at all. It is particularly nice if you can
compute the nth random number from a particular seed at high speed. If you
can do that, then multiplication by a sparse matrix is very clean. You can
probably get that with most generators by building the moral equivalent of
a skip list ... just save the generator state every few hundred elements.
A generator that can be skipped ahead more efficiently would be better, but
this is usable. Note the comments in the articles about how even very poor
generators work just fine.
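Here is a rough sketch of both tricks (names like STRIDE, row_seed, and r_row are mine, not anything from the thread): rows of R are regenerated on demand from per-row seeds, the skip-list trick checkpoints the generator state every few hundred draws so the nth value is cheap, and a sparse row times the implicit R only ever touches the rows of R that the nonzeros hit.

```python
# Sketch: an implicit R that is never stored, plus the checkpointed
# "skip list" for fast access to the nth value of a random stream.
import random

k = 4            # columns of R
STRIDE = 100     # checkpoint interval

def row_seed(j):
    # Any well-defined mapping from row index to seed will do.
    return 1_000_003 * j + 17

def r_row(j):
    # Regenerate row j of R on demand; nothing is stored.
    rng = random.Random(row_seed(j))
    return [rng.gauss(0, 1) for _ in range(k)]

def checkpoints(seed, length, stride=STRIDE):
    # Save the generator state every `stride` draws.
    rng = random.Random(seed)
    states = [rng.getstate()]
    for i in range(length):
        rng.random()
        if (i + 1) % stride == 0:
            states.append(rng.getstate())
    return states

def nth_random(states, n, stride=STRIDE):
    # Jump to the nearest checkpoint, then iterate at most `stride` steps.
    rng = random.Random()
    rng.setstate(states[n // stride])
    for _ in range(n % stride):
        rng.random()
    return rng.random()

def sparse_times_R(sparse_row):
    # sparse_row: {column index of A: value}. Returns one row of y = AR.
    # Only the rows of R hit by nonzeros are ever generated.
    y_row = [0.0] * k
    for j, a_j in sparse_row.items():
        r = r_row(j)
        for c in range(k):
            y_row[c] += a_j * r[c]
    return y_row
```

The checkpoint spacing trades memory for seek time exactly like a one-level skip list: with a stride of a few hundred, random access costs at most a few hundred iterations of the generator.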
Now in an MR setting, A will probably be split into rowwise patches, and
y = AR is the rowwise concatenation of the corresponding patches of y. With
the PRNG trick, each mapper can have an implicit copy of R; otherwise R
will have to be distributed as a side file to all mappers and a merge will
be necessary. But if we are after y'y as well, we can compute it by taking
the outer product of each row of y and summing the resulting k x k
matrices. This works wonderfully because combiners and such can combine
partial map outputs. You probably still have to store y itself, but with a
few fancy keys and a multiple output format, you should be good to go.
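The dataflow above can be sketched in plain Python standing in for the mappers and combiner (function names are mine; a real Hadoop job would emit keyed records instead of returning tuples): each "mapper" sees one rowwise patch of A, regenerates R implicitly, emits its rows of y plus a partial k x k sum of outer products, and the partials simply add up to y'y.

```python
# Sketch of the MR dataflow: rowwise patches of A -> rows of y plus
# partial y'y sums; a combiner/reducer adds the k x k partials.
import random

k = 3

def r_row(j, seed_base=7):
    # Implicit R: row j is regenerated from a per-row seed.
    rng = random.Random(seed_base + j)
    return [rng.gauss(0, 1) for _ in range(k)]

def mapper(patch):
    # patch: a list of (dense) rows of A. Emits (rows of y, partial y'y).
    y_rows, gram = [], [[0.0] * k for _ in range(k)]
    for a in patch:
        y = [sum(a[j] * r_row(j)[c] for j in range(len(a)))
             for c in range(k)]
        y_rows.append(y)
        for i in range(k):           # outer product of y, summed in place
            for c in range(k):
                gram[i][c] += y[i] * y[c]
    return y_rows, gram

def combine(grams):
    # Combiner/reducer: the k x k partials just add up.
    total = [[0.0] * k for _ in range(k)]
    for g in grams:
        for i in range(k):
            for c in range(k):
                total[i][c] += g[i][c]
    return total

# Split A into two rowwise patches, run the "mappers", sum the partials.
rng = random.Random(1)
A = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(8)]
out = [mapper(p) for p in (A[:4], A[4:])]
y_full = [row for y_rows, _ in out for row in y_rows]
gram = combine([g for _, g in out])
```

Because matrix addition is associative and commutative, the combiner can fold partial grams in any order, which is exactly what makes this shape MR-friendly.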
On Sun, Sep 27, 2009 at 9:03 PM, Jake Mannix <jake.mannix@gmail.com> wrote:
> That shouldn't be too hard to Hadoopify. I'll see if I can try it out
> while porting over the decomposer stuff.
>

Ted Dunning, CTO
DeepDyve
