mahout-dev mailing list archives

From "Dmitriy Lyubimov (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (MAHOUT-376) Implement Map-reduce version of stochastic SVD
Date Wed, 01 Dec 2010 17:07:12 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965742#action_12965742 ]

Dmitriy Lyubimov edited comment on MAHOUT-376 at 12/1/10 12:05 PM:
-------------------------------------------------------------------

There is a catch, though. With this implementation, m is memory-bound. The bound is not in
the mapper, however, but in the combiners and in the mapper of the next step.

But my reasoning was that, with 1G of memory and a given k+p, there seems to be a rather wide
spectrum of admissible combinations of r (block height) and minSplitSize (which essentially
governs the number of mappers needed) that would cover a billion rows, and the sweet spot of
that combination seems to exceed 1 bln rows.
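
For a rough feel of that spectrum, here is a back-of-the-envelope sketch. The accounting
(one dense (k+p) x (k+p) R block retained per r-row block of A) and all of the constants are
my own illustrative assumptions, not figures taken from the patch:

// Illustrative only: estimate how much heap the accumulated R blocks need
// if one dense (k+p) x (k+p) block of doubles is kept per r-row block of A.
public class RFootprintEstimate {
  public static void main(String[] args) {
    long m  = 1_000_000_000L;  // rows of A (hypothetical)
    int  r  = 100_000;         // block height: rows of A per Q/R block (hypothetical)
    int  kp = 100;             // k + p (hypothetical)

    long blocks = (m + r - 1) / r;              // number of R blocks produced
    long bytes  = blocks * (long) kp * kp * 8;  // dense doubles, no compression
    System.out.println(blocks + " R blocks, ~" + (bytes / 1_000_000) + " MB of R data");
  }
}

With these made-up numbers the accumulated R data stays around 800 MB for a billion rows of A,
i.e. just under a 1G heap; shrinking r or growing k+p pushes it over, which is the trade-off
referred to above.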

In addition, there are two remedies to consider. The first is a rather straightforward
application of compression to the R sequence.

The second remedy follows from the fact that our QR merge process is hierarchical. Right now it
is a two-level hierarchy; i.e., if the processes at each stage merge 1000 Q blocks, then at the
next level we can merge another 1000 Q blocks, so the total number of Q blocks is approximately
1 mln (for 1G of memory, something like 600-800k blocks is more plausible). Assuming Q blocks
are k+p rows high, that gives us approximately 300 mln rows for m. But the trick is that if we
have 1G of memory in the mapper, then Q blocks don't have to be 500 rows high; they can easily
be 500k rows high. That immediately puts us in the range of 30 bln rows or so without any
additional remedies, which I think is about the scale of Google's document index in 2005.
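
Just to make that two-level arithmetic explicit (the 1000-way fan-in and the 500-row block
height come from the description above; the wrapper code and variable names are merely
illustrative):

// Capacity sketch for the hierarchical QR merge:
//   m ~= (number of Q blocks the hierarchy can absorb) * (rows per Q block)
public class MergeCapacity {
  static long capacity(long qBlocks, long rowsPerBlock) {
    return qBlocks * rowsPerBlock;
  }

  public static void main(String[] args) {
    long fanIn     = 1_000L;          // Q blocks merged per level
    long twoLevels = fanIn * fanIn;   // ~1 mln blocks in theory
    long memCapped = 600_000L;        // 600-800k is the more plausible figure with 1G

    System.out.println(capacity(twoLevels, 500L));  // 500000000: theoretical ceiling
    System.out.println(capacity(memCapped, 500L));  // 300000000: the ~300 mln figure above
  }
}

Raising the block height from 500 rows to 500k rows scales the capacity proportionally, which
is what buys the jump into the tens of billions of rows.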


But if we add another map-only pass over the blocked Q data, then we can have 1 mln blocks with
all the considerations above, and that should put us in the range of 30 trillion rows of A.
This number grows exponentially with every added MR pass, which is why I am saying m is
virtually unbounded.
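
Spelled out, the capacity is roughly m_max ~= (rows per Q block) * fanIn^passes, with a fan-in
of ~1000 per pass in the example above; each additional map-only merge pass therefore
multiplies the admissible m by another factor of ~1000 (e.g. 30 bln x 1000 = 30 trillion).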

Adding these remedies seems to be pretty straightforward, but for a first stab at the problem
my estimates for the m bound seem adequate. With these kinds of numbers, this may easily become
a technology in search of a problem. We may add some analysis of the optimal combination of
block size and minSplitSize. My initial thought is that finding the maximum here is pretty
straightforward; it seems to be a matter of finding the maximum of a second-degree polynomial.
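
(For a concave quadratic f(x) = a*x^2 + b*x + c with a < 0, the maximum sits at x = -b/(2a),
so once the coefficients are estimated the optimal setting is a closed-form expression rather
than a search.)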

It is more likely that the extra memory would go into precision rather than into maximizing the
m bound, though. E.g., instead of covering scale, the resources may go into increasing
precision and oversampling, in which case additional map-only passes over Q will be
tremendously useful. (Imagine being able to do k+p=50000 with just one additional map-only pass
over the Q data.) If that is the case, then the next low-hanging-fruit step is to add a
map-only hierarchical merge of the Rs on the Q blocks.

However, it is a stochastic algorithm, and as such it is probably not good enough for processes
that would require such precision (e.g., certainly not for doing the math on rocket boosters, I
think). That said, k+p=50000 probably doesn't make sense. I think applications divide sharply
into two categories, where the precision requirements are either much higher than that or much
lower; I can't think of much in between.



> Implement Map-reduce version of stochastic SVD
> ----------------------------------------------
>
>                 Key: MAHOUT-376
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-376
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.5
>
>         Attachments: MAHOUT-376.patch, Modified stochastic svd algorithm for mapreduce.pdf,
QR decomposition for Map.pdf, QR decomposition for Map.pdf, QR decomposition for Map.pdf,
sd-bib.bib, sd.pdf, sd.pdf, sd.pdf, sd.pdf, sd.tex, sd.tex, sd.tex, sd.tex, SSVD working notes.pdf,
SSVD working notes.pdf, SSVD working notes.pdf, ssvd-CDH3-or-0.21.patch.gz, ssvd-m1.patch.gz,
ssvd-m2.patch.gz, ssvd-m3.patch.gz, Stochastic SVD using eigensolver trick.pdf
>
>
> See attached pdf for outline of proposed method.
> All comments are welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

