mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: SSVD too slow to handle large matrix?
Date Fri, 14 Sep 2012 21:48:47 GMT
for that problem, like I said, just drop all the optional parameters and go with the defaults.

most importantly, give the child processes enough memory without hitting
swap. The Hadoop default used to be only 200m (don't know about now).
That will surely cause GC thrashing and slow turnaround (if the job goes
through at all).
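
A minimal sketch of the above, assuming a Hadoop 1.x cluster where task JVM options are passed through the mapred.child.java.opts property. The -Xmx1g heap and -k 100 are illustrative values taken from figures mentioned further down this thread, not tuned recommendations; the input and output paths are the ones from the original command. The script only assembles and prints the command so it can be inspected before running.

```shell
# Sketch only: builds the command line and prints it rather than running it.
# CHILD_HEAP is an assumed value; size it to what each node can hold without swapping.
CHILD_HEAP="-Xmx1g"

# mapred.child.java.opts is the Hadoop 1.x property for task JVM options.
# Optional ssvd parameters (-p, -r, -t, max split size) are left out so the defaults apply.
CMD="mahout ssvd -Dmapred.child.java.opts=${CHILD_HEAP} -i mf/tr_full.seq -o mf/out_full -k 100 --tempDir mf/tmp"

echo "$CMD"
```

If the heap cannot be raised that far without swapping, fewer concurrent task slots per node is the other knob to look at.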

On Fri, Sep 14, 2012 at 2:47 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> yeah, sounds like something is wrong. 300mb is not huge. I have
> problems of around 2G input on 10 nodes and it doesn't take that long
> at all. Another researcher I knew was doing something similar in
> the vicinity of, I think, 4-5B non-zeros.
>
> On Fri, Sep 14, 2012 at 2:41 PM, lei tang <find.ltang@gmail.com> wrote:
>> there are around 100M non-zero entries.  The sequence file size is not that
>> huge, around 300M bytes.
>>
>> i'll check out your other options to see what is wrong.
>>
>> - Lei
>>
>> On Fri, Sep 14, 2012 at 2:23 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>>
>>> most importantly, what's your number of non-zero elements (or input
>>> sequence file size)?
>>>
>>> On Fri, Sep 14, 2012 at 2:19 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>> > Q job is actually the fastest and is map-only. I'd say you drop all the
>>> > optional parameters (including p) and use mahout 0.7.
>>> >
>>> > Actually reducing split size is unlikely to help. Default split should
>>> > be fine.
>>> >
>>> > I'd say running -k 10 on any sized input should result in a Q mapper
>>> > task running in at most a couple of minutes.
>>> >
>>> > using -k200 -p100 is fairly ambitious (mapper task running time will
>>> > scale a little worse than proportional to k+p).
>>> >
>>> > if you use -q1 you will likely have more problems with the ABt job and
>>> > that may require some memory tuning...
>>> >
>>> > otherwise check the usual things -- memory, cluster capacity (do you
>>> > actually have capacity running 100 mappers? Do they have at least 1G
>>> > of RAM on -Xmx without scratching the swap? Are you seeing GC
>>> > thrashing? etc.)
>>> >
>>> > That said, your problem doesn't seem too big (judging from 100 mappers
>>> > with a regular split size, that should be ok). With -k 100 and default
>>> > p you should expect a single Q task to run about 20-25 minutes,
>>> > depending on your hardware. It is cpu-bound (or rather, mostly
>>> > fpu-bound, assuming you tackled memory issues etc.)
>>> >
>>> >
>>> > On Fri, Sep 14, 2012 at 1:24 PM, lei tang <find.ltang@gmail.com> wrote:
>>> >> Hi,
>>> >>
>>> >> I am using mahout's SSVD (stochastic SVD) to factorize a huge sparse
>>> >> matrix (around 30M x 1M). I used a modified script of
>>> >> http://bickson.blogspot.com/2011/02/mahout-svd-matrix-factorization.html
>>> >> to store the input matrix with <key, value> pairs being integer and
>>> >> vectorwritable (in particular, SequentialAccessSparseVector). Should I
>>> >> change to RandomAccessSparseVector?
>>> >>
>>> >> I managed to run mahout SSVD with the following specification.
>>> >> mahout ssvd -Dmapred.max.split.size=1000000 -i mf/tr_full.seq -o
>>> >> mf/out_full -k 200 -p 100 -r 100000 -U true -V true -t 20 --tempDir mf/tmp
>>> >>
>>> >> I specified the max split in order to have more mappers.  However, the
>>> >> first Q job does not seem to be moving. After 1 hour, it is still at 12%
>>> >> with 100 mappers.  Is this expected?  Should I change any parameter?
>>> >>
>>> >> Any suggestion is highly appreciated.
>>> >>
>>> >> - Lei
>>> >> P.S.  I'm also reading the docs from
>>> >> https://issues.apache.org/jira/browse/MAHOUT-376 in the hope that I can
>>> >> figure out why it is so slow.
>>>
