mahout-user mailing list archives

From: Dmitriy Lyubimov <dlie...@gmail.com>
Subject: Re: SSVD fails on seq2sparse output.
Date: Thu, 15 Nov 2012 20:56:45 GMT
On Thu, Nov 15, 2012 at 12:09 PM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:

> Dmitriy,
>
>
> 3) I can apply SSVD to a sample (0.1% of my data), but it fails with 100%
> of the data (the Bt-job stops in the map phase with "Java heap space" or
> "timeout" errors).
> The input is a sparse 20,000,000 x 150,000 matrix with ~0.03% non-zero
> values (8GB total).
>

This should not happen if you use at least -Xmx1G for your MR tasks (it
looks like you do). In fact, I would be more worried about the ABt job
(since you use -q=1) -- those jobs are real memory hogs. Also, try to be a
bit less ambitious and run -k 100 first, although that has no measurable
bearing on the memory required, only on the running time.

I also do not understand the rationale behind
-Dmapred.max.split.size=1000000. The default split size should be good
enough.
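
Putting both suggestions together, a revised invocation might look like the
sketch below: the split-size override is dropped, the rank is lowered to
100, and each task gets a 2GB heap via mapred.child.java.opts. The heap
value is illustrative, not tested, and if memory serves, Hadoop's
GenericOptionsParser only picks up generic -D options when they come before
the job-specific flags:

====================
mahout-distribution-0.7/bin/mahout ssvd \
-Dmapred.child.java.opts=-Xmx2048m \
-i /tmp/pabramov/sparse/tfidf-vectors/ \
-o /tmp/pabramov/ssvd \
-k 100 \
-q 1 \
--reduceTasks 150 \
--tempDir /tmp/pabramov/tmp \
-ow
====================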

But I have nothing definite to put my finger on in your configuration.

It is possible that you sometimes encounter extra-dense vectors (a
superactive user), which in your case may hold up to 20M ratings, i.e.
about 160MB per vector at 8 bytes per element. But assuming -Xmx2G and
k=200, p=15, the memory should be more than plenty.

The most useful advice, if you are set on running SVD on your data, is to
read through the operational setup here:
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
pages 165 and onward. Nathan ran setups on inputs as big as 90GB of very
sparse data. (I am guessing the ABt job has improved a little bit since
then, but it is still a bottleneck.)


>
> How I use it:
>
> ====================
> mahout-distribution-0.7/bin/mahout ssvd \
> -i /tmp/pabramov/sparse/tfidf-vectors/ \
> -o /tmp/pabramov/ssvd \
> -k 200 \
> -q 1 \
> --reduceTasks 150 \
> --tempDir /tmp/pabramov/tmp \
> -Dmapred.max.split.size=1000000 \
> -ow
> ====================
>
> Can't get past the Bt-job... Should I decrease split.size and/or add extra params?
> Hadoop has 400 Map and 300 reduce slots with 1 CPU core and 2GB RAM per
> task.
> Q-job completes in 20 minutes.
>
> Many thanks in advance!
>
> Pavel
>
>
> ________________________________________
> From: Dmitriy Lyubimov [dlieu.7@gmail.com]
> Sent: November 15, 2012, 21:53
> To: user@mahout.apache.org
> Subject: Re: SSVD fails on seq2sparse output.
>
> On Thu, Nov 15, 2012 at 3:43 AM, Abramov Pavel <p.abramov@rambler-co.ru> wrote:
>
> >
> > Many thanks in advance, any suggestion is highly appreciated. I don't
> > know what to do: CF produces inaccurate results for my tasks, and SVD is
> > the only hope ))
> >
>
> I am also doubtful about that (if you are trying to factorize your
> recommendation space). SVD has proven to be notoriously inadequate for
> that problem. ALS-WR would be a much better first stab.
>
> However, since you seem to be performing text analysis (seq2sparse), I
> don't immediately see how this relates to collaborative filtering --
> perhaps if you told us more about your problem, I am sure there are people
> on this list who could advise you on the best course of action.
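
For reference, a first ALS-WR stab with Mahout 0.7's parallelALS job might
look like the sketch below. The ratings.csv path is hypothetical, the input
is assumed to be userID,itemID,rating lines, and the hyperparameters are
common starting points rather than tuned values:

====================
mahout-distribution-0.7/bin/mahout parallelALS \
-i /tmp/pabramov/ratings.csv \
-o /tmp/pabramov/als \
--numFeatures 20 \
--numIterations 10 \
--lambda 0.065 \
--tempDir /tmp/pabramov/als-tmp
====================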
>
>
> > Regards,
> > Pavel
