mahout-user mailing list archives

From Han JU <ju.han.fe...@gmail.com>
Subject Re: ALS-WR on Million Song dataset
Date Wed, 20 Mar 2013 11:38:38 GMT
Thanks again Sebastian and Sean. I set -Xmx4000m for mapred.child.java.opts
and 8 threads for each mapper. Now the job runs smoothly and the whole
factorization finishes in 45 minutes. With your settings I think it would be
even faster.
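For reference, my invocation looks roughly like this (the driver and flag
names are from the trunk build I have, and the lambda/feature/iteration
values here are placeholders, not the ones I tuned -- double-check against
your own build):

```shell
# One mapper per tasktracker, ~4 GB child-JVM heap, 8 solver threads.
# Driver name and flags are from the svn trunk; verify with
# `bin/mahout parallelALS --help` before reusing.
bin/mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.child.java.opts=-Xmx4000m \
  --input /path/to/triples \
  --output /path/to/als-output \
  --implicitFeedback true \
  --alpha 40 \
  --lambda 0.065 \
  --numFeatures 20 \
  --numIterations 10 \
  --numThreadsPerSolver 8
```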

One more thing: RecommendJob is rather slow when it computes
recommendations for all users. For example, I want a list of the top 500
items to recommend. Any pointers on how to modify the job code so that it
consults a file and computes recommendations only for the user IDs in that
file?
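To make the idea concrete, here is the kind of pre-filter I have in mind (a
Python sketch of the approach only -- the real change would go in
RecommendJob's mapper, and the helper names below are made up, not Mahout's
API):

```python
# Sketch: load the wanted user IDs into a set, then skip every other user
# before doing the (expensive) top-N scoring. In RecommendJob itself the
# same membership check would go at the top of the mapper's map() method.

def load_user_ids(path):
    """Read one numeric user ID per line into a set for O(1) lookups."""
    with open(path) as f:
        return {int(line.strip()) for line in f if line.strip()}

def recommend_for_subset(all_user_vectors, wanted_ids, score_top_n):
    """Score only the users whose ID appears in wanted_ids."""
    results = {}
    for user_id, user_vector in all_user_vectors.items():
        if user_id not in wanted_ids:
            continue  # cheap set lookup skips the costly scoring step
        results[user_id] = score_top_n(user_vector)
    return results
```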


2013/3/20 Han JU <ju.han.felix@gmail.com>

> Hi Sebastian,
>
> I've tried the svn trunk. Hadoop constantly complains about memory with
> "out of memory" errors.
> The datanode has 4 physical cores, which hyper-threading presents as 16
> logical cores, so I set --numThreadsPerSolver to 16, and that seems to
> cause the memory problem.
> How did you set your mapred.child.java.opts? Given that we allow only one
> mapper, should it be nearly the whole system memory?
>
> Thanks!
>
>
> 2013/3/19 Sebastian Schelter <ssc@apache.org>
>
>> Hi JU,
>>
>> We recently rewrote the factorization code; it should be much faster
>> now. Use the current trunk, make Hadoop schedule only one mapper per
>> machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make it reuse
>> the JVMs, and add the parameter --numThreadsPerSolver with the number of
>> cores that you want to use per machine (use all if you can).
>>
>> I got astonishing results running the code like this on a 26-machine
>> cluster with the Netflix dataset (100M data points) and the Yahoo Songs
>> dataset (700M data points).
>>
>> Let me know if you need more information.
>>
>> Best,
>> Sebastian
>>
>> On 19.03.2013 15:31, Han JU wrote:
>> > Thanks Sebastian and Sean, I will dig more into the paper.
>> > From a quick try on a small part of the data, a larger alpha (~40)
>> > seems to get me a better result.
>> > Do you have an idea how long ParallelALS will take on the complete
>> > 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
>> > have at my disposal has 5 nodes and can factorize MovieLens 10M in
>> > about 13 minutes.
>> >
>> >
>> > 2013/3/18 Sebastian Schelter <ssc@apache.org>
>> >
>> >> You should also be aware that the alpha parameter comes from a formula
>> >> the authors introduce to measure the "confidence" in the observed
>> >> values:
>> >>
>> >> confidence = 1 + alpha * observed_value
>> >>
>> >> You can also change that formula in the code to something you find a
>> >> better fit; the paper even suggests alternative variants.
>> >>
>> >> Best,
>> >> Sebastian
>> >>
>> >>
>> >> On 18.03.2013 18:06, Han JU wrote:
>> >>> Thanks for the quick responses.
>> >>>
>> >>> Yes, it's that dataset. What I'm using is triples of "user_id song_id
>> >>> play_times" for ~1M users. No audio features, just plain-text triples.
>> >>>
>> >>> It seems to me that the "implicit feedback" paper matches this
>> >>> dataset well: no explicit ratings, but counts of listens to a song.
>> >>>
>> >>> Thank you Sean for the alpha value. I think they use big numbers
>> >>> because the values in their R matrix are big.
>> >>>
>> >>>
>> >>> 2013/3/18 Sebastian Schelter <ssc.open@googlemail.com>
>> >>>
>> >>>> JU,
>> >>>>
>> >>>> are you referring to this dataset?
>> >>>>
>> >>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>> >>>>
>> >>>> On 18.03.2013 17:47, Sean Owen wrote:
>> >>>>> One word of caution: there are at least two papers on ALS, and they
>> >>>>> define lambda differently. I think you are talking about
>> >>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>> >>>>>
>> >>>>> I've been working with some folks who point out that alpha=40 seems
>> >>>>> to be too high for most data sets. After running some tests on
>> >>>>> common data sets, alpha=1 looks much better. YMMV.
>> >>>>>
>> >>>>> In the end you have to evaluate these two parameters, and the number
>> >>>>> of features, across a range to determine what's best.
>> >>>>>
>> >>>>> Is this data set not a bunch of audio features? I am not sure it
>> >>>>> works for ALS, not naturally at least.
>> >>>>>
>> >>>>>
>> >>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.felix@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> I'm wondering whether someone has tried the ParallelALS implicit
>> >>>>>> feedback job on the Million Song dataset? Any pointers on alpha and
>> >>>>>> lambda?
>> >>>>>>
>> >>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>> >>>>>> the r values in their matrix are. They said it is based on the time
>> >>>>>> units for which users have watched the show, so it may be big.
>> >>>>>>
>> >>>>>> Many thanks!
>> >>>>>> --
>> >>>>>> JU Han
>> >>>>>>
>> >>>>>> UTC - Université de Technologie de Compiègne
>> >>>>>> GI06 - Fouille de Données et Décisionnel
>> >>>>>>
>> >>>>>> +33 0619608888
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>>
>>
>
>
> --
> JU Han
>
> Software Engineer Intern @ KXEN Inc.
> UTC - Université de Technologie de Compiègne
> GI06 - Fouille de Données et Décisionnel
>
> +33 0619608888
>



-- 
JU Han

Software Engineer Intern @ KXEN Inc.
UTC - Université de Technologie de Compiègne
GI06 - Fouille de Données et Décisionnel

+33 0619608888
