mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: ALS-WR on Million Song dataset
Date Wed, 20 Mar 2013 18:17:04 GMT
Hi JU,

I reworked the RecommenderJob in a similar way to the ALS job. Can you
give it a try?

You have to try the patch from
https://issues.apache.org/jira/browse/MAHOUT-1169

It introduces a new parameter to RecommenderJob called --numThreads. The
job should be configured similarly to the ALS job.
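
An untested sketch of an invocation with the patch applied (the jar name,
paths and the --numRecommendations value are placeholders; the flag names
other than --numThreads are from memory, so double-check them against the
code):

hadoop jar mahout-core-job.jar \
  org.apache.mahout.cf.taste.hadoop.als.RecommenderJob \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  --input /path/to/userRatings \
  --userFeatures /path/to/als-output/U \
  --itemFeatures /path/to/als-output/M \
  --numRecommendations 500 \
  --numThreads 16 \
  --output /path/to/recommendations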

/s


On 20.03.2013 12:38, Han JU wrote:
> Thanks again Sebastian and Sean, I set -Xmx4000m for mapred.child.java.opts
> and 8 threads for each mapper. Now the job runs smoothly and the whole
> factorization finishes in 45 minutes. With your settings I think it should
> be even faster.
> 
> One more thing: the RecommenderJob is kind of slow (it computes
> recommendations for all users). For example, I want a list of the top 500
> items to recommend. Any pointers on how to modify the job code so that it
> consults a file and calculates recommendations only for the user IDs in
> that file?
> 
> 
> 2013/3/20 Han JU <ju.han.felix@gmail.com>
> 
>> Hi Sebastian,
>>
>> I've tried the svn trunk. Hadoop constantly complains about memory with
>> "out of memory" errors.
>> The datanode has 4 physical cores, and with hyper-threading it has 16
>> logical cores, so I set --numThreadsPerSolver to 16, and that seems to
>> cause the memory problem.
>> How did you set your mapred.child.java.opts? Given that we allow only one
>> mapper, should that be nearly the whole system memory?
>>
>> Thanks!
>>
>>
>> 2013/3/19 Sebastian Schelter <ssc@apache.org>
>>
>>> Hi JU,
>>>
>>> We recently rewrote the factorization code; it should be much faster
>>> now. You should use the current trunk, make Hadoop schedule only one
>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1), make
>>> it reuse the JVMs, and add the parameter --numThreadsPerSolver with the
>>> number of cores that you want to use per machine (use all if you can).
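>>>
>>> For the factorization itself, roughly like this (the driver class and
>>> the two -D properties are real; the jar name, paths and remaining flag
>>> values are placeholders, so adapt them to your setup):
>>>
>>> hadoop jar mahout-core-job.jar \
>>>   org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
>>>   -Dmapred.tasktracker.map.tasks.maximum=1 \
>>>   -Dmapred.job.reuse.jvm.num.tasks=-1 \
>>>   --input /path/to/triples --output /path/to/als-output \
>>>   --numFeatures 20 --numIterations 10 \
>>>   --implicitFeedback true --alpha 40 --lambda 0.1 \
>>>   --numThreadsPerSolver 16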
>>>
>>> I got astonishing results running the code like this on a 26-machine
>>> cluster with the Netflix dataset (100M datapoints) and the Yahoo Songs
>>> dataset (700M datapoints).
>>>
>>> Let me know if you need more information.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 19.03.2013 15:31, Han JU wrote:
>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>> From a quick try on a small part of the data, it seems a larger alpha
>>>> (~40) gets me a better result.
>>>> Do you have an idea how long ParallelALS will take on the complete
>>>> 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
>>>> have at my disposal has 5 nodes and can factorize MovieLens 10M in
>>>> about 13 minutes.
>>>>
>>>>
>>>> 2013/3/18 Sebastian Schelter <ssc@apache.org>
>>>>
>>>>> You should also be aware that the alpha parameter comes from a formula
>>>>> the authors introduce to measure the "confidence" in the observed
>>>>> values:
>>>>>
>>>>> confidence = 1 + alpha * observed_value
>>>>>
>>>>> You can also change that formula in the code to something you find
>>>>> more fitting; the paper even suggests alternative variants.
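>>>>>
>>>>> For instance, the log-based variant from the paper (with epsilon a
>>>>> tunable constant) looks, as far as I remember, like:
>>>>>
>>>>> confidence = 1 + alpha * log(1 + observed_value / epsilon)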
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>> Thanks for quick responses.
>>>>>>
>>>>>> Yes, it's that dataset. What I'm using is triples of "user_id song_id
>>>>>> play_times", for ~1M users. No audio features, just plain-text triples.
>>>>>>
>>>>>> It seems to me that the paper about "implicit feedback" matches this
>>>>>> dataset well: no explicit ratings, but counts of how many times a user
>>>>>> listened to a song.
>>>>>>
>>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>>> because the values in their R matrix are big.
>>>>>>
>>>>>>
>>>>>> 2013/3/18 Sebastian Schelter <ssc.open@googlemail.com>
>>>>>>
>>>>>>> JU,
>>>>>>>
>>>>>>> are you referring to this dataset?
>>>>>>>
>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>
>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>> One word of caution: there are at least two papers on ALS, and they
>>>>>>>> define lambda differently. I think you are talking about
>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>
>>>>>>>> I've been working with some folks who point out that alpha=40 seems
>>>>>>>> to be too high for most data sets. After running some tests on
>>>>>>>> common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>
>>>>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>>>>> features, across a range to determine what's best, with a sweep like
>>>>>>>> the one sketched below.
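>>>>>>>>
>>>>>>>> A rough sketch of such a sweep (the driver class exists in Mahout;
>>>>>>>> the jar name, paths and exact flag names are from memory, so
>>>>>>>> double-check them):
>>>>>>>>
>>>>>>>> for alpha in 1 5 10 40; do
>>>>>>>>   hadoop jar mahout-core-job.jar \
>>>>>>>>     org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob \
>>>>>>>>     --input /path/to/train --output /path/to/out-alpha-$alpha \
>>>>>>>>     --implicitFeedback true --alpha $alpha --lambda 0.1 \
>>>>>>>>     --numFeatures 20 --numIterations 10
>>>>>>>> done
>>>>>>>>
>>>>>>>> Then score each factorization against held-out data and keep the
>>>>>>>> best setting.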
>>>>>>>>
>>>>>>>> Is this data set not a bunch of audio features? I am not sure it
>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.felix@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm wondering whether someone has tried the ParallelALS with
>>>>>>>>> implicit feedback job on the Million Song dataset? Any pointers on
>>>>>>>>> alpha and lambda?
>>>>>>>>>
>>>>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what
>>>>>>>>> their r values in the matrix are. They said it is based on the
>>>>>>>>> number of time units users have watched a show, so maybe it's big.
>>>>>>>>>
>>>>>>>>> Many thanks!
>>>>>>>>> --
>>>>>>>>> *JU Han*
>>>>>>>>>
>>>>>>>>> UTC   -  Université de Technologie de Compiègne
>>>>>>>>> *     **GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>
>>>>>>>>> +33 0619608888
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>> *JU Han*
>>
>> Software Engineer Intern @ KXEN Inc.
>> UTC   -  Université de Technologie de Compiègne
>> *     **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888
>>
> 
> 
> 

