mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: ALS-WR on Million Song dataset
Date Tue, 19 Mar 2013 15:07:45 GMT
Hi JU,

We recently rewrote the factorization code, and it should be much faster
now. Use the current trunk, make Hadoop schedule only one mapper per
machine (with -Dmapred.tasktracker.map.tasks.maximum=1), enable JVM
reuse, and set the parameter --numThreadsPerSolver to the number of
cores you want to use per machine (use all of them if you can).
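
A minimal sketch of such an invocation, assuming a Hadoop 1.x cluster and
the `mahout` launcher on the PATH. The paths, core count, feature count and
iteration count are placeholders, the alpha and lambda values are simply the
paper's numbers quoted further down in this thread, and
`mapred.job.reuse.jvm.num.tasks=-1` is the Hadoop 1.x property assumed here
for JVM reuse. The script only prints the command unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Placeholder values (assumptions): adjust to your cluster and data.
INPUT=/path/to/triples        # HDFS input with user,item,value lines
OUTPUT=/path/to/als-output    # HDFS output directory
CORES=8                       # cores to use per machine
ALPHA=40                      # starting point only; tune it (alpha=1 was also suggested)
LAMBDA=150                    # starting point only; tune it

# Generic Hadoop -D options go before the job-specific options.
CMD="mahout parallelALS \
  -Dmapred.tasktracker.map.tasks.maximum=1 \
  -Dmapred.job.reuse.jvm.num.tasks=-1 \
  --input $INPUT --output $OUTPUT \
  --implicitFeedback true --alpha $ALPHA --lambda $LAMBDA \
  --numFeatures 20 --numIterations 10 \
  --numThreadsPerSolver $CORES"

# Print the command by default; run it only when DRY_RUN=0 is set.
if [ "${DRY_RUN:-1}" = "0" ]; then
  $CMD
else
  echo "$CMD"
fi
```

Printing first makes it easy to check the flags before submitting a long
job to the cluster.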

I got astonishingly good results running the code this way on a
26-machine cluster with the Netflix dataset (100M data points) and the
Yahoo Songs dataset (700M data points).

Let me know if you need more information.

Best,
Sebastian

On 19.03.2013 15:31, Han JU wrote:
> Thanks Sebastian and Sean, I will dig more into the paper.
> From a quick try on a small part of the data, a larger alpha (~40)
> seems to give me a better result.
> Do you have an idea how long ParallelALS will take on the complete
> 700 MB dataset? It contains ~48 million triples. The Hadoop cluster I
> have access to has 5 nodes and can factorize MovieLens 10M in about 13 min.
> 
> 
> 2013/3/18 Sebastian Schelter <ssc@apache.org>
> 
>> You should also be aware that the alpha parameter comes from a formula
>> the authors introduce to measure the "confidence" in the observed values:
>>
>> confidence = 1 + alpha * observed_value
>>
>> You can also change that formula in the code to something you find
>> more suitable; the paper even suggests alternative variants.
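>>
>> As a sketch, the linear formula above and the logarithmic alternative
>> given in the paper (Hu, Koren and Volinsky) can be written as follows.
>> The notation is assumed here, following the paper: r_ui is the observed
>> value for user u and item i, c_ui the resulting confidence, and epsilon
>> an extra scaling parameter that has to be tuned.
>>
>> ```latex
>> c_{ui} = 1 + \alpha \, r_{ui}
>> \qquad\text{or}\qquad
>> c_{ui} = 1 + \alpha \log\!\left(1 + \frac{r_{ui}}{\epsilon}\right)
>> ```
>>
>> The logarithmic form grows much more slowly for large r_ui, which may
>> help when the observed values span several orders of magnitude.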
>>
>> Best,
>> Sebastian
>>
>>
>> On 18.03.2013 18:06, Han JU wrote:
>>> Thanks for the quick responses.
>>>
>>> Yes, it's that dataset. What I'm using is triples of "user_id song_id
>>> play_times" from ~1M users. No audio data, just plain-text triples.
>>>
>>> It seems to me that the paper about "implicit feedback" matches this
>>> dataset well: no explicit ratings, just the number of times a song was
>>> listened to.
>>>
>>> Thank you Sean for the alpha value. I think they use big numbers
>>> because the values in their R matrix are big.
>>>
>>>
>>> 2013/3/18 Sebastian Schelter <ssc.open@googlemail.com>
>>>
>>>> JU,
>>>>
>>>> are you referring to this dataset?
>>>>
>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>
>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>> One word of caution: there are at least two papers on ALS, and they
>>>>> define lambda differently. I think you are talking about "Collaborative
>>>>> Filtering for Implicit Feedback Datasets".
>>>>>
>>>>> I've been working with some folks who point out that alpha=40 seems to
>>>>> be too high for most data sets. After running some tests on common data
>>>>> sets, alpha=1 looks much better. YMMV.
>>>>>
>>>>> In the end you have to evaluate these two parameters, and the # of
>>>>> features, across a range to determine what's best.
>>>>>
>>>>> Is this data set not a bunch of audio features? I am not sure it works
>>>>> for ALS, not naturally at least.
>>>>>
>>>>>
>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU <ju.han.felix@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm wondering whether anyone has tried the ParallelALS job with
>>>>>> implicit feedback on the Million Song dataset. Any pointers on alpha
>>>>>> and lambda?
>>>>>>
>>>>>> In the paper alpha is 40 and lambda is 150, but I don't know what the
>>>>>> r values in their matrix are. They say the values are based on the time
>>>>>> units that users have watched a show, so maybe they are big.
>>>>>>
>>>>>> Many thanks!
>>>>>> --
>>>>>> *JU Han*
>>>>>>
>>>>>> UTC   -  Université de Technologie de Compiègne
>>>>>> *     **GI06 - Fouille de Données et Décisionnel*
>>>>>>
>>>>>> +33 0619608888
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
> 

