mahout-user mailing list archives

From Sebastian Schelter <...@apache.org>
Subject Re: Need for a distributed SVDRecommender
Date Wed, 24 Nov 2010 07:34:15 GMT
Hi Sanjib,

MAHOUT-542 uses a different algorithmic approach to factorize the matrix,
as described in "Large-scale Parallel Collaborative Filtering for the
Netflix Prize":
http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
It is not related to MAHOUT-371.
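
In case a concrete example helps, here is a rough, self-contained sketch of
the per-user least-squares step that the ALS scheme in that paper alternates
with the corresponding per-item step. This is not the MAHOUT-542 code; the
class and method names are made up for illustration, and the real job runs
these solves in parallel over Hadoop.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class AlsSketch {

  /** Solves the small dense system A x = b by Gaussian elimination. */
  static double[] solve(double[][] a, double[] b) {
    int n = b.length;
    for (int p = 0; p < n; p++) {
      // partial pivoting
      int max = p;
      for (int i = p + 1; i < n; i++) {
        if (Math.abs(a[i][p]) > Math.abs(a[max][p])) {
          max = i;
        }
      }
      double[] rowTmp = a[p]; a[p] = a[max]; a[max] = rowTmp;
      double bTmp = b[p]; b[p] = b[max]; b[max] = bTmp;
      for (int i = p + 1; i < n; i++) {
        double factor = a[i][p] / a[p][p];
        b[i] -= factor * b[p];
        for (int j = p; j < n; j++) {
          a[i][j] -= factor * a[p][j];
        }
      }
    }
    double[] x = new double[n];
    for (int i = n - 1; i >= 0; i--) {
      double sum = 0.0;
      for (int j = i + 1; j < n; j++) {
        sum += a[i][j] * x[j];
      }
      x[i] = (b[i] - sum) / a[i][i];
    }
    return x;
  }

  /**
   * Recomputes one user's factor vector with the item factors held fixed:
   *   u = (M_u^T M_u + lambda * n_u * I)^-1  M_u^T r_u
   * where M_u only contains the factor vectors of the items this user
   * actually rated -- unobserved cells never enter the sums, which is why
   * the partially specified rating matrix is not a problem here.
   */
  static double[] updateUser(Map<Integer, Double> userRatings,     // itemId -> rating
                             Map<Integer, double[]> itemFactors,   // itemId -> factor vector
                             int numFeatures, double lambda) {
    double[][] ata = new double[numFeatures][numFeatures];
    double[] atb = new double[numFeatures];
    for (Map.Entry<Integer, Double> rating : userRatings.entrySet()) {
      double[] m = itemFactors.get(rating.getKey());
      for (int i = 0; i < numFeatures; i++) {
        atb[i] += m[i] * rating.getValue();
        for (int j = 0; j < numFeatures; j++) {
          ata[i][j] += m[i] * m[j];
        }
      }
    }
    // weighted regularization: lambda times the number of ratings of this user
    for (int i = 0; i < numFeatures; i++) {
      ata[i][i] += lambda * userRatings.size();
    }
    return solve(ata, atb);
  }

  public static void main(String[] args) {
    Map<Integer, double[]> itemFactors = new HashMap<Integer, double[]>();
    itemFactors.put(1, new double[] { 0.1, 0.9 });
    itemFactors.put(2, new double[] { 0.8, 0.2 });

    Map<Integer, Double> userRatings = new HashMap<Integer, Double>();
    userRatings.put(1, 4.0);   // the user rated items 1 and 2 ...
    userRatings.put(2, 2.0);   // ... all other items are simply absent

    System.out.println(Arrays.toString(updateUser(userRatings, itemFactors, 2, 0.05)));
  }
}

The item update is symmetric (swap the roles of users and items), and one
full iteration just alternates the two until the factors stop changing much.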

On 24.11.2010 07:28, Sanjib Kumar Das wrote:
>  From what I understand, MAHOUT-371 tries to address the
> DistributedSVDRecommenderJob. Is it fully ready for use?
>
> @Sebastian: The above recommender uses the DistributedLanczosSolver to
> compute the SVD. So, should the distributed matrix factorization (MAHOUT-542)
> you were talking about be integrated with it instead?
>
> I am slightly confused...
> On Fri, Nov 19, 2010 at 4:32 PM, Ted Dunning<ted.dunning@gmail.com>  wrote:
>
>> On Fri, Nov 19, 2010 at 2:27 PM, Sebastian Schelter<ssc@apache.org>
>> wrote:
>>
>>>>> Can I use the new LanczosSolver to achieve this?
>>>
>>> The paper "Large-scale Parallel Collaborative Filtering for the Netflix
>>> Prize" says that you can't use Lanczos to factorize a rating matrix, as
>>> it is only partially specified. However, someone with more mathematical
>>> expertise than me should validate that statement; I hope I didn't get it
>>> wrong :)
>>>
>> You correctly quoted the statement. But I don't think that the statement
>> is entirely correct. The difference in practice isn't all that big a deal.
>>
>>> Ted is working on LatentFactorLogLinear models in MAHOUT-525, which can
>>> be used for recommendations too and should be superior to the approach
>>> of MAHOUT-542. They're not distributed, but in the paper describing them
>>> the authors state that they could train on the 1M MovieLens dataset in
>>> 7 minutes, so they should be fast enough for your test case.
>>>
>> This is where I would push for recommendations. I have a preliminary
>> implementation available on github, but I don't think it is ready to
>> commit. It does do roughly what it is supposed to do (on one test) but I
>> don't have enough runtime with it to have any level of confidence yet.
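
To spell out the "partially specified" point from the quoted exchange, in my
own rough notation (not taken verbatim from the paper): a truncated SVD via
Lanczos minimizes the reconstruction error over every cell of the matrix,

  min over U, M of   sum over all (i,j) of  (R_ij - u_i . m_j)^2

so the unobserved cells have to be given some value (in practice 0) and the
factorization spends capacity reproducing those zeros. The ALS formulation in
the paper only sums over the observed cells and adds weighted regularization,

  min over U, M of   sum over observed (i,j) of  (R_ij - u_i . m_j)^2
                     + lambda * ( sum_i n_u_i * ||u_i||^2  +  sum_j n_m_j * ||m_j||^2 )

where n_u_i and n_m_j count the ratings of user i and item j. That is why it
copes with a rating matrix that is only partially specified.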

