From: Sebastian Schelter
Reply-To: ssc@apache.org
To: user@mahout.apache.org
Date: Wed, 20 Mar 2013 19:17:04 +0100
Subject: Re: ALS-WR on Million Song dataset

Hi JU,

I reworked the RecommenderJob in a similar way to the ALS job. Can you
give it a try? You have to apply the patch from
https://issues.apache.org/jira/browse/MAHOUT-1169

It introduces a new parameter to RecommenderJob called --numThreads.
The job should be configured in the same way as the ALS job.
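A rough, untested sketch of how I'd expect the call to look once the
patch is applied (the paths are placeholders, and apart from
--numThreads the options are those of the existing RecommenderJob, so
double-check everything against the patch):

  bin/mahout recommendfactorized \
    --input /path/to/als-output/userRatings/ \
    --userFeatures /path/to/als-output/U/ \
    --itemFeatures /path/to/als-output/M/ \
    --numRecommendations 500 \
    --maxRating 5 \
    --numThreads 8 \
    --output /path/to/recommendations/

--numThreads should match the number of cores you give each mapper,
analogous to --numThreadsPerSolver on the factorization side.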
/s

On 20.03.2013 12:38, Han JU wrote:
> Thanks again Sebastian and Sean, I set -Xmx4000m for
> mapred.child.java.opts and 8 threads for each mapper. Now the job runs
> smoothly and the whole factorization finishes in 45 minutes. With your
> settings I think it should be even faster.
>
> One more thing: the RecommenderJob is kind of slow (for all users).
> For example, I want a list of the top 500 items to recommend. Any
> pointers on how to modify the job code so that it consults a file and
> then calculates recommendations only for the user IDs in that file?
>
>
> 2013/3/20 Han JU
>
>> Hi Sebastian,
>>
>> I've tried the svn trunk. Hadoop constantly complains about memory
>> with "out of memory" errors.
>> The datanode has 4 physical cores, and with hyper-threading 16
>> logical cores, so I set --numThreadsPerSolver to 16, and that seems
>> to cause the memory problems.
>> How do you set your mapred.child.java.opts? Given that we allow only
>> one mapper, should it be nearly the whole system memory?
>>
>> Thanks!
>>
>>
>> 2013/3/19 Sebastian Schelter
>>
>>> Hi JU,
>>>
>>> We recently rewrote the factorization code; it should be much faster
>>> now. You should use the current trunk, make Hadoop schedule only one
>>> mapper per machine (with -Dmapred.tasktracker.map.tasks.maximum=1),
>>> make it reuse the JVMs, and add the parameter --numThreadsPerSolver
>>> with the number of cores that you want to use per machine (use all
>>> if you can). A sketch of such an invocation follows below.
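>>>
>>> For example (an untested sketch; the paths and the alpha, lambda,
>>> feature and iteration values are placeholders you have to tune for
>>> your data, and mapred.job.reuse.jvm.num.tasks=-1 is the Hadoop 1.x
>>> setting for unlimited JVM reuse):
>>>
>>>   bin/mahout parallelALS \
>>>     -Dmapred.tasktracker.map.tasks.maximum=1 \
>>>     -Dmapred.job.reuse.jvm.num.tasks=-1 \
>>>     -Dmapred.child.java.opts=-Xmx4000m \
>>>     --input /path/to/triples.csv \
>>>     --output /path/to/als-output \
>>>     --implicitFeedback true \
>>>     --alpha 40 \
>>>     --lambda 0.1 \
>>>     --numFeatures 20 \
>>>     --numIterations 10 \
>>>     --numThreadsPerSolver 8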
>>>
>>> I got astonishing results running the code like this on a 26-machine
>>> cluster on the Netflix dataset (100M data points) and the Yahoo
>>> Songs dataset (700M data points).
>>>
>>> Let me know if you need more information.
>>>
>>> Best,
>>> Sebastian
>>>
>>> On 19.03.2013 15:31, Han JU wrote:
>>>> Thanks Sebastian and Sean, I will dig more into the paper.
>>>> From a quick try on a small part of the data, it seems a larger
>>>> alpha (~40) gets me a better result.
>>>> Do you have an idea how long ParallelALS will take for the complete
>>>> 700MB dataset? It contains ~48 million triples. The Hadoop cluster
>>>> at my disposal has 5 nodes and can factorize the MovieLens 10M
>>>> dataset in about 13 minutes.
>>>>
>>>>
>>>> 2013/3/18 Sebastian Schelter
>>>>
>>>>> You should also be aware that the alpha parameter comes from a
>>>>> formula the authors introduce to measure the "confidence" in the
>>>>> observed values:
>>>>>
>>>>> confidence = 1 + alpha * observed_value
>>>>>
>>>>> (So with alpha = 40, a song played 3 times already gets a
>>>>> confidence of 1 + 40 * 3 = 121.)
>>>>>
>>>>> You can also change that formula in the code to something that
>>>>> fits your data better; the paper even suggests alternative
>>>>> variants.
>>>>>
>>>>> Best,
>>>>> Sebastian
>>>>>
>>>>>
>>>>> On 18.03.2013 18:06, Han JU wrote:
>>>>>> Thanks for the quick responses.
>>>>>>
>>>>>> Yes, it's that dataset. What I'm using are triplets of "user_id
>>>>>> song_id play_times" for ~1M users. No audio features, just plain
>>>>>> text triples.
>>>>>>
>>>>>> It seems to me that the "implicit feedback" paper matches this
>>>>>> dataset well: no explicit ratings, but counts of how often a user
>>>>>> listened to a song.
>>>>>>
>>>>>> Thank you Sean for the alpha value. I think they use big numbers
>>>>>> because the values in their R matrix are big.
>>>>>>
>>>>>>
>>>>>> 2013/3/18 Sebastian Schelter
>>>>>>
>>>>>>> JU,
>>>>>>>
>>>>>>> are you referring to this dataset?
>>>>>>>
>>>>>>> http://labrosa.ee.columbia.edu/millionsong/tasteprofile
>>>>>>>
>>>>>>> On 18.03.2013 17:47, Sean Owen wrote:
>>>>>>>> One word of caution: there are at least two papers on ALS, and
>>>>>>>> they define lambda differently. I think you are talking about
>>>>>>>> "Collaborative Filtering for Implicit Feedback Datasets".
>>>>>>>>
>>>>>>>> I've been working with some folks who point out that alpha=40
>>>>>>>> seems to be too high for most data sets. After running some
>>>>>>>> tests on common data sets, alpha=1 looks much better. YMMV.
>>>>>>>>
>>>>>>>> In the end you have to evaluate these two parameters, and the
>>>>>>>> number of features, across a range to determine what's best.
>>>>>>>>
>>>>>>>> Isn't this data set a bunch of audio features? I am not sure it
>>>>>>>> works for ALS, not naturally at least.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Mar 18, 2013 at 12:39 PM, Han JU wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm wondering, has someone tried the ParallelALS implicit-
>>>>>>>>> feedback job on the Million Song dataset? Some pointers on
>>>>>>>>> alpha and lambda?
>>>>>>>>>
>>>>>>>>> In the paper, alpha is 40 and lambda is 150, but I don't know
>>>>>>>>> what the r values in their matrix are. They said the values
>>>>>>>>> are based on the time units for which users have watched the
>>>>>>>>> show, so they may be big.
>>>>>>>>>
>>>>>>>>> Many thanks!
>>>>>>>>> --
>>>>>>>>> *JU Han*
>>>>>>>>>
>>>>>>>>> UTC - Université de Technologie de Compiègne
>>>>>>>>> * **GI06 - Fouille de Données et Décisionnel*
>>>>>>>>>
>>>>>>>>> +33 0619608888
>>
>> --
>> *JU Han*
>>
>> Software Engineer Intern @ KXEN Inc.
>> UTC - Université de Technologie de Compiègne
>> * **GI06 - Fouille de Données et Décisionnel*
>>
>> +33 0619608888