Subject: Re: Mahout performance issues
From: Manuel Blechschmidt <Manuel.Blechschmidt@gmx.de>
Date: Fri, 2 Dec 2011 12:20:09 +0100
To: user@mahout.apache.org
Hello Daniel,

On 02.12.2011, at 12:02, Daniel Zohar wrote:

> Hi guys,
>
> ...
> I just ran the fix I proposed earlier and I got great results! The query
> time was reduced to about a third for the 'heavy users'. Before it was 1-5
> secs and now it's 0.5-1.5. The best part is that the accuracy level should
> remain exactly the same. I also believe it should reduce memory
> consumption, as the GenericBooleanPrefDataModel.preferenceForItems gets
> significantly smaller (in my case at least).

It would be great if you could measure your run-time performance and your
accuracy with the provided Mahout tools. In your case, because you only
have boolean feedback, precision and recall would make sense.

https://cwiki.apache.org/MAHOUT/recommender-documentation.html

RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
IRStatistics stats = evaluator.evaluate(builder, null, myModel, null, 3,
        RecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);

Here is some example code from me:

public void testEvaluateRecommender() {
    try {
        DataModel myModel = new MyModelImplementationDataModel();

        // Users: 12858
        // Items: 5467
        // MaxPreference: 85850.0
        // MinPreference: 50.0
        System.out.println("Users: " + myModel.getNumUsers());
        System.out.println("Items: " + myModel.getNumItems());
        System.out.println("MaxPreference: " + myModel.getMaxPreference());
        System.out.println("MinPreference: " + myModel.getMinPreference());

        RecommenderBuilder randomBased = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new RandomRecommender(model);
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder genericItemBased = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new GenericItemBasedRecommender(model,
                            new PearsonCorrelationSimilarity(model));
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder genericItemBasedCosine = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new GenericItemBasedRecommender(model,
                            new UncenteredCosineSimilarity(model));
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder genericItemBasedLikely = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                return new GenericItemBasedRecommender(model,
                        new LogLikelihoodSimilarity(model));
            }
        };

        RecommenderBuilder genericUserBasedNN3 = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new GenericUserBasedRecommender(model,
                            new NearestNUserNeighborhood(3,
                                    new PearsonCorrelationSimilarity(model), model),
                            new PearsonCorrelationSimilarity(model));
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder genericUserBasedNN20 = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new GenericUserBasedRecommender(model,
                            new NearestNUserNeighborhood(20,
                                    new PearsonCorrelationSimilarity(model), model),
                            new PearsonCorrelationSimilarity(model));
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder slopeOneBased = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new SlopeOneRecommender(model);
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        RecommenderBuilder svdBased = new RecommenderBuilder() {
            public Recommender buildRecommender(DataModel model) {
                try {
                    return new SVDRecommender(model,
                            new ALSWRFactorizer(model, 100, 0.3, 5));
                } catch (TasteException e) {
                    e.printStackTrace();
                    return null;
                }
            }
        };

        // Data set summary:
        // 12858 users
        // 121304 preferences
        RecommenderEvaluator evaluator =
                new AverageAbsoluteDifferenceRecommenderEvaluator();

        double evaluation = evaluator.evaluate(randomBased, null, myModel, 0.9, 1.0);
        // Evaluation of randomBased (baseline): 43045.380570443434
        // (RandomRecommender(model))
        System.out.println("Evaluation of randomBased (baseline): " + evaluation);

        // evaluation = evaluator.evaluate(genericItemBased, null, myModel, 0.9, 1.0);
        // Evaluation of ItemBased with Pearson Correlation: 315.5804958647985
        // (GenericItemBasedRecommender(model, PearsonCorrelationSimilarity(model)))
        // System.out.println("Evaluation of ItemBased with Pearson Correlation: " + evaluation);

        // evaluation = evaluator.evaluate(genericItemBasedCosine, null, myModel, 0.9, 1.0);
        // Evaluation of ItemBased with uncentered Cosine: 198.25393235323375
        // (GenericItemBasedRecommender(model, UncenteredCosineSimilarity(model)))
        // System.out.println("Evaluation of ItemBased with Uncentered Cosine: " + evaluation);

        evaluation = evaluator.evaluate(genericItemBasedLikely, null, myModel, 0.9, 1.0);
        // Evaluation of ItemBased with LogLikelihood: 176.45243607278724
        // (GenericItemBasedRecommender(model, LogLikelihoodSimilarity(model)))
        System.out.println("Evaluation of ItemBased with LogLikelihood: " + evaluation);

        // User based is slow and inaccurate:
        // evaluation = evaluator.evaluate(genericUserBasedNN3, null, myModel, 0.9, 1.0);
        // Evaluation of UserBased 3 with Pearson Correlation: 1774.9897130330407
        // took about 2 minutes
        // System.out.println("Evaluation of UserBased 3 with Pearson Correlation: " + evaluation);

        // evaluation = evaluator.evaluate(genericUserBasedNN20, null, myModel, 0.9, 1.0);
        // Evaluation of UserBased 20 with Pearson Correlation: 1329.137324225053
        // took about 3 minutes
        // System.out.println("Evaluation of UserBased 20 with Pearson Correlation: " + evaluation);

        // evaluation = evaluator.evaluate(slopeOneBased, null, myModel, 0.9, 1.0);
        // Evaluation of SlopeOne: 464.8989330869532 (SlopeOneRecommender(model))
        // System.out.println("Evaluation of SlopeOne: " + evaluation);

        // evaluation = evaluator.evaluate(svdBased, null, myModel, 0.9, 1.0);
        // Evaluation of SVD based: 378.9776153202042 (ALSWRFactorizer(model, 100, 0.3, 5))
        // took about 10 minutes to calculate on a MacBook Pro
        // System.out.println("Evaluation of SVD based: " + evaluation);
    } catch (TasteException e) {
        e.printStackTrace();
    }
}

> The fix is merely adding two lines of code to one of
> the GenericBooleanPrefDataModel constructors. See
> http://pastebin.com/K5PB68Et, the lines I added are #11 and #22.
>
> The only problem I see at the moment is that the similarity
> implementations are using the number of users per item in the
> item-item similarity calculation.
> This _can_ be mitigated by creating an
> additional Map in the DataModel which maps itemID to numUsers.
>
> What do you think about the proposed solution? Perhaps I am missing some
> other implications?
>
> Thanks!
>
> On Fri, Dec 2, 2011 at 12:51 AM, Sean Owen wrote:
>
>> (Agree, and the sampling happens at the user level now -- so if you sample
>> one of these users, it slows down a lot. The spirit of the proposed change
>> is to make sampling more fine-grained, at the individual item level. That
>> seems to certainly fix this.)
>>
>> On Thu, Dec 1, 2011 at 10:46 PM, Ted Dunning wrote:
>>
>>> This may or may not help much. My guess is that the improvement will be
>>> very modest.
>>>
>>> The most serious problem is going to be recommendations for anybody who
>>> has rated one of these excessively popular items. That item will bring
>>> in a huge number of other users and thus a huge number of items to
>>> consider. If you down-sample ratings of the prolific users and kill
>>> super-common items, I think you will see much more improvement than
>>> simply eliminating the singleton users.
>>>
>>> The basic issue is that cooccurrence-based algorithms have run time
>>> proportional to O(n_max^2), where n_max is the maximum number of items
>>> per user.
>>>
>>> On Thu, Dec 1, 2011 at 2:35 PM, Daniel Zohar wrote:
>>>
>>>> This is why I'm looking now into improving GenericBooleanPrefDataModel
>>>> to not take into account users which made one interaction under the
>>>> 'preferenceForItems' Map. What do you think about this approach?

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
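P.S. A minimal sketch of the itemID-to-numUsers map Daniel describes, assuming the per-user item IDs are available; the class and method names here are made up for illustration and this is not the actual pastebin patch:

```java
import java.util.HashMap;
import java.util.Map;

public class ItemPopularity {

    // Hypothetical helper: given each user's item IDs, count how many
    // users expressed a preference for each item. Similarity code could
    // then look up numUsers per item in O(1) instead of walking the
    // full list of users per item.
    public static Map<Long, Integer> countUsersPerItem(Map<Long, long[]> userToItemIDs) {
        Map<Long, Integer> itemToNumUsers = new HashMap<Long, Integer>();
        for (long[] itemIDs : userToItemIDs.values()) {
            for (long itemID : itemIDs) {
                Integer count = itemToNumUsers.get(itemID);
                itemToNumUsers.put(itemID, count == null ? 1 : count + 1);
            }
        }
        return itemToNumUsers;
    }
}
```

Such a map trades a little extra memory for keeping the popularity counts available even after the per-item user lists are trimmed or sampled.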