From: Sebastian Schelter <ssc.open@googlemail.com>
Date: Thu, 22 Jul 2010 15:21:10 +0200
To: user@mahout.apache.org
Subject: Re: How to combine boolean datamodel with datamodel

It's attached here: https://issues.apache.org/jira/browse/MAHOUT-445

If you want to use the test code you sent yesterday with the patch, you would need to change the way the recommender is created to:

    new GenericItemBasedRecommender(model, itemSimilarity,
        new AllUnknownItemsCandidateItemsStrategy())

--sebastian

On 22.07.2010 15:15, Young wrote:
> Hi Sebastian,
> Thank you. Where can we download the patch?
>
> ---Young
>
>> Hi all,
>>
>> I did a little refactoring today to be able to inject customized ways of
>> fetching the candidate items. I wrote another implementation that just
>> returns all items not yet rated by the user. This won't be suitable for
>> large datasets, but it did quite well for the GroupLens dataset (some
>> test results attached). I'm going to create a patch so you can have a
>> look at the refactoring, and if you decide to commit it, it could be a
>> suitable starting point for implementing Ted's proposed way of candidate
>> item fetching.
>>
>> Another advantage of the patch is that users could supply use-case-specific
>> implementations of candidate item fetching without having to subclass the
>> recommender of their choice.
>>
>> --sebastian
>>
>> Tests for random users with different candidate item fetching strategies
>> (GroupLens dataset):
>>
>> User 1063: found 3605 items in 2376ms (current approach); 3606 items in 1ms (all unknown items)
>> User 3596: found 3575 items in 1889ms (current approach); 3578 items in 2ms (all unknown items)
>> User 3300: found 3343 items in 6603ms (current approach); 3344 items in 0ms (all unknown items)
>> User 924:  found 3507 items in 4173ms (current approach); 3507 items in 4ms (all unknown items)
>> User 4505: found 3427 items in 4774ms (current approach); 3427 items in 1ms (all unknown items)
>> User 3378: found 3471 items in 4225ms (current approach); 3471 items in 0ms (all unknown items)
>> User 246:  found 3673 items in 730ms (current approach); 3677 items in 0ms (all unknown items)
>>
>> On 22.07.2010 02:00, Ted Dunning wrote:
>>
>>> This is a ubiquitous problem with co-occurrence algorithms, since they
>>> scale with the square of the number of occurrences of the most popular item.
>>>
>>> The good news is that you learn everything there is to learn about that item
>>> if you look at just a sample of the occurrences, so sampling is your
>>> friend. If there is temporal structure, I tend to bias the sample toward
>>> recent items.
>>>
>>> Regarding the size, I have generally had an arbitrary cutoff attached to a
>>> configuration knob in my production systems. It is probably reasonable to
>>> set this limit to something like max(100, 20*log(max(N_users, N_items))).
>>> This isn't really any less arbitrary, but it will probably never need
>>> tweaking in normal use.
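The "all items not yet rated by the user" strategy from Sebastian's patch can be sketched independently of the Taste classes. The class and method names below (`AllUnknownItemsSketch`, `candidateItems`) are illustrative only, not the actual Mahout API; the real implementation is in the MAHOUT-445 patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the candidate-item strategy from the patch: instead of walking
 * the preferences of every user who rated the same items (expensive for
 * popular items), simply propose every item the user has not rated yet.
 * Names here are illustrative; the real code lives in MAHOUT-445.
 */
public class AllUnknownItemsSketch {

    /** Candidates = item universe minus the items the user already rated. */
    static Set<Long> candidateItems(long userID,
                                    Set<Long> allItemIDs,
                                    Map<Long, Set<Long>> ratedItemsByUser) {
        Set<Long> candidates = new HashSet<>(allItemIDs);
        candidates.removeAll(ratedItemsByUser.getOrDefault(userID, Set.of()));
        return candidates;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> rated = new HashMap<>();
        rated.put(42L, Set.of(1L, 3L));
        // User 42 rated items 1 and 3; items 2 and 4 remain as candidates.
        Set<Long> c = candidateItems(42L, Set.of(1L, 2L, 3L, 4L), rated);
        System.out.println(c.size()); // prints 2
    }
}
```

This is why the timings above are near-constant: the work is one set difference per user, independent of how many co-raters the user's items have.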
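Ted's suggestion to sample the occurrences of very popular items, biased toward recent ones, can be made concrete with a small sketch. The specific policy below (always keep the newest half of the budget, sample the rest uniformly from older occurrences) is an assumption made for illustration, not how Ted's production systems do it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/**
 * Sketch of down-sampling the occurrence list of a very popular item.
 * The list is assumed to be ordered oldest-to-newest. The newest half of
 * the budget is always kept (recency bias), and the remainder is sampled
 * uniformly from the older occurrences. This exact policy is made up for
 * illustration.
 */
public class OccurrenceSampler {

    static List<Long> sample(List<Long> occurrences, int maxOccurrences, Random rng) {
        int n = occurrences.size();
        if (n <= maxOccurrences) {
            return new ArrayList<>(occurrences);
        }
        int keepRecent = maxOccurrences / 2;
        // Uniformly sample the remaining budget from the older portion...
        List<Long> older = new ArrayList<>(occurrences.subList(0, n - keepRecent));
        Collections.shuffle(older, rng);
        List<Long> result = new ArrayList<>(older.subList(0, maxOccurrences - keepRecent));
        // ...and always keep the most recent occurrences.
        result.addAll(occurrences.subList(n - keepRecent, n));
        return result;
    }

    public static void main(String[] args) {
        List<Long> occurrences = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            occurrences.add(i);
        }
        // 1000 occurrences capped at 100: 50 recent kept, 50 sampled from the rest.
        System.out.println(sample(occurrences, 100, new Random(42)).size()); // prints 100
    }
}
```

The point of the cap is that co-occurrence counting is quadratic in an item's occurrence count, so bounding the list bounds the worst case while a sample still characterizes the item well.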
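Ted's proposed limit of max(100, 20*log(max(N_users, N_items))) is cheap to compute. He did not specify the log base; the natural log is assumed below.

```java
/**
 * Ted's proposed occurrence cutoff: max(100, 20*log(max(N_users, N_items))).
 * The log base was not specified in the thread; the natural log is assumed.
 */
public class OccurrenceCutoff {

    static int cutoff(long nUsers, long nItems) {
        return (int) Math.max(100, 20 * Math.log(Math.max(nUsers, nItems)));
    }

    public static void main(String[] args) {
        // With a million users the knob lands around 276; small datasets
        // fall back to the floor of 100, so it rarely needs manual tweaking.
        System.out.println(cutoff(1_000_000, 5_000)); // prints 276
        System.out.println(cutoff(100, 100));         // prints 100
    }
}
```

Because the limit grows only logarithmically with the dataset, it stays in the low hundreds even for very large user or item counts, which is why it "will probably never need tweaking in normal use."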