From: Sebastian Schelter <ssc.open@googlemail.com>
Date: Thu, 22 Jul 2010 15:21:10 +0200
To: user@mahout.apache.org
Subject: Re: How to combine boolean datamodel with datamodel

It's attached here: https://issues.apache.org/jira/browse/MAHOUT-445

If you want to use the test code you sent yesterday with the patch, you would need to change the way the recommender is created to:

    new GenericItemBasedRecommender(model, itemSimilarity,
        new AllUnknownItemsCandidateItemsStrategy())

--sebastian

On 22.07.2010 15:15, Young wrote:
> Hi Sebastian,
> Thank you. Where can we download the patch?
>
> ---Young
>
>> Hi all,
>>
>> I did a little refactoring today to be able to inject customized ways of
>> fetching the candidate items. I wrote another implementation that just
>> returns all items not yet rated by the user. This won't be suitable for
>> large datasets, but it did quite well for the GroupLens dataset (some
>> test results attached). I'm going to create a patch so you can have a
>> look at the refactoring, and if you decide to commit it, it could be a
>> suitable starting point for implementing Ted's proposed way of candidate
>> item fetching.
>>
>> Another advantage of the patch is that users could supply use-case-specific
>> implementations of candidate item fetching without having to subclass the
>> recommender of their choice.
>>
>> --sebastian
>>
>> Tests for random users with different candidate item fetching strategies
>> (GroupLens dataset):
>>
>> User 1063: found 3605 items in 2376ms (current approach); 3606 items in 1ms (all unknown items)
>> User 3596: found 3575 items in 1889ms (current approach); 3578 items in 2ms (all unknown items)
>> User 3300: found 3343 items in 6603ms (current approach); 3344 items in 0ms (all unknown items)
>> User 924:  found 3507 items in 4173ms (current approach); 3507 items in 4ms (all unknown items)
>> User 4505: found 3427 items in 4774ms (current approach); 3427 items in 1ms (all unknown items)
>> User 3378: found 3471 items in 4225ms (current approach); 3471 items in 0ms (all unknown items)
>> User 246:  found 3673 items in 730ms (current approach); 3677 items in 0ms (all unknown items)
>>
>> On 22.07.2010 02:00, Ted Dunning wrote:
>>
>>> This is a ubiquitous problem with co-occurrence algorithms, since they
>>> scale with the square of the number of occurrences of the most popular item.
>>>
>>> The good news is that you learn everything there is to learn about that item
>>> if you look at just a sample of the occurrences, so sampling is your
>>> friend. If there is temporal structure, I tend to bias the sample toward
>>> recent items.
>>>
>>> Regarding the size, I have generally had an arbitrary cutoff attached to a
>>> configuration knob in my production systems. It is probably reasonable to
>>> set this limit to something like max(100, 20*log(max(N_users, N_items))).
>>> This isn't really any less arbitrary, but it will probably never need
>>> tweaking in normal use.
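The "all items not yet rated by the user" strategy from Sebastian's patch can be sketched independently of the Taste classes. The class and method names below (`AllUnknownItemsSketch`, `candidateItems`) are illustrative only, not the actual Mahout API; the real implementation is in the MAHOUT-445 patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of the candidate-item strategy from the patch: instead of walking
 * the preferences of every user who rated the same items (expensive for
 * popular items), simply propose every item the user has not rated yet.
 * Names here are illustrative; the real code lives in MAHOUT-445.
 */
public class AllUnknownItemsSketch {

    /** Candidates = item universe minus the items the user already rated. */
    static Set<Long> candidateItems(long userID,
                                    Set<Long> allItemIDs,
                                    Map<Long, Set<Long>> ratedItemsByUser) {
        Set<Long> candidates = new HashSet<>(allItemIDs);
        candidates.removeAll(ratedItemsByUser.getOrDefault(userID, Set.of()));
        return candidates;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> rated = new HashMap<>();
        rated.put(42L, Set.of(1L, 3L));
        // User 42 rated items 1 and 3; items 2 and 4 remain as candidates.
        Set<Long> c = candidateItems(42L, Set.of(1L, 2L, 3L, 4L), rated);
        System.out.println(c.size()); // prints 2
    }
}
```

This is why the timings above are near-constant: the work is one set difference per user, independent of how many co-raters the user's items have.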
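Ted's suggestion to sample the occurrences of very popular items, biased toward recent ones, can be made concrete with a small sketch. The specific policy below (always keep the newest half of the budget, sample the rest uniformly from older occurrences) is an assumption made for illustration, not how Ted's production systems do it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/**
 * Sketch of down-sampling the occurrence list of a very popular item.
 * The list is assumed to be ordered oldest-to-newest. The newest half of
 * the budget is always kept (recency bias), and the remainder is sampled
 * uniformly from the older occurrences. This exact policy is made up for
 * illustration.
 */
public class OccurrenceSampler {

    static List<Long> sample(List<Long> occurrences, int maxOccurrences, Random rng) {
        int n = occurrences.size();
        if (n <= maxOccurrences) {
            return new ArrayList<>(occurrences);
        }
        int keepRecent = maxOccurrences / 2;
        // Uniformly sample the remaining budget from the older portion...
        List<Long> older = new ArrayList<>(occurrences.subList(0, n - keepRecent));
        Collections.shuffle(older, rng);
        List<Long> result = new ArrayList<>(older.subList(0, maxOccurrences - keepRecent));
        // ...and always keep the most recent occurrences.
        result.addAll(occurrences.subList(n - keepRecent, n));
        return result;
    }

    public static void main(String[] args) {
        List<Long> occurrences = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            occurrences.add(i);
        }
        // 1000 occurrences capped at 100: 50 recent kept, 50 sampled from the rest.
        System.out.println(sample(occurrences, 100, new Random(42)).size()); // prints 100
    }
}
```

The point of the cap is that co-occurrence counting is quadratic in an item's occurrence count, so bounding the list bounds the worst case while a sample still characterizes the item well.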
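Ted's proposed limit of max(100, 20*log(max(N_users, N_items))) is cheap to compute. He did not specify the log base; the natural log is assumed below.

```java
/**
 * Ted's proposed occurrence cutoff: max(100, 20*log(max(N_users, N_items))).
 * The log base was not specified in the thread; the natural log is assumed.
 */
public class OccurrenceCutoff {

    static int cutoff(long nUsers, long nItems) {
        return (int) Math.max(100, 20 * Math.log(Math.max(nUsers, nItems)));
    }

    public static void main(String[] args) {
        // With a million users the knob lands around 276; small datasets
        // fall back to the floor of 100, so it rarely needs manual tweaking.
        System.out.println(cutoff(1_000_000, 5_000)); // prints 276
        System.out.println(cutoff(100, 100));         // prints 100
    }
}
```

Because the limit grows only logarithmically with the dataset, it stays in the low hundreds even for very large user or item counts, which is why it "will probably never need tweaking in normal use."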