Mailing-List: contact dev-help@mahout.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@mahout.apache.org
MIME-Version: 1.0
In-Reply-To: <51E703B2.7020406@uowmail.edu.au>
References: 
 <CAFj5n5E3Z4n8CbZqzugzD-Ljn-Kib2L42mnzCQMTerRgjk8uJQ@mail.gmail.com>
	<51E4D4A1.3040505@googlemail.com>
	<51E5646F.3010607@uowmail.edu.au>
	<CAJwFCa264A7HRO4YGsT5ptQCfxQC8HBcR_0NeXomA_PF__sEHg@mail.gmail.com>
	<CADHDM+YoHGskaagFfbgJnKNxdNmQoxn1zdk8iY0aKkhfkoSFig@mail.gmail.com>
	<51E703B2.7020406@uowmail.edu.au>
Date: Wed, 17 Jul 2013 13:58:17 -0700
Message-ID: 
 <CADHDM+ZLT-RGbDLCwTpXWwF+YwmMkX+00twMQabQ8UGRB9OOdg@mail.gmail.com>
Subject: Re: Regarding Online Recommenders
From: Sebastian Schelter <ssc@apache.org>
To: dev@mahout.apache.org
Content-Type: multipart/alternative; boundary=001a11c259f05c61e904e1bb5a36

--001a11c259f05c61e904e1bb5a36
Content-Type: text/plain; charset=UTF-8

Hi Peng,

I never wanted to discard the old interface; I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory-efficient implementations, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what it already does).

Then a recommender such as SGD could state that it only needs sequential
access to the preferences, and you can either feed it a DataModel (so we
don't break backwards compatibility) or a memory-efficient
sequential-access implementation.
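
A rough sketch of the split, to make the idea concrete. All names here
(Pref, SequentialPreferences, RandomAccessPreferences, ArrayPreferences,
ScanningModel, SequentialLearner) are made up for illustration and are not
existing Mahout classes; RandomAccessPreferences only stands in for the
role DataModel would play, and the "learner" just averages ratings in
place of a real SGD pass.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// All names below are hypothetical illustrations, not Mahout's actual API.

/** One (user, item, rating) triple. */
final class Pref {
  final long userID;
  final long itemID;
  final float value;

  Pref(long userID, long itemID, float value) {
    this.userID = userID;
    this.itemID = itemID;
    this.value = value;
  }
}

/** The small interface: sequential access only. */
interface SequentialPreferences extends Iterable<Pref> {
  int numPreferences();
}

/** Stand-in for DataModel: extends the small interface, adds random access. */
interface RandomAccessPreferences extends SequentialPreferences {
  int numPreferencesFromUser(long userID);
}

/** Compact sequential implementation backed by parallel primitive arrays. */
class ArrayPreferences implements SequentialPreferences {
  protected final long[] userIDs;
  protected final long[] itemIDs;
  protected final float[] values;

  ArrayPreferences(long[] userIDs, long[] itemIDs, float[] values) {
    if (userIDs.length != itemIDs.length || userIDs.length != values.length) {
      throw new IllegalArgumentException("arrays must have equal length");
    }
    this.userIDs = userIDs;
    this.itemIDs = itemIDs;
    this.values = values;
  }

  @Override
  public int numPreferences() {
    return values.length;
  }

  @Override
  public Iterator<Pref> iterator() {
    return new Iterator<Pref>() {
      private int pos = 0;

      @Override
      public boolean hasNext() {
        return pos < values.length;
      }

      @Override
      public Pref next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        Pref p = new Pref(userIDs[pos], itemIDs[pos], values[pos]);
        pos++;
        return p;
      }
    };
  }
}

/** Toy random-access implementation; a linear scan keeps the sketch short. */
final class ScanningModel extends ArrayPreferences
    implements RandomAccessPreferences {

  ScanningModel(long[] userIDs, long[] itemIDs, float[] values) {
    super(userIDs, itemIDs, values);
  }

  @Override
  public int numPreferencesFromUser(long userID) {
    int count = 0;
    for (long u : userIDs) {
      if (u == userID) {
        count++;
      }
    }
    return count;
  }
}

/** A learner that declares it needs nothing beyond sequential access. */
final class SequentialLearner {
  static double meanRating(SequentialPreferences prefs) {
    double sum = 0;
    for (Pref p : prefs) {
      sum += p.value;
    }
    return prefs.numPreferences() == 0 ? 0.0 : sum / prefs.numPreferences();
  }
}
```

Because SequentialLearner.meanRating only asks for SequentialPreferences,
it accepts both the compact ArrayPreferences and anything implementing the
richer RandomAccessPreferences, which is the backwards-compatibility point.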

Does that make sense to you?


2013/7/17 Peng Cheng <pc175@uowmail.edu.au>

> I see, OK, so we shouldn't use the old implementation. But I mean, the old
> interface doesn't have to be discarded. The discrepancy between your
> FactorizablePreferences and DataModel is that your model supports
> getPreferences(), which returns all preferences as an iterator, and
> DataModel supports a few old functions that return preferences for an
> individual user or item.
>
> My point is that it is not hard for each of them to implement what they
> lack: the old DataModel can implement getPreferences() just by a loop in
> an abstract class. Your new FactorizablePreferences can implement those old
> functions by a binary search that takes O(log n) time, or an interpolation
> search that takes O(log log n) time on average. The same goes for the
> online update. It will just be a matter of different speed and space, not
> of a different interface standard; we can use the old unit tests, old
> examples, old everything. And we will be more flexible in writing ensemble
> recommenders.
>
> Just a few thoughts, I'll have to validate the idea first before creating
> a new JIRA ticket.
>
> Yours Peng
>
>
>
> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>
>> I completely agree. Netflix is less than one gigabyte in a smart
>> representation; 12x more memory is a no-go. The techniques used in
>> FactorizablePreferences allow a much more memory-efficient representation,
>> tested on the KDD Music dataset, which is approx. 2.5 times Netflix and
>> fits into 3GB with that approach.
>>
>>
>> 2013/7/16 Ted Dunning <ted.dunning@gmail.com>
>>
>>> Netflix is a small dataset.  12G for that seems quite excessive.
>>>
>>> Note also that this is before you have done any work.
>>>
>>> Ideally, 100million observations should take << 1GB.
>>>
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <pc175@uowmail.edu.au>
>>> wrote:
>>>
>>>> The second idea is indeed splendid; we should separate the
>>>> time-complexity-first and space-complexity-first implementations. What
>>>> I'm not quite sure about is whether we really need to create two
>>>> interfaces instead of one. Personally, I think 12G of heap space is not
>>>> that high, right? Most new laptops can already handle that (emphasis on
>>>> laptop). And if we replace the hash map (the culprit of high memory
>>>> consumption) with list/linkedList, it would simply degrade the time
>>>> complexity to O(n) for a linear search, which is not too bad either. The
>>>> current DataModel is the result of careful thought and has undergone
>>>> extensive testing; it is easier to expand on top of it than to subvert
>>>> it.
>>>>
>>>
>
>

--001a11c259f05c61e904e1bb5a36--