mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peng Cheng <>
Subject Re: Regarding Online Recommenders
Date Fri, 19 Jul 2013 21:15:48 GMT

Just one simple question: Is the 
org.apache.mahout.math.BinarySearch.binarySearch() function an optimized 
version of Arrays.binarySearch()? If it is not, why implement it again?

Yours Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote:
> You are completely right, the simple interface would only be usable for
> readonly / batch-updatable recommenders. Online recommenders might need
> something different. I tried to widen the discussion here to discuss all
> kinds of API changes in the recommenders that would be necessary in the
> future.
> 2013/7/17 Peng Cheng <>
>> One thing that suddenly comes to my mind is that, for a simple interface
>> like FactorizablePreferences, maybe sequential READ in real time is
>> possible, but sequential WRITE in O(1) time is Utopia. Because you need to
>> flush out old preference with same user and item ID (in worst case it could
>> be an interpolation search), otherwise you are permitting a user rating an
>> item twice with different values. Considering how FileDataModel suppose to
>> work (new files flush old files), maybe using the simple interface has less
>> advantages than we used to believe.
>> On 13-07-17 04:58 PM, Sebastian Schelter wrote:
>>> Hi Peng,
>>> I never wanted to discard the old interface, I just wanted to split it up.
>>> I want to have a simple interface that only supports sequential access
>>> (and
>>> allows for very memory efficient implementions, e.g. by the use of
>>> primitive arrays). DataModel should *extend* this interface and provide
>>> sequential and random access (basically what is already does).
>>> Than a recommender such as SGD could state that it only needs sequential
>>> access to the preferences and you can either feed it a DataModel (so we
>>> don"t break backwards compatibility) or a memory efficient sequential
>>> access thingy.
>>> Does that make sense for you?
>>> 2013/7/17 Peng Cheng <>
>>>   I see, OK so we shouldn't use the old implementation. But I mean, the old
>>>> interface doesn't have to be discarded. The discrepancy between your
>>>> FactorizablePreferences and DataModel is that, your model supports
>>>> getPreferences(), which returns all preferences as an iterator, and
>>>> DataModel supports a few old functions that returns preferences for an
>>>> individual user or item.
>>>> My point is that, it is not hard for each of them to implement what they
>>>> lack of: old DataModel can implement getPreferences() just by a a loop in
>>>> abstract class. Your new FactorizablePreferences can implement those old
>>>> functions by a binary search that takes O(log n) time, or an
>>>> interpolation
>>>> search that takes O(log log n) time in average. So does the online
>>>> update.
>>>> It will just be a matter of different speed and space, but not different
>>>> interface standard, we can use old unit tests, old examples, old
>>>> everything. And we will be more flexible in writing ensemble recommender.
>>>> Just a few thoughts, I'll have to validate the idea first before creating
>>>> a new JIRA ticket.
>>>> Yours Peng
>>>> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>>>>   I completely agree, Netflix is less than one gigabye in a smart
>>>>> representation, 12x more memory is a nogo. The techniques used in
>>>>> FactorizablePreferences allow a much more memory efficient
>>>>> representation,
>>>>> tested on KDD Music dataset which is approx 2.5 times Netflix and fits
>>>>> into
>>>>> 3GB with that approach.
>>>>> 2013/7/16 Ted Dunning <>
>>>>>    Netflix is a small dataset.  12G for that seems quite excessive.
>>>>>> Note also that this is before you have done any work.
>>>>>> Ideally, 100million observations should take << 1GB.
>>>>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng <>
>>>>>> wrote:
>>>>>>    The second idea is indeed splendid, we should separate
>>>>>> time-complexity
>>>>>>> first and space-complexity first implementation. What I'm not
>>>>>>> sure,
>>>>>>> is that if we really need to create two interfaces instead of
>>>>>>> Personally, I think 12G heap space is not that high right? Most
>>>>>>>   laptop
>>>>>>   can already handle that (emphasis on laptop). And if we replace
>>>>>>> map
>>>>>>> (the culprit of high memory consumption) with list/linkedList,
>>>>>>> would
>>>>>>> simply degrade time complexity for a linear search to O(n), not
>>>>>>> bad
>>>>>>> either. The current DataModel is a result of careful thoughts
and has
>>>>>>> underwent extensive test, it is easier to expand on top of it
>>>>>>> of
>>>>>>> subverting it.

View raw message