From: Sebastian Schelter
To: dev@mahout.apache.org
Subject: Re: Regarding Online Recommenders
Date: Wed, 17 Jul 2013 13:58:17 -0700
In-Reply-To: <51E703B2.7020406@uowmail.edu.au>
References: <51E4D4A1.3040505@googlemail.com> <51E5646F.3010607@uowmail.edu.au> <51E703B2.7020406@uowmail.edu.au>

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up. I
want to have a simple interface that only supports sequential access (and
allows for very memory-efficient implementations, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what it already does).

Then a recommender such as SGD could state that it only needs sequential
access to the preferences, and you can either feed it a DataModel (so we
don't break backwards compatibility) or a memory-efficient sequential access
thingy.

Does that make sense to you?


2013/7/17 Peng Cheng

> I see, OK so we shouldn't use the old implementation. But I mean, the old
> interface doesn't have to be discarded. The difference between your
> FactorizablePreferences and DataModel is that your model supports
> getPreferences(), which returns all preferences as an iterator, while
> DataModel supports a few older functions that return preferences for an
> individual user or item.
>
> My point is that it is not hard for each of them to implement what it
> lacks: the old DataModel can implement getPreferences() with just a loop in
> an abstract class. Your new FactorizablePreferences can implement those old
> functions with a binary search that takes O(log n) time, or an
> interpolation search that takes O(log log n) time on average. The same goes
> for the online update. It will just be a matter of different speed and
> space, not of a different interface standard, so we can use the old unit
> tests, old examples, old everything. And we will be more flexible in
> writing ensemble recommenders.
>
> Just a few thoughts, I'll have to validate the idea first before creating
> a new JIRA ticket.
>
> Yours Peng
>
> On 13-07-16 02:51 PM, Sebastian Schelter wrote:
>
>> I completely agree, Netflix is less than one gigabyte in a smart
>> representation, 12x more memory is a no-go. The techniques used in
>> FactorizablePreferences allow a much more memory-efficient representation,
>> tested on the KDD Music dataset, which is approx. 2.5 times Netflix and
>> fits into 3GB with that approach.
>>
>> 2013/7/16 Ted Dunning
>>
>>> Netflix is a small dataset. 12G for that seems quite excessive.
>>>
>>> Note also that this is before you have done any work.
>>>
>>> Ideally, 100 million observations should take << 1GB.
>>>
>>> On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng wrote:
>>>
>>>> The second idea is indeed splendid, we should separate time-complexity
>>>> first and space-complexity first implementations. What I'm not quite
>>>> sure about is whether we really need to create two interfaces instead
>>>> of one. Personally, I think 12G heap space is not that high, right?
>>>> Most new laptops can already handle that (emphasis on laptop). And if
>>>> we replace the hash map (the culprit of high memory consumption) with
>>>> list/linkedList, it would simply degrade time complexity for a linear
>>>> search to O(n), not too bad either. The current DataModel is a result
>>>> of careful thought and has undergone extensive testing, it is easier
>>>> to expand on top of it instead of subverting it.
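
For illustration, the interface split proposed at the top of this message could look roughly like the sketch below. This is only a sketch under assumptions: every name in it (Pref, SequentialPreferences, RandomAccessPreferences, ArrayPreferences, MeanRating) is hypothetical and is not Mahout's actual API. RandomAccessPreferences merely stands in for the role DataModel would play, and MeanRating stands in for a consumer such as an SGD recommender that needs only one pass over the data.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical value type: one (user, item, rating) observation.
final class Pref {
    final long userID;
    final long itemID;
    final float value;

    Pref(long userID, long itemID, float value) {
        this.userID = userID;
        this.itemID = itemID;
        this.value = value;
    }
}

// Hypothetical narrow interface: sequential access only.
interface SequentialPreferences {
    Iterator<Pref> getPreferences();
    int numPreferences();
}

// Hypothetical wider interface, standing in for DataModel:
// it extends the narrow one and adds random access.
interface RandomAccessPreferences extends SequentialPreferences {
    Float getPreferenceValue(long userID, long itemID); // null if absent
}

// Sequential-only implementation backed by parallel primitive arrays,
// so no per-preference object is kept in memory between iterations.
final class ArrayPreferences implements SequentialPreferences {
    private final long[] userIDs;
    private final long[] itemIDs;
    private final float[] values;

    ArrayPreferences(long[] userIDs, long[] itemIDs, float[] values) {
        if (userIDs.length != itemIDs.length || itemIDs.length != values.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.userIDs = userIDs;
        this.itemIDs = itemIDs;
        this.values = values;
    }

    @Override
    public int numPreferences() {
        return values.length;
    }

    @Override
    public Iterator<Pref> getPreferences() {
        return new Iterator<Pref>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < values.length;
            }

            @Override
            public Pref next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Pref p = new Pref(userIDs[pos], itemIDs[pos], values[pos]);
                pos++;
                return p;
            }
        };
    }
}

// A consumer that declares it needs only sequential access. Anything
// implementing RandomAccessPreferences could be passed in as well.
final class MeanRating {
    static double of(SequentialPreferences prefs) {
        double sum = 0.0;
        int n = 0;
        Iterator<Pref> it = prefs.getPreferences();
        while (it.hasNext()) {
            sum += it.next().value;
            n++;
        }
        return n == 0 ? 0.0 : sum / n;
    }
}

final class SequentialAccessSketch {
    public static void main(String[] args) {
        SequentialPreferences prefs = new ArrayPreferences(
                new long[] {1, 1, 2},
                new long[] {10, 11, 10},
                new float[] {4f, 2f, 3f});
        System.out.println(MeanRating.of(prefs));
    }
}
```

Because the consumer's parameter type is the narrow interface, existing random-access implementations keep working unchanged, while array-backed ones can skip the hash maps discussed in the quoted thread.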