Subject: Re: Streaming and incremental cooccurrence
From: Pat Ferrel
Date: Sat, 18 Apr 2015 08:50:57 -0700
To: dev@mahout.apache.org

I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered updates might be useful since new and last are potentially large. Some limit of updates might trigger a refresh. This might work if the update works with incremental index updates in the search engine. Given practical considerations the updates will be numerous and nearly empty.

On Apr 17, 2015, at 7:58 PM, Andrew Musselman wrote:

I have not implemented it for recommendations but a layered cache/sieve structure could be useful.

That is, between batch refreshes you can keep tacking on new updates in a cascading order so values that are updated exist in the newest layer but otherwise the lookup goes for the latest updated layer.
You can put a fractional multiplier on older layers for aging but again I've not implemented it.

On Friday, April 17, 2015, Ted Dunning wrote:

>
> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
>
> Search updates any more are true realtime also so that works very well.
>
> Sent from my iPhone
>
>> On Apr 17, 2015, at 17:20, Pat Ferrel wrote:
>>
>> Thanks.
>>
>> This idea is based on a micro-batch of interactions per update, not individual ones unless I missed something. That matches the typical input flow. Most interactions are filtered away by frequency and number of interaction cuts.
>>
>> A couple practical issues
>>
>> In practice won’t this require aging of interactions too? So wouldn’t the update require some old interaction removal? I suppose this might just take the form of added null interactions representing the geriatric ones? Haven’t gone through the math with enough detail to see if you’ve already accounted for this.
>>
>> To use actual math (self-join, etc.) we still need to alter the geometry of the interactions to have the same row rank as the adjusted total. In other words the number of rows in all resulting interactions must be the same. Over time this means completely removing rows and columns or allowing empty rows in potentially all input matrices.
>>
>> Might not be too bad to accumulate gaps in rows and columns. Not sure if it would have a practical impact (to some large limit) as long as it was done, to keep the real size more or less fixed.
>>
>> As to realtime, that would be under search engine control through incremental indexing and there are a couple ways to do that, not a problem afaik. As you point out the query always works and is real time. The index update must be frequent and not impact the engine's availability for queries.
>>
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning wrote:
>>
>>
>> When I think of real-time adaptation of indicators, I think of this:
>>
>> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>>
>>
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel wrote:
>>> I’ve been thinking about Streaming (continuous input) and incremental cooccurrence.
>>>
>>> As interactions stream in from the user it is fairly simple to use something like Spark streaming to maintain a moving time window for all input, and an update frequency that recalcs all input currently in the time window. I’ve done this with the current cooccurrence code but though streaming, this is not incremental.
>>>
>>> The current data flow goes from interaction input to geometry and user dictionary reconciliation to A’A, A’B etc. After the multiply the resulting cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>>
>>> Incremental can mean all sorts of things and may imply different trade-offs. Did you have anything specific in mind?
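
For illustration only, the layered-update idea from the top of this thread (small update layers stacked over a large base map, a fractional multiplier for aging older layers, and a refresh once some limit of layers is reached) might be sketched in Scala roughly as follows. Nobody in the thread implemented this, so every name here (LayeredIndicators, decay, maxLayers) is a hypothetical assumption, and how aging compounds across refreshes is one arbitrary choice among many.

```scala
// Hypothetical sketch; not Mahout code. Layers are ordered newest first.
final case class LayeredIndicators[K](
    layers: List[Map[K, Double]] = Nil,
    decay: Double = 1.0,   // assumed aging multiplier applied once per layer of depth
    maxLayers: Int = 8     // assumed limit of updates that triggers a refresh
) {
  // Push a small update layer on top; collapse everything once the limit is exceeded.
  def update(delta: Map[K, Double]): LayeredIndicators[K] = {
    val pushed = copy(layers = delta :: layers)
    if (pushed.layers.size > maxLayers) pushed.refresh else pushed
  }

  // The newest layer containing the key wins; its value is scaled by decay^depth.
  def get(key: K): Option[Double] =
    layers.zipWithIndex.collectFirst {
      case (layer, depth) if layer.contains(key) => layer(key) * math.pow(decay, depth)
    }

  // Collapse all layers into one map, oldest first so newer entries override.
  // With decay == 1.0 this matches repeated `lastHashMap ++ updateHashMap`.
  def refresh: LayeredIndicators[K] = {
    val flat = layers.zipWithIndex.reverse.foldLeft(Map.empty[K, Double]) {
      case (acc, (layer, depth)) =>
        acc ++ layer.map { case (k, v) => k -> v * math.pow(decay, depth) }
    }
    copy(layers = List(flat))
  }
}

object LayeredIndicatorsDemo {
  def main(args: Array[String]): Unit = {
    val idx = LayeredIndicators[String](decay = 0.5)
      .update(Map("a" -> 1.0, "b" -> 2.0)) // older layer
      .update(Map("a" -> 3.0))             // newest layer overrides "a"
    println(idx.get("a")) // value taken from the newest layer
    println(idx.get("b")) // value taken from the older layer, aged once
    println(idx.refresh.layers.size)
  }
}
```

The point of the layering is that each update only allocates a map the size of the update, which fits the observation that updates will be numerous and nearly empty; the cost moves to lookups, which may have to walk up to maxLayers maps before a refresh.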