Subject: Re: Streaming and incremental cooccurrence
From: Pat Ferrel <pat@occamsmachete.com>
Date: Sat, 18 Apr 2015 08:50:57 -0700
To: dev@mahout.apache.org

I think you are saying that instead of
val newHashMap = lastHashMap ++ updateHashMap, layered updates might be
useful, since new and last are potentially large. Reaching some limit on
the number of updates might trigger a refresh. This might work if the
update works with incremental index updates in the search engine. Given
practical considerations, the updates will be numerous and nearly empty.
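
A minimal sketch of the contrast above, assuming indicator values keyed by
item ID. All names here (LayeredMap, Layered, refreshIfNeeded, maxLayers)
are hypothetical and not part of Mahout; the point is only that a lookup can
fall through small update layers instead of rebuilding a large map on every
update.

```scala
// Hypothetical sketch, not Mahout code.
object LayeredMap {
  type Layer = Map[String, Double]

  // Eager merge, as in `lastHashMap ++ updateHashMap`:
  // builds a new large map on every update.
  def merge(last: Layer, update: Layer): Layer = last ++ update

  // Layered alternative: small update layers, newest first, over a large base.
  final case class Layered(base: Layer, updates: List[Layer] = Nil) {

    // Adding an update only prepends a (small) layer.
    def add(update: Layer): Layered = copy(updates = update :: updates)

    // The newest layer that contains the key wins; otherwise use the base.
    def get(key: String): Option[Double] =
      updates.collectFirst { case l if l.contains(key) => l(key) }
        .orElse(base.get(key))

    // Once more than `maxLayers` updates have piled up, fold them into a new
    // base, applying the oldest layer first so that newer values overwrite.
    def refreshIfNeeded(maxLayers: Int): Layered =
      if (updates.size <= maxLayers) this
      else Layered(updates.foldRight(base)((layer, acc) => acc ++ layer))
  }
}
```

With many nearly empty updates, each add stays cheap, and the cost of the
full merge is paid only when the refresh limit is reached.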

On Apr 17, 2015, at 7:58 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

I have not implemented it for recommendations but a layered cache/sieve
structure could be useful.

That is, between batch refreshes you can keep tacking on new updates in a
cascading order, so values that are updated exist in the newest layer and
a lookup otherwise falls through to the most recently updated layer that
holds the value.

You can put a fractional multiplier on older layers for aging but again
I've not implemented it.
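
One way the fractional multiplier might look, as a sketch only: a value
found `depth` layers down is scaled by `decay` raised to that depth. The
names AgedLayers, lookup, and decay are hypothetical, and geometric decay
per layer is an assumption; the thread does not specify the form of the
multiplier.

```scala
// Hypothetical sketch, not Mahout code.
object AgedLayers {
  // `layers` is ordered newest first. The newest layer holding `key` wins,
  // and its value is scaled by decay^depth (depth 0 = newest, no aging).
  def lookup(layers: Seq[Map[String, Double]],
             key: String,
             decay: Double): Option[Double] =
    layers.zipWithIndex.collectFirst {
      case (layer, depth) if layer.contains(key) =>
        layer(key) * math.pow(decay, depth)
    }
}
```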

On Friday, April 17, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:

>
> Yes. Also add the fact that the nano batches are bounded tightly in
> size, both max and mean, and are mostly filtered away anyway.
>
> Aging is an open question. I have never seen any effect from
> alternative sampling, so I would just assume "keep oldest", which just
> tosses more samples. Then occasionally rebuild from batch if you really
> want aging to go right.
>
> Search updates these days are truly realtime too, so that works very
> well.
>
>> On Apr 17, 2015, at 17:20, Pat Ferrel <pat@occamsmachete.com> wrote:
>>
>> Thanks.
>>
>> This idea is based on a micro-batch of interactions per update, not
>> individual ones, unless I missed something. That matches the typical
>> input flow. Most interactions are filtered away by the frequency and
>> number-of-interactions cuts.
>>
>> A couple of practical issues:
>>
>> In practice won’t this require aging of interactions too? So wouldn’t
>> the update require removing some old interactions? I suppose this
>> might just take the form of added null interactions representing the
>> geriatric ones? I haven’t gone through the math in enough detail to
>> see if you’ve already accounted for this.
>>
>> To use actual math (self-join, etc.) we still need to alter the
>> geometry of the interactions to have the same row rank as the adjusted
>> total. In other words, the number of rows in all resulting interaction
>> matrices must be the same. Over time this means completely removing
>> rows and columns, or allowing empty rows in potentially all input
>> matrices.
>>
>> It might not be too bad to accumulate gaps in rows and columns. I'm
>> not sure it would have a practical impact (up to some large limit) as
>> long as it was done to keep the real size more or less fixed.
>>
>> As to realtime, that would be under search engine control through
>> incremental indexing, and there are a couple of ways to do that; not a
>> problem afaik. As you point out, the query always works and is real
>> time. The index update must be frequent and must not impact the
>> engine's availability for queries.
>>
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>> When I think of real-time adaptation of indicators, I think of this:
>>
>> http://www.slideshare.net/tdunning/realtime-puppies-and-ponies-evolving-indicator-recommendations-in-realtime
>>
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <pat@occamsmachete.com> wrote:
>>> I’ve been thinking about streaming (continuous input) and incremental
>>> cooccurrence.
>>>
>>> As interactions stream in from the user it is fairly simple to use
>>> something like Spark Streaming to maintain a moving time window for
>>> all input, and an update frequency that recalcs all input currently
>>> in the time window. I’ve done this with the current cooccurrence code
>>> but, though streaming, this is not incremental.
>>>
>>> The current data flow goes from interaction input to geometry and
>>> user dictionary reconciliation to A’A, A’B etc. After the multiply
>>> the resulting cooccurrence matrices are LLR
>>> weighted/filtered/down-sampled.
>>>
>>> Incremental can mean all sorts of things and may imply different
>>> trade-offs. Did you have anything specific in mind?
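
For readers unfamiliar with the LLR weighting step mentioned in the original
message, here is a sketch of the log-likelihood ratio (G²) score on a 2x2
contingency table, in its entropy-based formulation. The object and
parameter names are hypothetical; k11 counts users who interacted with both
items, k12 and k21 count users with only one of the two, and k22 counts
users with neither.

```scala
// Hypothetical sketch of the LLR score for one cooccurrence cell.
object Llr {
  // x * ln(x), using the convention 0 * ln(0) = 0.
  private def xLogX(x: Long): Double =
    if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // LLR = 2 * (H(rows) + H(columns) - H(cells)); larger values mean the
  // cooccurrence is less likely to be due to chance.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    // Clamp tiny negative values caused by floating-point round-off.
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }
}
```

Cells whose score falls below a chosen threshold would be dropped, which is
the filtering part of the weighted/filtered/down-sampled step.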