mahout-dev mailing list archives

From Pat Ferrel <>
Subject Re: Streaming and incremental cooccurrence
Date Sat, 18 Apr 2015 15:50:57 GMT
I think you are saying that instead of val newHashMap = lastHashMap ++ updateHashMap, layered
updates might be useful since new and last are potentially large. Some limit of updates might
trigger a refresh. This might work if the update works with incremental index updates in the
search engine. Given practical considerations the updates will be numerous and nearly empty.
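As a rough illustration of the layered alternative to a full merge, here is a minimal Scala sketch. The names LayeredIndicators, Layered, add, refresh and maxLayers are hypothetical and not part of Mahout; the sketch assumes indicators are held as a plain map from item id to a list of indicator item ids, and it does not cover the search engine's incremental indexing.

```scala
// Hypothetical sketch; none of these names come from Mahout.
object LayeredIndicators {
  // Assumed shape: item id -> ids of its indicator items.
  type Layer = Map[String, Seq[String]]

  // `updates` holds the small update layers, newest first.
  final case class Layered(base: Layer, updates: List[Layer]) {

    // Newest layer containing the item wins; fall back to the batch result.
    def lookup(item: String): Option[Seq[String]] =
      updates.collectFirst { case l if l.contains(item) => l(item) }
        .orElse(base.get(item))

    // Stack a new update; collapse the layers once too many have piled up.
    def add(update: Layer, maxLayers: Int): Layered = {
      val next = copy(updates = update :: updates)
      if (next.updates.size > maxLayers) next.refresh else next
    }

    // Equivalent to base ++ oldest ++ ... ++ newest, so newer values override.
    def refresh: Layered =
      Layered(updates.foldRight(base)((layer, acc) => acc ++ layer), Nil)
  }
}
```

With this shape each update costs only a list prepend, and the expensive `++` over the large map happens only when maxLayers is exceeded.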

On Apr 17, 2015, at 7:58 PM, Andrew Musselman <> wrote:

I have not implemented it for recommendations but a layered cache/sieve
structure could be useful.

That is, between batch refreshes you can keep tacking on new updates in a
cascading order so values that are updated exist in the newest layer but
otherwise the lookup goes for the latest updated layer.

You can put a fractional multiplier on older layers for aging but again
I've not implemented it.
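A minimal sketch of that lookup with aging, assuming each layer maps a key to a numeric score and that layers are kept newest first. The object name AgedLayers and the decay parameter are illustrative only, and this is one possible reading of the idea rather than a tested design.

```scala
// Hypothetical sketch of a layered lookup with a fractional aging multiplier.
object AgedLayers {
  // `layers` is ordered newest first. The value comes from the newest layer
  // that contains the key, scaled by decay^depth (depth 0 = newest layer).
  def lookup(layers: List[Map[String, Double]], key: String, decay: Double): Option[Double] =
    layers.zipWithIndex.collectFirst {
      case (layer, depth) if layer.contains(key) =>
        layer(key) * math.pow(decay, depth)
    }
}
```

For example, with decay = 0.5 a score found only in the second-newest layer comes back at half its stored value, while a score in the newest layer is returned unchanged.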

On Friday, April 17, 2015, Ted Dunning <> wrote:

> Yes. Also add the fact that the nano batches are bounded tightly in size
> both max and mean. And mostly filtered away anyway.
> Aging is an open question. I have never seen any effect of alternative
> sampling so I would just assume "keep oldest" which just tosses more
> samples. Then occasionally rebuild from batch if you really want aging to
> go right.
> Search updates any more are true realtime also so that works very well.
> Sent from my iPhone
>> On Apr 17, 2015, at 17:20, Pat Ferrel <> wrote:
>> Thanks.
>> This idea is based on a micro-batch of interactions per update, not
> individual ones unless I missed something. That matches the typical input
> flow. Most interactions are filtered away by frequency and number of
> interaction cuts.
>> A couple practical issues
>> In practice won’t this require aging of interactions too? So wouldn’t
> the update require some old interaction removal? I suppose this might just
> take the form of added null interactions representing the geriatric ones?
> Haven’t gone through the math with enough detail to see if you’ve already
> accounted for this.
>> To use actual math (self-join, etc.) we still need to alter the geometry
> of the interactions to have the same row rank as the adjusted total. In
> other words the number of rows in all resulting interactions must be the
> same. Over time this means completely removing rows and columns or allowing
> empty rows in potentially all input matrices.
>> Might not be too bad to accumulate gaps in rows and columns. Not sure if
> it would have a practical impact (to some large limit) as long as it was
> done, to keep the real size more or less fixed.
>> As to realtime, that would be under search engine control through
> incremental indexing and there are a couple ways to do that, not a problem
> afaik. As you point out the query always works and is real time. The index
> update must be frequent and not impact the engine's availability for
> queries.
>> On Apr 17, 2015, at 2:46 PM, Ted Dunning <> wrote:
>> When I think of real-time adaptation of indicators, I think of this:
>>> On Fri, Apr 17, 2015 at 6:51 PM, Pat Ferrel <> wrote:
>>> I’ve been thinking about Streaming (continuous input) and incremental
> cooccurrence.
>>> As interactions stream in from the user it is fairly simple to use
> something like Spark streaming to maintain a moving time window for all
> input, and an update frequency that recalcs all input currently in the time
> window. I’ve done this with the current cooccurrence code but, though
> streaming, this is not incremental.
>>> The current data flow goes from interaction input to geometry and user
> dictionary reconciliation to A’A, A’B etc. After the multiply the resulting
> cooccurrence matrices are LLR weighted/filtered/down-sampled.
>>> Incremental can mean all sorts of things and may imply different
> trade-offs. Did you have anything specific in mind?
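For readers unfamiliar with the LLR weighting step mentioned in the original message, the score for one pair of items can be sketched from a 2x2 contingency table of counts. This is a standalone illustration of the entropy-based log-likelihood ratio, not the thread's or Mahout's code; the object and parameter names are assumptions.

```scala
// Illustrative log-likelihood ratio (G^2) for a 2x2 table of counts.
// k11: both items, k12: first only, k21: second only, k22: neither.
object Llr {
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized Shannon entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  def score(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    // Clamp at zero to absorb floating-point round-off.
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }
}
```

A table whose rows and columns are independent scores close to zero, and a strongly associated pair scores high, which is what makes a threshold on this value usable for filtering cooccurrences.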
