mahout-dev mailing list archives

From Pat Ferrel <>
Subject Re: Streaming and incremental cooccurrence
Date Thu, 23 Apr 2015 12:53:26 GMT
Removal is not as important as adding (which can be done). Also, removal is often driven by
business logic, like removal from a catalog, so a refresh may be driven by non-math considerations.
Removal of users is only to clean things up and is not required very often. Removed items can
also be filtered out of the recs, mitigating the issue.

The way the downsampling works now is to randomly remove interactions if we know there will
be too many, so that we end up with the right amount. The incremental approach would filter
out all new interactions that are over the limit, since the old interactions are not kept.
This seems to violate the random choice of interactions to cut, but now that I think about
it, does a random choice really matter?
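
To make the contrast concrete, here is a minimal Python sketch of the two behaviors. It is
not Mahout's code: the function names, the per-user cap `max_per_user`, and the idea of
tracking only a running count of kept interactions are all assumptions for illustration.

```python
import random

def batch_downsample(interactions, max_per_user, seed=0):
    """Batch style: all interactions are available, so cut a uniform random subset."""
    if len(interactions) <= max_per_user:
        return list(interactions)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return rng.sample(interactions, max_per_user)

def incremental_filter(kept_count, new_interactions, max_per_user):
    """Incremental style: old interactions are gone, only their count is known.

    Anything arriving after the cap is reached is dropped, so the cut falls
    entirely on the newest interactions instead of on a random subset.
    """
    room = max(0, max_per_user - kept_count)
    accepted = list(new_interactions[:room])
    return accepted, kept_count + len(accepted)
```

In the batch version every interaction has the same chance of surviving; in the incremental
version the survivors are simply whichever ones arrived first.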

On Apr 22, 2015, at 10:01 PM, Ted Dunning <> wrote:

On Wed, Apr 22, 2015 at 8:07 PM, Pat Ferrel <> wrote:

> I think we have been talking about an idea that does an incremental
> approximation, then a refresh every so often to remove any approximation so
> in an ideal world we need both.

Actually, the method I was pushing is exact.  If the sampling is made
deterministic using clever seeds, then deletion is even possible since you
can determine whether an observation was thrown away rather than used to
increment counts.
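
A rough sketch of how deterministic sampling could make deletion possible, in Python. This
is only an illustration under simplifying assumptions, not the actual method or Mahout code:
it uses a fixed sampling rate, a hash of (user, item) in place of a seeded random draw, and
plain per-item counts rather than cooccurrence counts. The names `is_kept`, `SampledCounts`,
and the salt value are hypothetical.

```python
import hashlib
from collections import Counter

def is_kept(user, item, rate, salt="demo-salt"):
    """Deterministic keep/drop decision: same inputs always give the same answer."""
    digest = hashlib.sha256(f"{salt}|{user}|{item}".encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1]
    return u < rate

class SampledCounts:
    """Counts only the observations that the deterministic sampler keeps."""

    def __init__(self, rate):
        self.rate = rate
        self.item_counts = Counter()

    def add(self, user, item):
        if is_kept(user, item, self.rate):
            self.item_counts[item] += 1

    def remove(self, user, item):
        # Recompute the decision: only undo observations that were counted.
        if is_kept(user, item, self.rate):
            self.item_counts[item] -= 1
```

Because the decision can be recomputed at deletion time, nothing about the discarded
observations has to be stored. Note that the counter keeps entries whose count has dropped
to zero unless they are cleaned up separately.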

The only creeping crud aspect of this is the accumulation of zero rows as
things fall out of the accumulation window.  I would be tempted to not
allow deletion and just restart as Pat is suggesting.
