apex-dev mailing list archives

From Thomas Weise <tho...@datatorrent.com>
Subject Re: APEXMALHAR-1701 Deduper in Malhar
Date Wed, 29 Jun 2016 15:09:59 GMT
Bhupesh,

Why is there a distinction between bounded and unbounded data? I see the
former as a special case of the latter.

When rewinding the stream or reprocessing the stream in another run, the
operator should produce the same result.

This operator should also be idempotent. That implies that the code relies
on the window timestamp rather than the current system time.
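
To illustrate the point about time, here is a minimal sketch, not the Malhar
implementation and not the windowing support referenced in this message. The
class name, the Decision enum, and the beginWindow/process methods are all
hypothetical. It assumes the platform hands the operator a timestamp for each
streaming window, and it derives every expiry decision from that timestamp
and the tuple's event time, never from the wall clock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a deduper whose expiry logic depends only on the
// window timestamp supplied by the caller, never on System.currentTimeMillis().
final class WindowTimeDeduper {
    enum Decision { UNIQUE, DUPLICATE, EXPIRED }

    private final long expiryMillis;
    // key -> event time of the first tuple seen with that key
    private final Map<String, Long> seen = new HashMap<>();
    private long windowTimestamp;

    WindowTimeDeduper(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    // Assumed to be called once per streaming window with that window's timestamp.
    void beginWindow(long windowTimestampMillis) {
        this.windowTimestamp = windowTimestampMillis;
        // Forget keys whose event time is now outside the expiry horizon.
        seen.values().removeIf(t -> windowTimestamp - t > expiryMillis);
    }

    // Classify one tuple by its key and event time.
    Decision process(String key, long eventTimeMillis) {
        if (windowTimestamp - eventTimeMillis > expiryMillis) {
            return Decision.EXPIRED;
        }
        if (seen.putIfAbsent(key, eventTimeMillis) != null) {
            return Decision.DUPLICATE;
        }
        return Decision.UNIQUE;
    }
}
```

Because the only clock is the window timestamp, replaying the same windows
with the same tuples yields the same sequence of decisions, which is the
property asked for above. Had the expiry check used the system time, a replay
an hour later would mark tuples as expired that the first run accepted.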

All of this should be accomplished by using the windowing support:
https://github.com/apache/apex-malhar/pull/319

Thanks,
Thomas

On Wed, Jun 29, 2016 at 4:32 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> Hi All,
>
> I want to validate the use cases for de-duplication that will be going as
> part of this implementation.
>
>    - *Bounded data set*
>       - This is de-duplication for bounded data, for example data sets
>       that are old or fixed, or that may not have a time field at all.
>       Example: last year's transaction records, customer data, etc.
>       - The concept of expiry is not needed, as this is a bounded data set.
>    - *Unbounded data set*
>       - This is de-duplication of online streaming data.
>       - Expiry is needed because incoming tuples may arrive later than
>       expected. Expiry is always computed by taking the difference
>       between the system time and the event time.
>
> Any feedback is appreciated.
>
> Thanks.
>
> ~ Bhupesh
>
> On Mon, Jun 27, 2016 at 11:34 AM, Bhupesh Chawda <bhupesh@datatorrent.com>
> wrote:
>
> > Hi All,
> >
> > I am working on adding a De-duplication operator in Malhar library based
> > on managed state APIs. I will be working off the already created JIRA -
> > https://issues.apache.org/jira/browse/APEXMALHAR-1701 and the initial
> > pull request for an AbstractDeduper here:
> > https://github.com/apache/apex-malhar/pull/260/files
> >
> > I am planning to include the following features in the first version:
> > 1. Time based de-duplication. Assumption: Tuple_Key -> Tuple_Time
> > correlation holds.
> > 2. Option to maintain order of incoming tuples.
> > 3. Duplicate and Expired ports to emit duplicate and expired tuples
> > respectively.
> >
> > Thanks.
> >
> > ~ Bhupesh
> >
>
