From: Bhupesh Chawda
Date: Wed, 29 Jun 2016 17:02:10 +0530
Subject: Re: APEXMALHAR-1701 Deduper in Malhar
To: dev

Hi All,

I want to validate the de-duplication use cases that will be covered by this implementation.

- *Bounded data set*
  - This is de-duplication of bounded data, for example data sets which are old or fixed, or which may not have a time field at all. Example: last year's transaction records, customer data, etc.
  - The concept of expiry is not needed, as this is a bounded data set.
- *Unbounded data set*
  - This is de-duplication of online streaming data.
  - Expiry is needed because incoming tuples may arrive later than expected. Expiry is always computed by taking the difference between the system time and the event time.

Any feedback is appreciated.

Thanks.

~ Bhupesh

On Mon, Jun 27, 2016 at 11:34 AM, Bhupesh Chawda wrote:

> Hi All,
>
> I am working on adding a De-duplication operator in Malhar library based
> on managed state APIs. I will be working off the already created JIRA -
> https://issues.apache.org/jira/browse/APEXMALHAR-1701 and the initial
> pull request for an AbstractDeduper here:
> https://github.com/apache/apex-malhar/pull/260/files
>
> I am planning to include the following features in the first version:
> 1. Time based de-duplication. Assumption: Tuple_Key -> Tuple_Time
> correlation holds.
> 2. Option to maintain order of incoming tuples.
> 3. Duplicate and Expired ports to emit duplicate and expired tuples
> respectively.
>
> Thanks.
>
> ~ Bhupesh
>
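
To make the expiry rule for the unbounded case concrete, here is a minimal sketch in Java. It is not the AbstractDeduper from the pull request and does not use the managed state APIs. The class name ExpiringDeduper, the Decision enum, the process method, and the expiryMillis parameter are all hypothetical. Comparing the (system time - event time) difference against a configured expiry period is also an assumption about how that difference would be used; the thread only states that the difference is computed.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration of time-based de-duplication.
 * Not the Malhar API: names and the threshold comparison are assumptions.
 */
final class ExpiringDeduper {
    /** Outcome per tuple, loosely mirroring the unique/duplicate/expired ports. */
    enum Decision { UNIQUE, DUPLICATE, EXPIRED }

    private final long expiryMillis;
    // Keys seen so far, mapped to the event time of their first occurrence.
    // A real operator would also evict old entries; this sketch never does.
    private final Map<String, Long> seen = new HashMap<>();

    ExpiringDeduper(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    Decision process(String key, long eventTimeMillis, long systemTimeMillis) {
        // Expiry check: difference between system time and event time.
        long age = systemTimeMillis - eventTimeMillis;
        if (age > expiryMillis) {
            return Decision.EXPIRED;
        }
        if (seen.containsKey(key)) {
            return Decision.DUPLICATE;
        }
        seen.put(key, eventTimeMillis);
        return Decision.UNIQUE;
    }

    public static void main(String[] args) {
        ExpiringDeduper deduper = new ExpiringDeduper(60_000L); // 1 minute expiry
        long now = 1_000_000L; // fixed "system time" so the run is deterministic
        System.out.println(deduper.process("k1", now - 1_000L, now));   // UNIQUE
        System.out.println(deduper.process("k1", now - 1_000L, now));   // DUPLICATE
        System.out.println(deduper.process("k2", now - 120_000L, now)); // EXPIRED
    }
}
```

In this sketch the bounded case corresponds to passing Long.MAX_VALUE as the expiry period, so that no tuple is ever classified as expired and only the duplicate check applies.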