From dev-return-18792-archive-asf-public=cust-asf.ponee.io@nifi.apache.org  Tue Feb 19 13:07:25 2019
Mailing-List: contact dev-help@nifi.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:dev-help@nifi.apache.org>
List-Unsubscribe: <mailto:dev-unsubscribe@nifi.apache.org>
List-Post: <mailto:dev@nifi.apache.org>
List-Id: <dev.nifi.apache.org>
Reply-To: dev@nifi.apache.org
Delivered-To: mailing list dev@nifi.apache.org
MIME-Version: 1.0
References: <CABM7KRK9WLhEaTzsRx1sqaF_2Osqkzxi+06Ro+sLzrPhRZDUyg@mail.gmail.com>
 <CALAQ4HvOyTE8GXAbtrOV5cVXEm-yshg6Pi-ZamJV7nUGxTNREA@mail.gmail.com> <CALJK9a6mv4SBtgo0D9p6whr=K2pScyqYV+hcXpM8K4MUKh9fvw@mail.gmail.com>
In-Reply-To: <CALJK9a6mv4SBtgo0D9p6whr=K2pScyqYV+hcXpM8K4MUKh9fvw@mail.gmail.com>
From: Mike Thomsen <mikerthomsen@gmail.com>
Date: Tue, 19 Feb 2019 08:07:06 -0500
Message-ID: <CALAQ4HtP0-vhN+GQCkcFzdj33cQyq0CHLv5GLhozDePLGu6aDw@mail.gmail.com>
Subject: Re: DetectDuplicateRecord Processor
To: dev@nifi.apache.org
Content-Type: multipart/alternative; boundary="000000000000b7c09105823eebe7"

--000000000000b7c09105823eebe7
Content-Type: text/plain; charset="UTF-8"

I'll have to look at Adam's code in more depth, but I think one reason we
might need two is that I didn't see any ability to just check an existing
record path against the cache and call it a day. For teams using a standard
UUID scheme, that's all we'd need or want. I could be wrong about that, so
Adam, please let me know if I am.
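
A minimal sketch of that simpler check, in Python purely for illustration. It
is not NiFi code: the `record_id` and `split_duplicates` helpers, the
namespace, and the use of a plain `set` in place of a distributed map cache
are all assumptions made for the example.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as every producer shares it.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def record_id(record, key_fields):
    """Derive a deterministic UUIDv5 from the chosen fields of a record."""
    joined = "|".join(str(record[field]) for field in key_fields)
    return str(uuid.uuid5(NAMESPACE, joined))


def split_duplicates(records, id_field, cache):
    """Look up each record's existing ID in the cache; no extra hashing or filter."""
    unique, duplicate = [], []
    for record in records:
        key = record[id_field]
        if key in cache:
            duplicate.append(record)
        else:
            cache.add(key)  # stand-in for a put-if-absent on a shared cache
            unique.append(record)
    return unique, duplicate
```

Because UUIDv5 is a name-based hash, identical field values always yield the
same ID, so an exact cache lookup on that one field is enough to spot a repeat
without false positives.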

On Tue, Feb 19, 2019 at 7:28 AM Joe Witt <joe.witt@gmail.com> wrote:

> Mike, Adam,
>
> It appears the distinction of interest here between the two general
> approaches is less about in-mem vs map cache and instead is more about
> approximate/fast detection vs certain/depending on size of cache
> approaches.
>
> I'm not sure if this is quite right or if the distinction warrants two
> processors but this is my first impression.
>
> But it is probably best if the two of you, as contributors to this problem,
> discuss and find consensus.
>
> Thanks
>
> On Sat, Feb 16, 2019 at 9:33 PM Mike Thomsen <mikerthomsen@gmail.com>
> wrote:
>
> > Thanks, Adam. The use case I had, in stereotypical agile fashion could be
> > summarized like this:
> >
> > "As a NiFi user, I want to be able to generate UUIDv5 IDs for all of my
> > record sets and then have a downstream processor check each generated UUID
> > against the existing ingested data to see if there is an existing row with
> > that UUID."
> >
> > For us, at least, false positives are something that we would need to be
> > fairly aggressive in preventing.
> >
> > One possibility here is that we split the difference with your contribution
> > being an in-memory deduplicator and mine going purely against a distributed
> > map cache client. I think there might be enough ground to cover that we
> > might want to have two approaches to this problem that specialize rather
> > than a one-size-fits-most single solution.
> >
> > Thanks,
> >
> > Mike
> >
> > On Sat, Feb 16, 2019 at 9:18 PM Adam Fisher <fisher1987@gmail.com> wrote:
> >
> > > Hello NiFi developers! I'm new to NiFi and decided to create a
> > > *DetectDuplicateRecord* processor. Mike Thomsen also created an
> > > implementation about the same time. It was suggested we open this up for
> > > discussion with the community to identify use cases.
> > >
> > > Below are the two implementations each with their respective properties.
> > >
> > >    - https://issues.apache.org/jira/browse/NIFI-6014
> > >       - *Record Reader*
> > >       - *Record Writer*
> > >       - *Cache Service*
> > >       - *Lookup Record Path:* The record path operation to use for
> > >       generating the lookup key for each record.
> > >       - *Cache Value Strategy:* This determines what will be written to the
> > >       cache from the record. It can be either a literal value or the result
> > >       of a record path operation.
> > >       - *Cache Value:* This is the value that will be written to the cache
> > >       at the appropriate record and record key if it does not exist.
> > >       - *Don't Send Empty Record Sets:* Same as "Include Zero Record
> > >       FlowFiles" below
> > >
> > >    - https://issues.apache.org/jira/browse/NIFI-6047
> > >       - *Record Reader*
> > >       - *Record Writer*
> > >       - *Include Zero Record FlowFiles*
> > >       - *Cache The Entry Identifier:* Similar to DetectDuplicate
> > >       - *Distributed Cache Service:* Similar to DetectDuplicate
> > >       - *Age Off Duration:* Similar to DetectDuplicate
> > >       - *Record Hashing Algorithm:* The algorithm used to hash the combined
> > >       result of RecordPath values in the cache.
> > >       - *Filter Type:* The filter used to determine whether a record has
> > >       been seen before based on the matching RecordPath criteria defined by
> > >       user-defined properties. Current options are *HashSet* or
> > >       *BloomFilter*.
> > >       - *Filter Capacity Hint:* An estimation of the total number of unique
> > >       records to be processed.
> > >       - *BloomFilter Probability:* The desired false positive probability
> > >       when using the BloomFilter filter type.
> > >       - *<User Defined Properties>:* The name of the property is a record
> > >       path. All record paths are resolved on each record to determine the
> > >       unique value for a record. The value of the user-defined property is
> > >       ignored. Initial thought however was to make the value expose field
> > >       variables sort of how UpdateRecord does (i.e. ${field.value})
> > >
> > > There are many ways duplicate records could be detected. Offering the user
> > > the ability to:
> > >
> > >    - *Specify the cache identifier* means users can use the same identifier
> > >    in different DetectDuplicateRecord blocks in different process groups.
> > >    Specifying a unique name based on the file name for example will
> > >    conversely isolate the unique check to just the daily load of a
> > >    specific file.
> > >    - *Set a cache expiration* lets users do things like set it to last for
> > >    24 hours so we only store unique cache information from one day to the
> > >    next. This is useful when you are doing a daily file load and you only
> > >    want to process the new records or the records that changed.
> > >    - *Select a filter type* will allow you to optimize for memory usage. I
> > >    need to process multi-GB sized files and keeping a hash of each of
> > >    those is going to get expensive with a HashSet in memory. But offering a
> > >    BloomFilter is acceptable especially when you are doing database
> > >    operations downstream and don't care if you have some false positives
> > >    but it will reduce the number of attempted duplicate inserts/updates
> > >    you perform.
> > >
> > >
> > > Here's to hoping this finds you all warm and well. I love this software!
> > >
> > >
> > > Adam
> > >
> >
>
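
To make the HashSet versus BloomFilter trade-off quoted above concrete, here
is a minimal Bloom filter sketch in Python. It is an illustration only, not
the code from either JIRA ticket: the class, the `bloom_size` helper, and the
SHA-256 double-hashing scheme are assumptions. The sizing uses the standard
formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions,
which would correspond roughly to the "Filter Capacity Hint" (n) and
"BloomFilter Probability" (p) properties.

```python
import hashlib
import math


def bloom_size(capacity, fp_probability):
    """Return (num_bits, num_hashes) for the expected item count and target rate."""
    num_bits = math.ceil(-capacity * math.log(fp_probability) / (math.log(2) ** 2))
    num_hashes = max(1, round(num_bits / capacity * math.log(2)))
    return num_bits, num_hashes


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        # May answer True for a value never added; never answers False for an added one.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))
```

The filter's memory is fixed by the capacity hint and target probability no
matter how large each record is, whereas an exact set grows with every stored
hash. The cost is that a never-seen key can occasionally test as seen, which
is the behavior a use case that cannot tolerate false positives has to avoid.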

--000000000000b7c09105823eebe7--