From: Mike Thomsen
Date: Tue, 19 Feb 2019 08:07:06 -0500
Subject: Re: DetectDuplicateRecord Processor
To: dev@nifi.apache.org

I'll have to look at Adam's code in more depth, but I think one reason we might need
two is that I didn't see any ability to just check an existing record path
against the cache and call it a day. For teams using a standard UUID scheme,
that's all we'd need or want. Could be wrong about that, and Adam, please let
me know if I am.

On Tue, Feb 19, 2019 at 7:28 AM Joe Witt wrote:

> Mike, Adam,
>
> It appears the distinction of interest here between the two general
> approaches is less about in-mem vs map cache and instead is more about
> approximate/fast detection vs certain/depending on size of cache
> approaches.
>
> I'm not sure if this is quite right or if the distinction warrants two
> processors but this is my first impression.
>
> But it is probably best if the two of you, as contributors to this
> problem, discuss and find consensus.
>
> Thanks
>
> On Sat, Feb 16, 2019 at 9:33 PM Mike Thomsen wrote:
>
> > Thanks, Adam. The use case I had, in stereotypical agile fashion, could
> > be summarized like this:
> >
> > "As a NiFi user, I want to be able to generate UUIDv5 IDs for all of my
> > record sets and then have a downstream processor check each generated
> > UUID against the existing ingested data to see if there is an existing
> > row with that UUID."
> >
> > For us, at least, false positives are something that we would need to
> > be fairly aggressive in preventing.
> >
> > One possibility here is that we split the difference with your
> > contribution being an in-memory deduplicator and mine going purely
> > against a distributed map cache client. I think there might be enough
> > ground to cover that we might want to have two approaches to this
> > problem that specialize rather than a one-size-fits-most single
> > solution.
> >
> > Thanks,
> >
> > Mike
> >
> > On Sat, Feb 16, 2019 at 9:18 PM Adam Fisher wrote:
> >
> > > Hello NiFi developers! I'm new to NiFi and decided to create a
> > > *DetectDuplicateRecord* processor. Mike Thomsen also created an
> > > implementation about the same time. It was suggested we open this up
> > > for discussion with the community to identify use cases.
> > >
> > > Below are the two implementations each with their respective
> > > properties.
> > >
> > > - https://issues.apache.org/jira/browse/NIFI-6014
> > >    - *Record Reader*
> > >    - *Record Writer*
> > >    - *Cache Service*
> > >    - *Lookup Record Path:* The record path operation to use for
> > >      generating the lookup key for each record.
> > >    - *Cache Value Strategy:* This determines what will be written to
> > >      the cache from the record. It can be either a literal value or
> > >      the result of a record path operation.
> > >    - *Cache Value:* This is the value that will be written to the
> > >      cache at the appropriate record and record key if it does not
> > >      exist.
> > >    - *Don't Send Empty Record Sets:* Same as "Include Zero Record
> > >      FlowFiles" below
> > >
> > > - https://issues.apache.org/jira/browse/NIFI-6047
> > >    - *Record Reader*
> > >    - *Record Writer*
> > >    - *Include Zero Record FlowFiles*
> > >    - *Cache The Entry Identifier:* Similar to DetectDuplicate
> > >    - *Distributed Cache Service:* Similar to DetectDuplicate
> > >    - *Age Off Duration:* Similar to DetectDuplicate
> > >    - *Record Hashing Algorithm:* The algorithm used to hash the
> > >      combined result of RecordPath values in the cache.
> > >    - *Filter Type:* The filter used to determine whether a record has
> > >      been seen before based on the matching RecordPath criteria
> > >      defined by user-defined properties. Current options are
> > >      *HashSet* or *BloomFilter*.
> > >    - *Filter Capacity Hint:* An estimation of the total number of
> > >      unique records to be processed.
> > >    - *BloomFilter Probability:* The desired false positive
> > >      probability when using the BloomFilter filter type.
> > >    - *:* The name of the property is a record path. All record paths
> > >      are resolved on each record to determine the unique value for a
> > >      record. The value of the user-defined property is ignored.
> > >      Initial thought however was to make the value expose field
> > >      variables sort of how UpdateRecord does (i.e. ${field.value})
> > >
> > > There are many ways duplicate records could be detected. Offering the
> > > user the ability to:
> > >
> > >    - *Specify the cache identifier* means users can use the same
> > >      identifier in different DetectDuplicateRecord blocks in
> > >      different process groups. Specifying a unique name based on the
> > >      file name, for example, will conversely isolate the unique check
> > >      to just the daily load of a specific file.
> > >    - *Set a cache expiration* lets users do things like set it to
> > >      last for 24 hours so we only store unique cache information from
> > >      one day to the next. This is useful when you are doing a daily
> > >      file load and you only want to process the new records or the
> > >      records that changed.
> > >    - *Select a filter type* will allow you to optimize for memory
> > >      usage. I need to process multi-GB sized files and keeping a hash
> > >      of each of those is going to get expensive with a HashSet in
> > >      memory. But offering a BloomFilter is acceptable, especially
> > >      when you are doing database operations downstream and don't
> > >      care if you have some false positives, but it will reduce the
> > >      number of attempted duplicate inserts/updates you perform.
> > >
> > > Here's to hoping this finds you all warm and well. I love this
> > > software!
> > >
> > > Adam
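
As a rough illustration of the HashSet versus BloomFilter trade-off discussed
in this thread, here is a minimal Python sketch. It is not the code from
NIFI-6014 or NIFI-6047. The names BloomFilter, record_key and split_duplicates
are hypothetical, and using SHA-256 over a plain list of field names in place
of RecordPath evaluation is an assumption made only for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, capacity, false_positive_probability):
        # Standard sizing formulas: m bits and k hash functions for the
        # expected number of unique keys and the target false positive rate.
        ln2 = math.log(2)
        self.num_bits = max(
            1, math.ceil(-capacity * math.log(false_positive_probability) / ln2 ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / capacity * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive k bit positions from the two halves of a
        # single SHA-256 digest of the key.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:16], "big")
        h2 = int.from_bytes(digest[16:], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def record_key(record, fields):
    # Hypothetical stand-in for "hash the combined result of the record
    # path values": join the selected field values and hash them.
    joined = "\x1f".join(str(record.get(name)) for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).digest()


def split_duplicates(records, fields, seen):
    # `seen` is any object supporting `in` and `.add`, so an exact set()
    # and the approximate BloomFilter above are interchangeable here.
    unique, duplicates = [], []
    for record in records:
        key = record_key(record, fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)
            unique.append(record)
    return unique, duplicates
```

Passing set() as `seen` gives exact detection with memory that grows with the
number of distinct keys. Passing BloomFilter(capacity, p) keeps memory fixed
and never misses a true duplicate, but it can flag a new record as a duplicate
with probability of roughly p once about `capacity` keys have been added. That
is the behaviour Adam describes as acceptable ahead of database upserts, and
the one Mike's use case needs to avoid.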