nifi-dev mailing list archives

From Joe Witt <joe.w...@gmail.com>
Subject Re: DetectDuplicateRecord Processor
Date Tue, 19 Feb 2019 12:27:43 GMT
Mike, Adam,

It appears the distinction of interest here between the two general
approaches is less about in-memory vs. map cache, and more about
approximate/fast detection vs. exact detection whose cost depends on the
size of the cache.

I'm not sure this is quite right, or whether the distinction warrants two
processors, but that is my first impression.

But it is probably best if the two of you, as contributors to this problem,
discuss and find consensus.

Thanks

On Sat, Feb 16, 2019 at 9:33 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:

> Thanks, Adam. The use case I had could, in stereotypical agile fashion,
> be summarized like this:
>
> "As a NiFi user, I want to be able to generate UUIDv5 IDs for all of my
> record sets and then have a downstream processor check each generated UUID
> against the existing ingested data to see if there is an existing row with
> that UUID."
>
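> A minimal sketch of that UUIDv5 step, assuming a fixed namespace UUID
> (the names here are mine, not from either PR; the JDK's
> UUID.nameUUIDFromBytes only covers version 3/MD5, so version 5/SHA-1 has
> to be assembled by hand):
>
>     import java.nio.ByteBuffer;
>     import java.nio.charset.StandardCharsets;
>     import java.security.MessageDigest;
>     import java.security.NoSuchAlgorithmException;
>     import java.util.UUID;
>
>     public class UuidV5 {
>         // Name-based (version 5, SHA-1) UUID from a namespace and a name.
>         public static UUID nameUuidV5(UUID namespace, String name) {
>             try {
>                 MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
>                 sha1.update(ByteBuffer.allocate(16)
>                         .putLong(namespace.getMostSignificantBits())
>                         .putLong(namespace.getLeastSignificantBits())
>                         .array());
>                 sha1.update(name.getBytes(StandardCharsets.UTF_8));
>                 byte[] hash = sha1.digest();  // 20 bytes; keep the first 16
>                 hash[6] = (byte) ((hash[6] & 0x0f) | 0x50);  // version 5
>                 hash[8] = (byte) ((hash[8] & 0x3f) | 0x80);  // IETF variant
>                 ByteBuffer bb = ByteBuffer.wrap(hash);
>                 return new UUID(bb.getLong(), bb.getLong());
>             } catch (NoSuchAlgorithmException e) {
>                 throw new IllegalStateException(e);
>             }
>         }
>     }
>
> The point being that the ID is deterministic: the same namespace and the
> same canonicalized record text always yield the same UUID, which is what
> makes the downstream duplicate check possible.
>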
> For us, at least, false positives are something that we would need to be
> fairly aggressive in preventing.
>
> One possibility here is that we split the difference, with your
> contribution being an in-memory deduplicator and mine going purely
> against a distributed map cache client. There may be enough ground to
> cover that two specialized approaches are preferable to a single
> one-size-fits-most solution.
>
> Thanks,
>
> Mike
>
> On Sat, Feb 16, 2019 at 9:18 PM Adam Fisher <fisher1987@gmail.com> wrote:
>
> > Hello NiFi developers! I'm new to NiFi and decided to create a
> > *DetectDuplicateRecord* processor. Mike Thomsen also created an
> > implementation around the same time. It was suggested we open this up
> > for discussion with the community to identify use cases.
> >
> > Below are the two implementations, each with their respective properties.
> >
> >    - https://issues.apache.org/jira/browse/NIFI-6014
> >       - *Record Reader*
> >       - *Record Writer*
> >       - *Cache Service*
> >       - *Lookup Record Path:* The record path operation used to
> >       generate the lookup key for each record (see the sketch after
> >       this list).
> >       - *Cache Value Strategy:* Determines what will be written to the
> >       cache for each record: either a literal value or the result of a
> >       record path operation.
> >       - *Cache Value:* The value written to the cache under the
> >       record's key if it does not already exist.
> >       - *Don't Send Empty Record Sets:* Same as "Include Zero Record
> >       FlowFiles" below.
> >
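> > As a rough illustration of the "Lookup Record Path" idea, a lookup key
> > could be derived along these lines (a sketch against NiFi's
> > record-path API; buildLookupKey is a hypothetical helper, not code
> > from the PR):
> >
> >     import java.util.stream.Collectors;
> >
> >     import org.apache.nifi.record.path.FieldValue;
> >     import org.apache.nifi.record.path.RecordPath;
> >     import org.apache.nifi.serialization.record.Record;
> >
> >     public class LookupKeys {
> >         // Resolve one record path against a record and join the
> >         // selected values into a single cache key.
> >         public static String buildLookupKey(RecordPath path, Record record) {
> >             return path.evaluate(record)
> >                        .getSelectedFields()
> >                        .map(FieldValue::getValue)
> >                        .map(String::valueOf)
> >                        .collect(Collectors.joining("|"));
> >         }
> >     }
> >
> > For example, buildLookupKey(RecordPath.compile("/id"), record) would
> > key the cache on each record's id field.
> >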
> >    - https://issues.apache.org/jira/browse/NIFI-6047
> >       - *Record Reader*
> >       - *Record Writer*
> >       - *Include Zero Record FlowFiles*
> >       - *Cache The Entry Identifier:* Similar to DetectDuplicate.
> >       - *Distributed Cache Service:* Similar to DetectDuplicate.
> >       - *Age Off Duration:* Similar to DetectDuplicate.
> >       - *Record Hashing Algorithm:* The algorithm used to hash the
> >       combined result of the RecordPath values before caching.
> >       - *Filter Type:* The filter used to determine whether a record
> >       has been seen before, based on the RecordPath criteria defined
> >       by the user-defined properties. Current options are *HashSet*
> >       and *BloomFilter* (see the sketch after this list).
> >       - *Filter Capacity Hint:* An estimate of the total number of
> >       unique records to be processed.
> >       - *BloomFilter Probability:* The desired false-positive
> >       probability when using the BloomFilter filter type.
> >       - *<User Defined Properties>:* The name of each property is a
> >       record path. All record paths are resolved on each record to
> >       determine the record's unique value. The value of the
> >       user-defined property is currently ignored; the initial thought,
> >       however, was to have the value expose field variables the way
> >       UpdateRecord does (i.e. ${field.value}).
> >
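> > For the Filter Type option, Guava's BloomFilter is one off-the-shelf
> > way to back the BloomFilter choice (Guava is my assumption here, not
> > necessarily what the implementation uses):
> >
> >     import java.nio.charset.StandardCharsets;
> >
> >     import com.google.common.hash.BloomFilter;
> >     import com.google.common.hash.Funnels;
> >
> >     public class SeenRecords {
> >         // The two constructor arguments map directly to the Filter
> >         // Capacity Hint and BloomFilter Probability properties.
> >         private final BloomFilter<String> seen = BloomFilter.create(
> >                 Funnels.stringFunnel(StandardCharsets.UTF_8),
> >                 1_000_000,  // expected unique records
> >                 0.001);     // acceptable false-positive probability
> >
> >         // Returns true the first time a record hash is observed; a
> >         // false return may (rarely) be a false positive.
> >         public boolean isFirstSighting(String recordHash) {
> >             if (seen.mightContain(recordHash)) {
> >                 return false;  // probably a duplicate
> >             }
> >             seen.put(recordHash);
> >             return true;
> >         }
> >     }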
> >
> > There are many ways duplicate records could be detected. Offering the
> > user the ability to:
> >
> >    - *Specify the cache identifier* means users can use the same
> >    identifier in DetectDuplicateRecord instances in different process
> >    groups. Conversely, specifying a unique name based on, say, the file
> >    name isolates the uniqueness check to just the daily load of a
> >    specific file.
> >    - *Set a cache expiration* lets users do things like have entries
> >    last 24 hours, so unique cache information is only carried from one
> >    day to the next. This is useful when you are doing a daily file load
> >    and only want to process the new or changed records.
> >    - *Select a filter type* allows you to optimize for memory usage. I
> >    need to process multi-GB files, and keeping a hash of every record
> >    in an in-memory HashSet gets expensive. A BloomFilter is acceptable,
> >    especially when you are doing database operations downstream and
> >    don't mind some false positives; it still reduces the number of
> >    attempted duplicate inserts/updates you perform (see the sizing
> >    sketch below).
> >
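> > To make that memory trade-off concrete, the standard sizing formula
> > for an optimal Bloom filter, m = -n * ln(p) / (ln 2)^2 bits, can be
> > evaluated directly (the record count and probability below are
> > illustrative, not from either JIRA):
> >
> >     public class BloomSizing {
> >         public static void main(String[] args) {
> >             long n = 100_000_000L;  // expected unique records
> >             double p = 0.01;        // acceptable false-positive rate
> >
> >             // m = -n * ln(p) / (ln 2)^2, in bits
> >             double bits = -n * Math.log(p) / (Math.log(2) * Math.log(2));
> >             System.out.printf("BloomFilter: ~%.0f MB%n",
> >                     bits / 8 / 1_048_576);
> >
> >             // Very rough in-memory HashSet cost: a 32-char hash costs
> >             // on the order of 100 bytes once String and entry overhead
> >             // are counted.
> >             System.out.printf("HashSet (rough): ~%.0f MB%n",
> >                     n * 100.0 / 1_048_576);
> >         }
> >     }
> >
> > For 100M records at a 1% false-positive rate that works out to roughly
> > 115 MB of filter versus on the order of 10 GB of HashSet, which is
> > exactly the gap that matters for multi-GB files.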
> >
> > Here's hoping this finds you all warm and well. I love this software!
> >
> >
> > Adam
> >
>
