nifi-dev mailing list archives

From Andy LoPresto <alopre...@apache.org>
Subject Re: Proposal: standard record metadata attributes for data sources
Date Mon, 14 May 2018 22:11:17 GMT
I would actually propose that this be added to the provenance but not always put into the flowfile
attributes. There are many scenarios in which the data retrieval should be separated from
the analysis/follow-on for visibility, responsibility, and security reasons. While
I understand a separate UpdateAttribute processor could be put in the downstream flow to remove
these attributes, I would push for not adding them by default as a more secure approach. Perhaps
this could be configurable on the Get* processor via a boolean property, but I think doing
it automatically by default introduces some serious concerns.
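
As a rough illustration of that opt-in behavior, here is a minimal sketch in plain Python
(not NiFi API code). The function name source_attributes and the expose flag are hypothetical
stand-ins for a Get* processor and its boolean property; the source.* names and sample values
come from the GetMongo example quoted below. The metadata is always returned for provenance,
and is copied into the flowfile attributes only when the flag is set.

```python
import json


def source_attributes(url, database, collection, query, expose=False):
    """Build the proposed source.* metadata for one fetch.

    Returns (flowfile_attrs, provenance_details). The metadata is always
    produced for provenance; it is copied into the flowfile attributes only
    when expose is True (hypothetical stand-in for a boolean property).
    """
    meta = {
        "source.url": url,
        "source.database": database,
        "source.collection": collection,
        "source.query": json.dumps(query),
    }
    # Assumption: only simple top-level equality criteria are flattened;
    # operators ($or, $gt, ...) and nested documents are skipped.
    for field, value in query.items():
        if not field.startswith("$") and not isinstance(value, (dict, list)):
            meta["source.criteria." + field] = str(value)

    flowfile_attrs = dict(meta) if expose else {}
    return flowfile_attrs, meta


# Default: nothing is added to the flowfile, provenance still gets everything.
attrs, prov = source_attributes(
    "mongodb://localhost:27017", "testdb", "test_collection",
    {"username": "john.smith"},
)
```

With the default, a downstream flow never sees the source.* values unless someone turns the
flag on, so there is nothing for an UpdateAttribute processor to strip afterwards.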


Andy LoPresto
alopresto@apache.org
alopresto.apache@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On May 13, 2018, at 11:48 AM, Mike Thomsen <mikerthomsen@gmail.com> wrote:
> 
> @Joe @Matt
> 
> This is kinda related to the point that Joe made in the graph DB thread
> about provenance. My thought here was that we need some standards on
> enriching the metadata about what was fetched so that no matter how you
> store the provenance, you can find some way to query it for questions like
> when a data set was loaded into NiFi, how many records went through a
> terminating processor, etc. IMO this could help batch-oriented
> organizations feel more at ease with something stream-oriented like NiFi.
> 
> On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:
> 
>> I'd like to propose that all non-deprecated (or likely to be deprecated)
>> Get/Fetch/Query processors get a standard convention for attributes that
>> describe things like:
>> 
>> 1. Source system.
>> 2. Database/table/index/collection/etc.
>> 3. The lookup criteria that were used (similar to the "query attribute"
>> some already have).
>> 
>> Using GetMongo as an example, it would add something like this:
>> 
>> source.url=mongodb://localhost:27017
>> source.database=testdb
>> source.collection=test_collection
>> source.query={ "username": "john.smith" }
>> source.criteria.username=john.smith //GetMongo would parse the query and
>> add this.
>> 
>> We have a use case where a team is coming from an extremely batch-oriented
>> view and really wants to know when "dataset X" was run. Our solution was to
>> extract that from the result set because the dataset name is one of the
>> fields in the JSON body.
>> 
>> I think this would help expand what you can do out of the box with
>> provenance tracking because it would provide a lot of useful information
>> that could be stored in Solr or ES and then queried against terminating
>> processors' DROP events to get a solid window into when jobs were run
>> historically.
>> 
>> Thoughts?
>> 

