nifi-dev mailing list archives

From Mike Thomsen <mikerthom...@gmail.com>
Subject Re: Proposal: standard record metadata attributes for data sources
Date Mon, 14 May 2018 22:32:50 GMT
Does the provenance system have the ability to add user-defined key/value
pairs to a flowfile's provenance record at a particular processor?

On Mon, May 14, 2018 at 6:11 PM Andy LoPresto <alopresto@apache.org> wrote:

> I would actually propose that this be added to the provenance but not
> always put into the flowfile attributes. There are many scenarios in which
> the data retrieval should be separated from the analysis/follow-on, for
> visibility, responsibility, and security reasons. While I understand a
> separate UpdateAttribute processor could be put in the downstream flow to
> remove these attributes, I would push for not adding them by default as a
> more secure approach. Perhaps this could be configurable on the Get*
> processor via a boolean property, but I think doing it automatically by
> default introduces some serious concerns.
>
>
> Andy LoPresto
> alopresto@apache.org
> alopresto.apache@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On May 13, 2018, at 11:48 AM, Mike Thomsen <mikerthomsen@gmail.com> wrote:
>
> @Joe @Matt
>
> This is kinda related to the point that Joe made in the graph DB thread
> about provenance. My thought here was that we need some standards on
> enriching the metadata about what was fetched so that no matter how you
> store the provenance, you can find some way to query it for questions like
> when a data set was loaded into NiFi, how many records went through a
> terminating processor, etc. IMO this could help batch-oriented
> organizations feel more at ease with something stream-oriented like NiFi.
>
> On Fri, Apr 13, 2018 at 4:01 PM Mike Thomsen <mikerthomsen@gmail.com>
> wrote:
>
> I'd like to propose that all Get/Fetch/Query processors that are neither
> deprecated nor likely to be deprecated get a standard convention for
> attributes that describe things like:
>
> 1. Source system.
> 2. Database/table/index/collection/etc.
> 3. The lookup criteria that were used (similar to the "query attribute"
> some already have).
>
> Using GetMongo as an example, it would add something like this:
>
> source.url=mongodb://localhost:27017
> source.database=testdb
> source.collection=test_collection
> source.query={ "username": "john.smith" }
> source.criteria.username=john.smith  // GetMongo would parse the query and add this.
>
> We have a use case where a team is coming from an extremely batch-oriented
> view and really wants to know when "dataset X" was run. Our solution was to
> extract that from the result set because the dataset name is one of the
> fields in the JSON body.
>
> I think this would help expand what you can do out of the box with
> provenance tracking because it would provide a lot of useful information
> that could be stored in Solr or ES and then queried against terminating
> processors' DROP events to get a solid window into when jobs were run
> historically.
>
> Thoughts?
>
>
>
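
The GetMongo example in the original proposal can be sketched roughly as follows. This is only an illustration of the proposed `source.*` naming convention, written in Python rather than as NiFi processor code. The function name `build_source_attributes` is hypothetical, and the rule of flattening only top-level equality criteria (skipping operators such as `$gt` and nested documents) is an assumption, since the thread does not say how complex queries would be handled.

```python
import json


def build_source_attributes(url, database, collection, query):
    """Return a flat dict of source.* attributes describing one fetch.

    Hypothetical helper: the attribute names follow the proposal in this
    thread; the flattening rule below is an assumption.
    """
    attributes = {
        "source.url": url,
        "source.database": database,
        "source.collection": collection,
        # Keep the full query as a JSON string so nothing is lost.
        "source.query": json.dumps(query),
    }
    for field, value in query.items():
        # Only simple top-level equality matches become criteria attributes.
        # Operators ($and, $or, ...) and nested documents are skipped.
        if field.startswith("$") or isinstance(value, (dict, list)):
            continue
        attributes["source.criteria." + field] = str(value)
    return attributes


attrs = build_source_attributes(
    "mongodb://localhost:27017",
    "testdb",
    "test_collection",
    {"username": "john.smith"},
)
for key in sorted(attrs):
    print(f"{key}={attrs[key]}")
```

Whether such values end up as flowfile attributes, only in the provenance record, or behind a boolean property on the processor is exactly the open question raised in the replies above; the sketch takes no position on that.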
