nifi-dev mailing list archives

From Emanuel Oliveira <emanu...@gmail.com>
Subject Re: Provenance Repository and GDPR
Date Thu, 30 Jan 2020 16:51:06 GMT
But enlighten me, please :) Isn't GDPR just about deleting data from
persistent storage?
In what sense does NiFi relate to GDPR compliance?

   - In terms of FF contents, they are too transient (gone in 12 hours by
   default).
   - I guess the discussion is about the fact that FF attributes are kept in
   the data provenance repo? (gone in 24h by default)

I wonder where the culprit is here. Is it the situation where one wants
to keep a long trace of data provenance, like 6 months, but because
attributes are stored on provenance events, they must be deleted?
I guess it can only be a problem of deleting attributes from the provenance
repo, and not FF contents, right, as they are gone fast enough?
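
To make the question concrete, here is a minimal sketch in Python of
time-based age-off, the only purge mechanism mentioned in this thread. It is
a toy model, not NiFi code: `ProvenanceEvent`, `age_off`, `holds_value` and
the attribute name `customer.email` are all hypothetical. It only shows that
a value stored in an attribute disappears on its own under a 24-hour
retention, but stays around under a 6-month retention.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ProvenanceEvent:
    """Toy stand-in for a provenance event: metadata only, no FF content."""
    flowfile_uuid: str
    event_time: datetime
    attributes: dict = field(default_factory=dict)


def age_off(events, now, max_age):
    """Time-based age-off: keep only events younger than max_age."""
    return [e for e in events if now - e.event_time < max_age]


def holds_value(events, value):
    """True if any retained event still carries the value in an attribute."""
    return any(value in e.attributes.values() for e in events)


now = datetime(2020, 1, 30, 16, 0)
events = [
    # 30 hours old, with personal data copied into an attribute (hypothetical).
    ProvenanceEvent("ff-1", now - timedelta(hours=30),
                    {"customer.email": "jane@example.com"}),
    # 2 hours old, attributes carry no personal data.
    ProvenanceEvent("ff-2", now - timedelta(hours=2),
                    {"filename": "batch-0001.csv"}),
]

short = age_off(events, now, timedelta(hours=24))   # 24h retention
long_ = age_off(events, now, timedelta(days=180))   # roughly 6 months

print(holds_value(short, "jane@example.com"))  # False
print(holds_value(long_, "jane@example.com"))  # True
```

With the short retention nothing has to be deleted by hand; with the long
one the attribute value stays until the event ages off, which is the case
Uwe is asking about.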

Best Regards,
*Emanuel Oliveira*



On Thu, Jan 30, 2020 at 4:42 PM Mike Thomsen <mikerthomsen@gmail.com> wrote:

> > It was created on this side of the Atlantic because when people do care
> > about such things - they REALLY care.
>
> Agreed. I was just commenting on our particular experiences with customers
> in the federal space. There are unfortunately many who still don't get all
> of the accountability and traceability advantages provenance and lineage
> tracking provides.
>
> On Thu, Jan 30, 2020 at 10:32 AM Joe Witt <joe.witt@gmail.com> wrote:
>
> > Mike,
> >
> > It was created on this side of the Atlantic because when people do care
> > about such things - they REALLY care.
> >
> > I anticipate more and more people will care and I hope that day comes
> > soon.  I'm proud of NiFi's ability to be a leader here because if your flow
> > management solution between sensors and processing and storage systems
> > tells you where things came from and went to it is a heck of a good start.
> >
> > What exists in our provenance data is information about the data but this
> > can be 'any attribute' put on a flow file throughout its life in the flow.
> > We simply cannot guarantee this won't be 'content'.  The notion of what is
> > metadata vs content gets blurry fast.
> >
> > Uwe,
> >
> > The data provenance capabilities within NiFi do not support the ability to
> > 'delete records' based on specified parameters.  The only mechanism is
> > space or time based age off.  For now, whatever the obligation is to
> > respond to a right to be forgotten request should be what the provenance
> > within NiFi is configured to hold.  If for instance you have 24 hours then
> > provenance in NiFi should hold no more than 24 hours.
> >
> > I doubt this is something we'll be able to spend time on soon but I agree
> > the idea of being able to purge out records based on more precise
> > parameters is a good one.
> >
> > The intent is not that the built-in nifi provenance store is for long term
> > but rather the records are there long enough to support flow management use
> > cases but are always being exported to a long term store such as Atlas or
> > even just stored in HDFS or other locations for additional use.  One
> > day...a sweet graph database...
> >
> > Thanks
> > Joe
> >
> > On Thu, Jan 30, 2020 at 10:29 AM Emanuel Oliveira <emanueol@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Some recap on NiFi concepts:
> > >
> > >    - Content Repository stores FF contents.
> > >    - Data Provenance events (used to check the lineage and history of FFs)
> > >    only store pointers to FFs (not contents).
> > >    - So one can have data deleted and still access the lineage/data
> > >    provenance history.
> > >
> > > Here's a lot of in-depth detail on the subject, but the 3 points above
> > > summarize it all:
> > > https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> > >
> > >
> > > *DATA - persistent data only exists in 2 scenarios:*
> > >
> > >    - while your flow file is running.
> > >    - archived in the content repository for 12h (to allow access to contents
> > >    when inspecting data provenance/lineage).
> > >
> > >
> > > https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418
> > >
> > >
> > > *PROVENANCE EVENTS (LINEAGE) OF DATA:*
> > >
> > >    - contains only provenance attributes, FF uuid, etc. but NO CONTENTS,
> > >    available for 24h unless changed in the config files.
> > >    - https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> > >
> > >
> > >
> > > So as you see, both expire within a day by default, fast enough that I
> > > don't think GDPR is any problem or that any action is needed.
> > > Now one can always boost retention of just the data provenance events to
> > > months, 1 year or whatever suits. But the data is long gone anyway.
> > >
> > > Best Regards,
> > > *Emanuel Oliveira*
> > >
> > >
> > >
> > > On Thu, Jan 30, 2020 at 2:26 PM Uwe@Moosheimer.com <Uwe@moosheimer.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > > GDPR doesn't need millisecond realtime deletion, right?)
> > > > right.
> > > >
> > > > > since inbound FFs have
> > > > >    normally hundreds, thousands of records that will need to split,
> > > > >    aggregate, in complex flow file, implementing a clean
> > > > It depends on your application. Not everyone uses NiFi for IoT and
> > > > therefore a single record may be included.
> > > >
> > > > > In my opinion your answer to business/management gate keepers is that
> > > > > data will be stored on data provenance for 24h (default) which can be
> > > > > configured, and that
> > > >
> > > > This is not necessarily the point of the Data Lineage, that the
> > > > information is deleted after 24 hours (or whatever is configured).
> > > > If Data Lineage is needed (revision, legal requirements etc.), then
> > > > deleting the data after a defined time is not an option.
> > > >
> > > > This is the reason why Atlas supports it.
> > > >
> > > > Best Regards,
> > > > Uwe
> > > >
> > > > On 30.01.2020 at 15:06, Emanuel Oliveira wrote:
> > > > > Hi, I don't think an API for atomic records makes sense:
> > > > >
> > > > >    1. One configures the retention of data provenance (the default 24h
> > > > >    is "good enough"; GDPR doesn't need millisecond realtime deletion,
> > > > >    right?)
> > > > >    https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#persistent-provenance-repository-properties
> > > > >    2. Even if there were an API to delete FFs with an attribute =
> > > > >    <some id>, that would normally be useless as well, since inbound FFs
> > > > >    normally have hundreds or thousands of records that need to be split
> > > > >    and aggregated in a complex flow. Implementing a clean-up at such an
> > > > >    atomic level would be too hard, and the extra effort is not needed,
> > > > >    since your target single record would surely be part of multiple FF
> > > > >    UUIDs, some holding only your record, but most surely holding 100s
> > > > >    or 1000s of other records with your record somewhere in the middle.
> > > > >
> > > > >
> > > > > In my opinion your answer to business/management gate keepers is that
> > > > > data will be stored on data provenance for 24h (default) which can be
> > > > > configured, and that
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > *Emanuel Oliveira*
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jan 30, 2020 at 1:54 PM Uwe@Moosheimer.com <Uwe@moosheimer.com>
> > > > > wrote:
> > > > >
> > > > >> Dear NiFi developer team,
> > > > >>
> > > > > >> NiFi's Data Provenance and Data Lineage is perfectly adequate in the
> > > > > >> environment of NiFi, so there is often no need to use Atlas.
> > > > >>
> > > > >> When using NiFi with customer data a problem arises.
> > > > > >> The problem is the GDPR requirement that a user has the right to be
> > > > > >> forgotten. Unfortunately, I can't find any API call or information on
> > > > > >> how to delete individual user data from the NiFi Provenance Repository
> > > > > >> based on a user-defined attribute and its defined characteristics.
> > > > >>
> > > > > >> A delete request like "delete all data and dependencies where the
> > > > > >> attribute XYZ has the value 123" is currently not possible to my
> > > > > >> knowledge.
> > > > >>
> > > > >> My questions are:
> > > > >> Is this actually possible and how? And if not, is it planned?
> > > > >>
> > > > >> Thanks
> > > > >> Uwe
> > > > >>
> > > >
> > > >
> > >
> >
>
