atlas-dev mailing list archives

From "Nigel Jones (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ATLAS-1821) Classification propagation from entity to a derivative or child entity
Date Fri, 26 May 2017 11:43:04 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026177#comment-16026177 ]

Nigel Jones edited comment on ATLAS-1821 at 5/26/17 11:42 AM:
--------------------------------------------------------------

Srikanth,
 
I have some questions as to how tag propagation might work in the following scenario.
 
For governance purposes I have a security classification "confidentiality".
This can take one of four values: public, internal, confidential, topsecret.
 
I would apply it to a database column, e.g. "location".
 
 
An asset cannot be classified with two different confidentialities.
 
I see a few approaches to this:
 1a: preventing such a relationship from being created
 1b: defining precedence when the data is retrieved (consumer-centric interfaces might return
only the "winner", whilst repository-level generic interfaces might return a full list of relationships).
 The precedence could be based on closeness to the entity and/or characteristics of the classification
values (i.e. their order)
 1c: allowing it and not specifying the meaning - this concerns me, as different consumers
may infer different things
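
To make option 1b concrete, here is a minimal sketch in Python. It is illustrative only and not Atlas code: the names Confidentiality, resolve_by_order and resolve_by_closeness are hypothetical, and "most restrictive wins" is just one possible precedence rule, assumed here for the example.

```python
from enum import IntEnum


class Confidentiality(IntEnum):
    # Assumed ordering, least to most restrictive (the four values listed above).
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    TOPSECRET = 3


def resolve_by_order(values):
    """Pick a single "winner" by value order: the most restrictive value.

    Returns None when no classification is present.
    """
    return max(values, default=None)


def resolve_by_closeness(candidates):
    """Pick a "winner" by closeness to the entity.

    candidates is a list of (distance, value) pairs, where distance 0 means
    the classification was applied directly to the entity and larger numbers
    mean it arrived via propagation. The nearest classification wins; ties
    at the same distance fall back to the most restrictive value.
    """
    if not candidates:
        return None
    nearest = min(distance for distance, _ in candidates)
    return max(value for distance, value in candidates if distance == nearest)
```

A consumer-centric interface could return only the resolved value, while a repository-level interface could still return the full list that was passed in.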
 
Also how are we representing this characteristic of the classification?
 
Such a relationship could be created through the repository APIs. Recently a JIRA has also
been opened to discuss collections. This could easily lead to the scenario above
if tag propagation is allowed and IF collections are first-class objects that can have classifications
(as opposed to just being used to support set-based operations whereby the actual columns
would be updated, which I'm OK with):
 
 - add entity to a collection with confidentiality=public
 - add entity to a different collection with confidentiality=topsecret
 
The same situation exists where classifications are associated with terms, if multiple terms
are associated with the column. Even if that isn't "likely" from a business perspective, we
need defined behaviour.
 
I see in the proposal above that
 * A union of all classifications is always presented. This does seem the simpler approach,
but could lead to a dual classification of topsecret and public in the example above. If so
we need to be aware of this and agree what it means. Is it purely down to the application (or a higher
API in the stack) to resolve?
 * Conflict resolution is defined as a manual process. At the API level, would this mean APIs
would fail until a conflict is resolved? For example, would the term association causing the conflicting
propagation fail? Would adding to a collection fail?
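
The difference between the two behaviours can be sketched as follows. This is a hypothetical Python illustration, not the Atlas API: classifications are modelled as (type, value) pairs, and the names propagate_union, propagate_strict and ClassificationConflict are invented for the example.

```python
class ClassificationConflict(Exception):
    """Raised when one classification type would end up with several values."""


def propagate_union(existing, incoming):
    """Union semantics: keep everything, even contradictory values."""
    return set(existing) | set(incoming)


def propagate_strict(existing, incoming):
    """Fail-fast semantics: reject the operation that causes a conflict."""
    merged = set(existing) | set(incoming)
    values_by_type = {}
    for ctype, value in merged:
        values_by_type.setdefault(ctype, set()).add(value)
    conflicts = {t: v for t, v in values_by_type.items() if len(v) > 1}
    if conflicts:
        # The caller (e.g. "add to collection") would fail here.
        raise ClassificationConflict(conflicts)
    return merged
```

With union semantics the column from the collection example ends up holding both confidentiality=public and confidentiality=topsecret; with strict semantics the second "add to collection" call raises instead.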
 
 
David,
 In your example, can I check I understand your scenario:
  * There is a "national insurance number" glossary term
  * this is classified as "confidential" (presumably using a confidentiality classification
which can be one of the values I listed above)
  * the same "national insurance number" is mapped to two columns.
  * One of these columns is a clear-text representation of the national insurance number
  * One of these columns (so persisted in the database rather than computed at access time)
is a masked form - perhaps just the first two characters
  * you assert we need rules
 
I'm not so sure... Surely in this case those two columns have different business meanings?
I would either
 a) Create a different glossary term called "redacted national insurance number" & classify
this as non-confidential, or
 b) Not even store the redacted form, and allow policies (e.g. in Ranger) to do the masking
at runtime
 
Even this scenario brings up tag propagation questions though... In a) above, should "redacted
national insurance number" be related to "national insurance number"? Yes, I think it should.
But should it inherit the classification? No. It could as a default of course, but it would then be overridden
to public. This brings us back to the original question of needing to control propagation.

I wonder if all inbound and outbound links need the ability to
 * Allow outbound propagation
 * Allow inbound propagation
 
Since both parties in the relationship need to have some say in this.
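
As a rough sketch of what I mean (hypothetical Python, not the Atlas relationship model; Link, allow_outbound, allow_inbound and propagates are names invented for this example), each end of a link would carry its own flag and propagation would need both to agree:

```python
from dataclasses import dataclass


@dataclass
class Link:
    source: str
    target: str
    allow_outbound: bool = True  # controlled by the source end of the link
    allow_inbound: bool = True   # controlled by the target end of the link


def propagates(link):
    """A classification flows along the link only if both ends permit it."""
    return link.allow_outbound and link.allow_inbound


# The two terms stay related, but the redacted term refuses inbound
# propagation so it can carry its own classification.
related = Link(
    source="national insurance number",
    target="redacted national insurance number",
    allow_inbound=False,
)
```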
 
Going back to the original point, some classifications may need to be unique; this is an
additional attribute of the classification. Further, it will automatically prevent inbound
propagation of other instances of classifications of the same type.
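
A minimal sketch of that uniqueness rule, again in hypothetical Python rather than anything from the design doc (accept_inbound and unique_types are assumed names): an entity maps each classification type to the set of values it holds, and an inbound instance of a unique type is dropped if one is already present.

```python
def accept_inbound(existing, ctype, value, unique_types):
    """Return the entity's classifications after an inbound propagation.

    existing maps classification type -> set of values already held.
    If ctype is marked unique and the entity already holds an instance of
    it, the inbound classification is blocked and nothing changes.
    """
    updated = {t: set(values) for t, values in existing.items()}
    if ctype in unique_types and updated.get(ctype):
        return updated  # blocked: a unique classification is already present
    updated.setdefault(ctype, set()).add(value)
    return updated
```

With unique_types={"confidentiality"}, a column already classified as public would keep that value when topsecret arrives via propagation, whereas a non-unique type would simply accumulate values.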
 
Apologies if I've not used the right terms as per the design doc/implementation, but hopefully
you get the idea :-)





> Classification propagation from entity to a derivative or child entity
> ----------------------------------------------------------------------
>
>                 Key: ATLAS-1821
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1821
>             Project: Atlas
>          Issue Type: Improvement
>          Components:  atlas-core, atlas-webui
>            Reporter: Srikanth Venkat
>             Fix For: 0.9-incubating
>
>
> User Story:
> As a data steward, I need a scalable way to quickly and efficiently propagate classification
> across the information supply chain to support efficient searches and classification based
> security for compliance and audit purposes.
> This requires:
> 1. Classifications for derivative entities should be inherited from the originator and
> to child entities from parent.
> For example, if a Hive column is classified "Confidential" then resulting column created
> from a CTAS operation should also be tagged "Confidential" to maintain the classification
> of the original entity. In the case where 2 or more entities are composed, the derivative
> entity should have the union of all classifications of each source entity.
> 2. Business Terms:
> a. Child business terms should inherit the classifications associated with the parent
> term.
> b. The option to propagate classification to child business terms in a hierarchy should
> be provided
> c. Ability to update the propagated tags manually via UI or through the API
> d. Tagging a term should propagate to data assets that are already attached to that business
> term as well
> 3. Data assets
> a. For all supported data asset types in Atlas, if a derivative asset is created it should
> inherit the tags and attributes from the original asset.
> b. the option to propagate tags to child entities should be provided (e.g. if you tag
> a folder in HDFS optionally tag all the files within it)
> c. Ability to update the propagated tags manually via UI or through the API
> d. Tagging a parent object should be inherited after child creation dynamically (unless
> a flag is set not to do this)
> e. Derived data assets should have the tags of the original data asset.
> Conflict resolution - if there are different values for attributes on tags (classifications)
> on upstream or parent entities used to derive a data asset then user needs to be prompted
> for action to resolve the conflict. Once resolved, the resolved value should be carried forth
> to derived assets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
