atlas-dev mailing list archives

From "ernie ostic (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ATLAS-1765) Self-Service Catalog Search and Data Preview
Date Tue, 09 May 2017 18:54:04 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000621#comment-16000621
] 

ernie ostic edited comment on ATLAS-1765 at 5/9/17 6:53 PM:
------------------------------------------------------------

Initial thoughts on search and query use types, based on various use cases that are often
seen with InfoSphere Information Governance Catalog (IGC).

Search/Queries against the repository.   

Here are various patterns we see frequently with IGC.  The categories below are loose, and
correspond to the user's objective, their level of experience with the tool, and whether
they are in the role of "governance team" vs "regular enterprise user".  Breaking them up
here just to aid with further discussion.  Each of these comes up in three "access" modes,
fairly equally: (1) online, using the GUI (2) batch, via command line, for extraction to a
.csv or other export file structure (3) via REST API.  Each also typically allows a list of
"properties" to be simply selected along with said "entity" (name, description, internal
identifier, date created, etc.)

This is more a listing of "syntax examples" than pure business use cases, but each should
be easily backed into a persona or business use case as necessary.

Note:  It is expected that most of the search use cases below are (nearly always) restricted
by some definition of "scope" --- a "department", an "owner"....a particular database or region,
a specific schema, a "category" of Terms, etc.
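
To make the "scope" and "selected properties" ideas concrete, here is a minimal sketch over
an in-memory list. Everything in it is an assumption for illustration: the entity fields
(name, type, database, owner, created) and the helper names search and to_csv are
hypothetical, and it does not use any Atlas or IGC API. It only shows a scoped search that
returns chosen properties, plus the batch-style .csv extraction.

```python
import csv
import io

# Hypothetical in-memory entities; the field names are illustrative
# assumptions, not the Atlas or IGC data model.
ENTITIES = [
    {"name": "CUST_ID", "type": "column", "database": "SALES",
     "owner": "alice", "created": "2017-01-10"},
    {"name": "ORDER_TS", "type": "column", "database": "SALES",
     "owner": "bob", "created": "2017-02-02"},
    {"name": "RISK_SCORE", "type": "column", "database": "RISK",
     "owner": "alice", "created": "2017-03-15"},
]


def search(entities, scope, properties):
    """Return the selected properties of each entity matching every scope key/value."""
    matches = [e for e in entities
               if all(e.get(key) == value for key, value in scope.items())]
    return [{p: e.get(p) for p in properties} for e in matches]


def to_csv(rows, properties):
    """Render result rows as CSV text, as a batch extraction might."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=properties, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Scope the search to one database and select two properties.
rows = search(ENTITIES, scope={"database": "SALES"}, properties=["name", "owner"])
print(to_csv(rows, ["name", "owner"]))
```

The same search function could sit behind a GUI, a command line or a REST endpoint; only the
rendering of the rows would differ between the three access modes.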



Governance Queries

These are queries that are typically done by the "governance team" or by those with management
responsibility for the governance project -- and are often the chassis for a governance
dashboard or other reporting system that measures progress towards governance objectives.
A site may have a target that "all entities in the data lake" be fully governed by "a future
target date".  These kinds of queries help measure progress and move governance teams closer
to that goal.

The "list all" kinds of queries are also often the source of a "count" so that results can
be graphed on the dashboard.  Searches like (7) below are typically "validation" queries,
sometimes issued in real-time to enforce completeness in the repository, or perhaps nightly
to reject content or force "re-review" for things that are incomplete.    The same "list all"
kinds of queries often feed resulting answer sets into a batch update tool that might perform
assignments or property updates in bulk, perhaps in an offline window.

Ultimately, measuring progress and completeness for entity definitions and their relationships
is about maximizing the value of the repository --- from providing more-easily-understood
names for technical entities, through establishing responsibility for enterprise entities
and their data quality.

1.  List out all entities (usually columns) that have not yet been assigned a Term
2.  List out all entities (usually columns) that have not yet been assigned a Steward
3.  List out all entities (any kind) that are being managed by Steward <steward>
4.  List out all entities that have been modified since <date>
5.  List out all entities that are in a particular state (such as "Draft", where "workflow"
in IGC has been implemented)
6.  List out all entities based on their time remaining in a particular state ("all terms
in draft for more than <n> days")
7.  List out all entities (usually Terms) where property <property> is null   [similar
to where property <property> is <value> but called out here specifically because
it is a common "management level" governance query]
8.  List out all entities (usually Terms) where relationship <relationship> is null
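
A few of the governance patterns above (1, 2, 4, 6 and 7) can be sketched as simple filters.
This is only an illustration under stated assumptions: the entities are a hypothetical
in-memory list, the field names (term, steward, modified, state, state_since) are invented
for the example, and none of this reflects an Atlas or IGC API.

```python
from datetime import date

# Hypothetical entities; "term", "steward", "state" and "state_since" are
# illustrative field names, not an Atlas or IGC schema.
ENTITIES = [
    {"name": "CUST_ID", "type": "column", "term": "Customer Identifier",
     "steward": "alice", "modified": date(2017, 5, 1),
     "state": "Published", "state_since": date(2017, 4, 1)},
    {"name": "ORDER_TS", "type": "column", "term": None,
     "steward": None, "modified": date(2017, 3, 20),
     "state": "Published", "state_since": date(2017, 3, 20)},
    {"name": "Churn Rate", "type": "term", "term": None,
     "steward": "bob", "modified": date(2017, 4, 2),
     "state": "Draft", "state_since": date(2017, 4, 2)},
]


def where_null(entities, prop, entity_type=None):
    """Patterns 1, 2 and 7: entities whose property is null, optionally by type."""
    return [e for e in entities
            if e.get(prop) is None
            and (entity_type is None or e["type"] == entity_type)]


def modified_since(entities, since):
    """Pattern 4: entities modified on or after the given date."""
    return [e for e in entities if e["modified"] >= since]


def in_state_longer_than(entities, state, days, today):
    """Pattern 6: entities that have been in a state for more than `days` days."""
    return [e for e in entities
            if e["state"] == state and (today - e["state_since"]).days > days]


unassigned_term = where_null(ENTITIES, "term", entity_type="column")
stale_drafts = in_state_longer_than(ENTITIES, "Draft", 30, today=date(2017, 5, 9))

# A "list all" result doubles as the "count" that feeds a dashboard.
print(len(unassigned_term), [e["name"] for e in stale_drafts])
```

The lists returned here could equally be handed to a batch update tool for bulk assignment.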
 



Research Queries

These are often issued by an individual user or data researcher, and sometimes by developers.
They often exploit a "lineage" relationship, usually with a particular goal in mind for "that
entity" or "that steward", such as finding the lineage for "one particular" report or
process.

9.  List out all entities <a specific type> where property <property> is <value>
[string, between, equal_to, etc.]
10.  List out all entities <relationship, such as "owned by"> <steward> 
11.  Show all entities "written by" <name of process or other data-mover kind of asset>.
  For Atlas in its current form, this might be the name of a SQOOP process
12.  For <entity> (type and name), show immediate upstream entity (and properties of
that entity...last time it ran, status code, etc.)
13.  Show a Term and all of its "history" (particularly important for comments by reviewers
over time)
14.  Various complex "set" retrievals, qualified by existence of a particular instance...such
as "dump out all database/table/column details for every database that contains a schema called
<schemaName>" [at times, the qualifier is just "if it exists" as a child but still dump
all children....possibly requiring multiple requests or additional filtering against the
final returned list]
15.  List all transformations (and their sources/targets/processes) where nullability was
changed for a column from null to "not null".   [that is a specific example, but could exist
for datatype changes, column name changes, specific mappings or functions, etc.]
16.  Requests that exploit multiple relationships for qualification...such as "list all tables
that have a Steward...but only for Stewards who also manage/own assets in the Risk Collection"
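
Patterns 11, 12 and 16 can be sketched as follows. This is a rough illustration only: the
lineage edges, process properties, steward assignments and the "Risk" collection are all
hypothetical sample data, and the function names are invented for the example rather than
taken from Atlas or IGC.

```python
# Hypothetical lineage edges as (source, process, target) triples.
LINEAGE = [
    ("crm.customers", "sqoop_import_customers", "lake.customers"),
    ("lake.customers", "build_customer_report", "reports.customer_summary"),
]

# Hypothetical properties recorded for each process.
PROCESS_INFO = {
    "sqoop_import_customers": {"last_run": "2017-05-08", "status": 0},
    "build_customer_report": {"last_run": "2017-05-09", "status": 0},
}

# Hypothetical steward assignments and collection membership.
STEWARDS = {"lake.customers": "alice", "lake.orders": "bob", "risk.exposures": "alice"}
COLLECTIONS = {"Risk": {"risk.exposures"}}


def written_by(process):
    """Pattern 11: every entity written by the named process."""
    return [target for _, proc, target in LINEAGE if proc == process]


def immediate_upstream(entity):
    """Pattern 12: immediate upstream entities, with the properties of the process."""
    return [(source, proc, PROCESS_INFO.get(proc, {}))
            for source, proc, target in LINEAGE if target == entity]


def tables_for_stewards_in(collection):
    """Pattern 16: stewarded tables, limited to stewards who also own assets in the collection."""
    qualifying = {STEWARDS[asset] for asset in COLLECTIONS[collection] if asset in STEWARDS}
    return sorted(table for table, steward in STEWARDS.items() if steward in qualifying)


print(written_by("sqoop_import_customers"))
print(immediate_upstream("reports.customer_summary"))
print(tables_for_stewards_in("Risk"))
```

Pattern 16 needs two passes: first resolve which stewards qualify through the collection,
then use that set to filter the tables. That is the kind of request that may require multiple
calls or post-filtering when the query language cannot express it directly.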









> Self-Service Catalog Search and Data Preview
> --------------------------------------------
>
>                 Key: ATLAS-1765
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1765
>             Project: Atlas
>          Issue Type: New Feature
>          Components: atlas-webui
>    Affects Versions: 0.9-incubating
>            Reporter: Mandy Chessell
>            Assignee: Mandy Chessell
>              Labels: Self-Service-UIs, VirtualDataConnector
>
> This JIRA covers the development of the catalog search and preview of data for data scientists
> and business users.  It supports the search of the Atlas metadata repository, display of search
> results, additional filtering and drill down into details of the data sources, including a
> data preview option if the end user has access permission.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
