atlas-dev mailing list archives

From "Barbara Eckman (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ATLAS-2708) AWS S3 data lake typedefs for Atlas
Date Fri, 15 Jun 2018 18:45:00 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514229#comment-16514229 ]

Barbara Eckman edited comment on ATLAS-2708 at 6/15/18 6:44 PM:
----------------------------------------------------------------

[~bosco] 

 bq. 

You have S3AccessPolicy in AWSS3Bucket as string. In S3, Bucket Policy is a list of Statement
Structure. If we are not using it now, we should probably remove it and add it when we need
to. Or we can create a placeholder S3BucketPolicy entity and associate that with AWSS3Bucket


was (Author: barbara):
[~bosco] 

 bq. You have S3AccessPolicy in AWSS3Bucket as string. In S3, Bucket Policy is a list of
Statement Structure. If we are not using it now, we should probably remove it and add it when
we need to. Or we can create a placeholder S3BucketPolicy entity and associate that with AWSS3Bucket

> AWS S3 data lake typedefs for Atlas
> -----------------------------------
>
>                 Key: ATLAS-2708
>                 URL: https://issues.apache.org/jira/browse/ATLAS-2708
>             Project: Atlas
>          Issue Type: New Feature
>          Components:  atlas-core
>            Reporter: Barbara Eckman
>            Assignee: Barbara Eckman
>            Priority: Critical
>         Attachments: 3010-aws_model.json, all_AWS_common_typedefs.json, all_datalake_typedefs.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It would be
nice to add typedefs for AWS data lake objects (buckets and pseudo-directories) and for the
lineage processes that move data from another source (e.g., a Kafka topic) into the data lake.
For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in an
S3 bucket.  For example, in the case of an object with key “myWork/Development/Projects1.xls”, “myWork/Development”
is the pseudo-directory.  It supports:
>  ** Array of avro schemas that are associated with the data in the pseudo-directory (based
on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of the data
in a bucket to a storageClass after a specific time interval, or expiration.  For example,
transition to GLACIER after 60 days, or expire (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before rule is executed  
>  ** storageClass to which the data is transitioned (null if ruleType is expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated with an
AWS object.
>  ** tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is monitored
by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, “TotalRequestLatency”,
“BucketSizeBytes”
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or limit the
monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDir objects that are associated with objects stored in the bucket
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject, PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSCloudWatchMetrics that are associated with the bucket or its tags or
prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one dataset to
another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., a Kafka topic and an S3 pseudo-directory.
>  
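As an illustration only, the description above might map onto Atlas typedef JSON roughly as follows. This is a minimal sketch, not the content of the attached JSON files: the type names AWSS3BucketLifeCycleRule, AWSS3Bucket and AWSTag and the attribute names ruleType, storageClass, IsEncrypted, encryptionType and S3AccessPolicy come from the description, while the attribute names days, region, createTime, lifeCycleRules and tags, the chosen superTypes, and the helper functions attr and lifecycle_rule_attrs are assumptions made up for this example.

```python
import json


def attr(name, type_name, optional=True, cardinality="SINGLE"):
    """Build one attribute definition in the Atlas v2 typedef JSON layout."""
    return {
        "name": name,
        "typeName": type_name,
        "cardinality": cardinality,
        "isOptional": optional,
        "isUnique": False,
        "isIndexable": False,
    }


# Assumption: the rule is modelled as an entity extending Referenceable.
LIFECYCLE_RULE_DEF = {
    "name": "AWSS3BucketLifeCycleRule",
    "superTypes": ["Referenceable"],
    "typeVersion": "1.0",
    "attributeDefs": [
        attr("ruleType", "string", optional=False),
        attr("days", "int"),             # hypothetical name for the time interval
        attr("storageClass", "string"),  # null when ruleType is expiration
    ],
}

# Assumption: the bucket extends DataSet so it can take part in lineage.
BUCKET_DEF = {
    "name": "AWSS3Bucket",
    "superTypes": ["DataSet"],
    "typeVersion": "1.0",
    "attributeDefs": [
        attr("region", "string"),        # hypothetical attribute name
        attr("IsEncrypted", "boolean"),
        attr("encryptionType", "string"),
        attr("S3AccessPolicy", "string"),
        attr("createTime", "date"),      # hypothetical attribute name
        attr("lifeCycleRules", "array<AWSS3BucketLifeCycleRule>", cardinality="SET"),
        attr("tags", "array<AWSTag>", cardinality="SET"),
    ],
}

TYPEDEFS = {
    "enumDefs": [],
    "structDefs": [],
    "classificationDefs": [],
    "entityDefs": [LIFECYCLE_RULE_DEF, BUCKET_DEF],
}


def lifecycle_rule_attrs(rule_type, days, storage_class=None):
    """Return attribute values for one rule, enforcing the null-storageClass rule."""
    if rule_type not in ("transition", "expiration"):
        raise ValueError("ruleType must be 'transition' or 'expiration'")
    if rule_type == "expiration" and storage_class is not None:
        raise ValueError("storageClass must be null for an expiration rule")
    return {"ruleType": rule_type, "days": days, "storageClass": storage_class}


if __name__ == "__main__":
    print(json.dumps(TYPEDEFS, indent=2))
    # The two examples from the description: GLACIER after 60 days, expire after 90.
    print(lifecycle_rule_attrs("transition", 60, "GLACIER"))
    print(lifecycle_rule_attrs("expiration", 90))
```

Keeping S3AccessPolicy as a plain string matches the description; the review comment quoted at the top of this message suggests either dropping it or replacing it with a separate S3BucketPolicy entity, which in this sketch would mean removing that attribute or adding another entry to entityDefs.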



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
