atlas-dev mailing list archives

From "Barbara Eckman (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ATLAS-2708) AWS S3 data lake typedefs for Atlas
Date Wed, 19 Sep 2018 19:17:00 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-2708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621077#comment-16621077 ]

Barbara Eckman edited comment on ATLAS-2708 at 9/19/18 7:16 PM:
----------------------------------------------------------------

[~toopt4]  It doesn't happen automatically through a listener like the Hive hook.  We do it via
Lambda functions, triggered, say, on the creation of an S3 object, pseudo-directory, or
bucket.  We package up the info into AtlasEntities and then publish to the ATLAS_HOOK Kafka
topic.
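
Roughly, the pattern looks like the sketch below. It is a minimal illustration only: the type
name, attribute names, broker address, and the exact shape of the ATLAS_HOOK notification
envelope are assumptions for the example, not our production code.

    import json
    import time

    from kafka import KafkaProducer   # kafka-python client, assumed to be bundled with the Lambda

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],        # placeholder broker address
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def lambda_handler(event, context):
        # Triggered by an S3 ObjectCreated event; builds an AtlasEntity and
        # publishes a create notification to the ATLAS_HOOK Kafka topic.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            pseudo_dir = key.rsplit("/", 1)[0] if "/" in key else ""

            entity = {
                "typeName": "AWSS3PseudoDir",      # typedef proposed in this ticket
                "attributes": {
                    "name": pseudo_dir,
                    "qualifiedName": "s3://{}/{}".format(bucket, pseudo_dir),
                    "dataType": "avro",            # assumed attribute
                },
            }

            # Assumed shape of an Atlas V2 hook notification; check the Atlas
            # notification module for the exact envelope your version expects.
            notification = {
                "version": {"version": "1.0.0"},
                "msgCreationTime": int(time.time() * 1000),
                "message": {
                    "type": "ENTITY_CREATE_V2",
                    "user": "datalake-lambda",
                    "entities": {"entities": [entity]},
                },
            }
            producer.send("ATLAS_HOOK", notification)

        producer.flush()

The point is just that the Lambda builds the entity JSON itself and drops a notification on
ATLAS_HOOK; Atlas then consumes it the same way it consumes Hive hook messages.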



> AWS S3 data lake typedefs for Atlas
> -----------------------------------
>
>                 Key: ATLAS-2708
>                 URL: https://issues.apache.org/jira/browse/ATLAS-2708
>             Project: Atlas
>          Issue Type: New Feature
>          Components:  atlas-core
>            Reporter: Barbara Eckman
>            Assignee: Barbara Eckman
>            Priority: Critical
>             Fix For: 1.1.0, 2.0.0
>
>         Attachments: 3010-aws_model.json, ATLAS-2708-2.patch, ATLAS-2708.patch, all_AWS_common_typedefs.json, all_AWS_common_typedefs_v2.json, all_datalake_typedefs.json, all_datalake_typedefs_v2.json
>
>
> Currently the base types in Atlas do not include AWS data lake objects. It would be nice to add typedefs for AWS data lake objects (buckets and pseudo-directories) and lineage processes that move the data from another source (e.g., a Kafka topic) to the data lake.  For example:
>  * AWSS3PseudoDir type represents the pseudo-directory “prefix” of objects in an S3 bucket.  For example, in the case of an object with key “myWork/Development/Projects1.xls”, “myWork/Development” is the pseudo-directory.  It supports:
>  ** Array of Avro schemas that are associated with the data in the pseudo-directory (based on Avro schema extensions outlined in ATLAS-2694)
>  ** what type of data it contains, e.g., avro, json, unstructured
>  ** time of creation
>  * AWSS3BucketLifeCycleRule type represents a rule specifying a transition of the data in a bucket to a storageClass after a specific time interval, or expiration.  For example, transition to GLACIER after 60 days, or expire (i.e. be deleted) after 90 days:
>  ** ruleType (e.g., transition or expiration)
>  ** time interval in days before the rule is executed
>  ** storageClass to which the data is transitioned (null if ruleType is expiration)
>  * AWSTag type represents a tag-value pair created by the user and associated with an AWS object.
>  **  tag
>  ** value
>  * AWSCloudWatchMetric type represents a storage or request metric that is monitored by AWS CloudWatch and can be configured for a bucket
>  ** metricName, for example, “AllRequests”, “GetRequests”, “TotalRequestLatency”, “BucketSizeBytes”
>  ** scope: null if entire bucket; otherwise, the prefixes/tags that filter or limit the monitoring of the metric.
>  * AWSS3Bucket type represents a bucket in an S3 instance.  It supports:
>  ** Array of AWSS3PseudoDirectories that are associated with objects stored in the bucket 
>  ** AWS region
>  ** IsEncrypted (boolean) 
>  ** encryptionType, e.g., AES-256
>  ** S3AccessPolicy, a JSON object expressing access policies, e.g., GetObject, PutObject
>  ** time of creation
>  ** Array of AWSS3BucketLifeCycleRules that are associated with the bucket 
>  ** Array of AWSCloudWatchMetrics that are associated with the bucket or its tags or prefixes
>  ** Array of AWSTags that are associated with the bucket
>  * Generic dataset2Dataset process to represent movement of data from one dataset to another.  It supports:
>  ** array of transforms performed by the process 
>  ** map of tag/value pairs representing configurationParameters of the process
>  ** inputs and outputs are arrays of dataset objects, e.g., kafka topic and S3 pseudo-directory.
>  
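
The attached *_typedefs.json files are the authoritative definitions; purely as a sketch of the
shape being proposed, a stripped-down AWSS3PseudoDir entityDef registered through Atlas's
/api/atlas/v2/types/typedefs endpoint might look roughly like the following (attribute names and
optionality are guesses from the description above, not copied from the attachments):

    import json

    import requests

    # Rough illustration of the proposed AWSS3PseudoDir entityDef; the attached
    # *_typedefs.json files are authoritative, and the attribute names below are
    # guesses taken from the description, not copied from the attachments.
    pseudo_dir_typedef = {
        "entityDefs": [
            {
                "category": "ENTITY",
                "name": "AWSS3PseudoDir",
                "superTypes": ["DataSet"],            # inherits name, qualifiedName, etc.
                "attributeDefs": [
                    {"name": "dataType", "typeName": "string",
                     "isOptional": True, "cardinality": "SINGLE"},
                    {"name": "createTime", "typeName": "date",
                     "isOptional": True, "cardinality": "SINGLE"},
                    {"name": "avroSchemas",
                     "typeName": "array<avro_schema>", # assumed name of the ATLAS-2694 Avro type
                     "isOptional": True, "cardinality": "SET"},
                ],
            }
        ]
    }

    # Register the typedef through the standard v2 types API (host and credentials are placeholders).
    resp = requests.post(
        "http://atlas-host:21000/api/atlas/v2/types/typedefs",
        auth=("admin", "admin"),
        headers={"Content-Type": "application/json"},
        data=json.dumps(pseudo_dir_typedef),
    )
    resp.raise_for_status()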



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
