atlas-dev mailing list archives

From Rémy SAISSY (JIRA) <j...@apache.org>
Subject [jira] [Commented] (ATLAS-164) DFS addon for Atlas
Date Thu, 17 Sep 2015 10:01:45 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802696#comment-14802696 ]

Rémy SAISSY commented on ATLAS-164:
-----------------------------------

Hi Venkatesh,
thanks.

* DfsDataModel 
I agree; at first I considered three classes: file, dir and symlink.
I reverted to a 1:1 mapping because handling symlinks required two different properties
depending on whether the target was a file or a directory. I thought it would not be an
issue to map inodes this way, since the query language makes it possible to show files,
dirs and symlinks separately.
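For illustration only: assuming the flat inode class carried a hypothetical fileType
attribute (the names are mine, not from the patch), the DSL could still separate the
three kinds:

    hdfs_inode where fileType = "FILE"
    hdfs_inode where fileType = "SYMLINK"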

A question: can we model class inheritance? If so, I could have the dir, file and symlink
classes inherit from inode and provide a clean symlink_target attribute typed with the
parent class.
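To make the idea concrete, here is a rough sketch against the 0.6-era typesystem helpers;
the type and attribute names are hypothetical and the exact TypesUtil signatures may
differ in the release:

    import com.google.common.collect.ImmutableList;
    import org.apache.atlas.typesystem.types.AttributeDefinition;
    import org.apache.atlas.typesystem.types.ClassType;
    import org.apache.atlas.typesystem.types.DataTypes;
    import org.apache.atlas.typesystem.types.HierarchicalTypeDefinition;
    import org.apache.atlas.typesystem.types.Multiplicity;
    import org.apache.atlas.typesystem.types.utils.TypesUtil;

    public class DfsDataModelSketch {
        public static void main(String[] args) {
            // Common parent: every inode carries a path and an owner.
            HierarchicalTypeDefinition<ClassType> inode = TypesUtil.createClassTypeDef(
                    "hdfs_inode", ImmutableList.<String>of(),
                    TypesUtil.createRequiredAttrDef("path", DataTypes.STRING_TYPE),
                    TypesUtil.createRequiredAttrDef("owner", DataTypes.STRING_TYPE));

            // file and dir only add what is specific to them.
            HierarchicalTypeDefinition<ClassType> file = TypesUtil.createClassTypeDef(
                    "hdfs_file", ImmutableList.of("hdfs_inode"),
                    TypesUtil.createRequiredAttrDef("size", DataTypes.LONG_TYPE));
            HierarchicalTypeDefinition<ClassType> dir = TypesUtil.createClassTypeDef(
                    "hdfs_dir", ImmutableList.of("hdfs_inode"));

            // A single symlink_target attribute typed by the parent class
            // covers both file and directory targets.
            HierarchicalTypeDefinition<ClassType> symlink = TypesUtil.createClassTypeDef(
                    "hdfs_symlink", ImmutableList.of("hdfs_inode"),
                    new AttributeDefinition("symlink_target", "hdfs_inode",
                            Multiplicity.REQUIRED, false, null));
        }
    }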

* Import

Thanks for the pointer, I will check how Falcon does it. Apart from the technical
standpoint, I will also read up a bit on regulatory needs: implementing this as data sets
reduces the granularity, so it might not be precise enough for some regulatory
requirements.
Also, I see two approaches to data sets:
 - one that requires data sets to be defined manually in the webapp, so the bridge logs
only those data sets (and ignores the other events on HDFS)
 - one that considers a data set to be a non-recursive directory; any action on a file
logs an event for its directory

The latter has the advantage of processing all actions in HDFS and of being easier to
configure and use for the end user, so I would prefer it; a rough sketch of the mapping
follows.
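A minimal sketch of that rule (the helper name is mine): every event path collapses to
the non-recursive directory that stands for the data set:

    import org.apache.hadoop.fs.Path;

    // Hypothetical helper: every HDFS event is attributed to exactly one
    // data set, namely the non-recursive directory holding the touched inode.
    public final class DataSetResolver {
        private DataSetResolver() {}

        public static Path dataSetFor(Path eventPath, boolean isDirectory) {
            // A directory is its own data set; a file belongs to its parent.
            return isDirectory ? eventPath : eventPath.getParent();
        }
    }

With this rule, events on two illustrative paths such as /data/sales/2015/09/part-0001
and /data/sales/2015/09/part-0002 both resolve to /data/sales/2015/09, i.e. to a single
node in Atlas.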

* Lineage

This is because I haven't yet fully understood how lineage should be handled by Atlas addons.
 - should I also keep track of who executed what action on a data set / file / dir / symlink?
I haven't seen support for it in the hive-bridge, but I guess it is required to comply with
regulatory needs.

Speaking about the set of files consumed by a Pig, MR, Spark or whatever job, since HDFS
sees actions as they happen, I see two approaches:
 - HDFS level: consider a data set to be a non-recursive directory. That would be a lot
of events, but all for the same node in Atlas (the source / target directory of the job)
 - processing framework level: hook an addon into each framework that logs events into
Atlas on the same data as the HDFS bridge does.

--> I prefer doing it at the HDFS level only. It is more generic.
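For reference, one way to see all HDFS actions at that level is the namenode inotify
stream; I can't tell whether the attached patch uses it, so this is only a sketch,
assuming a Hadoop 2.7-style API where take() returns an EventBatch:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.DFSInotifyEventInputStream;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;
    import org.apache.hadoop.hdfs.inotify.Event;
    import org.apache.hadoop.hdfs.inotify.EventBatch;

    public class HdfsEventLoop {
        public static void main(String[] args) throws Exception {
            HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"),
                    new Configuration());
            DFSInotifyEventInputStream stream = admin.getInotifyEventStream();
            while (true) {
                EventBatch batch = stream.take(); // blocks until events arrive
                for (Event event : batch.getEvents()) {
                    if (event.getEventType() == Event.EventType.CREATE) {
                        Event.CreateEvent create = (Event.CreateEvent) event;
                        // Collapse create.getPath() to its data-set directory
                        // and create/update the matching Atlas entity here.
                    }
                }
            }
        }
    }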

* Unit Tests

I made a typo: I meant the integration test.


> DFS addon for Atlas
> -------------------
>
>                 Key: ATLAS-164
>                 URL: https://issues.apache.org/jira/browse/ATLAS-164
>             Project: Atlas
>          Issue Type: New Feature
>    Affects Versions: 0.6-incubating
>            Reporter: Rémy SAISSY
>            Assignee: Rémy SAISSY
>         Attachments: ATLAS-164.15092015.patch, ATLAS-164.15092015.patch
>
>
> Hi,
> I have written an addon for sending DFS metadata into Atlas.
> The patch is attached.
> However, I am having a hard time getting the unit tests to work properly, so some advice
> would be welcome.
> Thanks.



