atlas-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <>
Subject [jira] [Commented] (ATLAS-512) Decouple currently integrating components from availability of Atlas service for raising metadata events
Date Wed, 24 Feb 2016 11:47:18 GMT


Hemanth Yamijala commented on ATLAS-512:

There are a couple of ways of doing this. I am listing the approaches here with their
pros and cons, to get feedback:

*Option 1: Move model registration out of the hooks and into an independent tool / script*
* Mechanics:
** Every integrating component will provide an implementation that carries the serialized
model and a 'signature type'. A utility in Atlas will take these and call the create type API.
** This utility will encode the current logic of checking for a type before registering it.
The 'signature type' is used for this purpose.
** This utility should essentially be called before any entity creation happens from the hooks
- so really a setup step for Atlas. There are a couple of ways of doing this as well.
* Pros:
** The chief advantage is that model registration becomes a one-time activity done
in a controlled environment.
** Because we are using an API, we get feedback on the success or failure of the model
registration and have a chance to act on it.
** In the shorter term, this requires fewer changes to the Atlas system. In particular, it can
be set up to not touch the Atlas server side at all.
* Cons:
** Depending on the implementation, there is a possibility that this registration does not happen
(mostly due to human error), and entity creates/updates could fail due to that.
** Requires the Atlas server to be running for this to work. (In its defense, this is
a one-time activity.)
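The check-then-register flow of such a setup utility could look roughly like the sketch below. The `TypeRegistry` interface and all class names here are illustrative stand-ins, not actual Atlas APIs; a real tool would call the Atlas type-system API instead of the in-memory registry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Option 1 setup utility (names are
// illustrative, not real Atlas classes).
public class ModelRegistrationTool {

    /** Stand-in for the Atlas type lookup/creation API. */
    public interface TypeRegistry {
        boolean typeExists(String typeName);
        void createType(String typeName, String serializedModel);
    }

    /** In-memory registry used only to demonstrate the flow. */
    public static class InMemoryRegistry implements TypeRegistry {
        private final Map<String, String> types = new HashMap<>();
        public boolean typeExists(String typeName) {
            return types.containsKey(typeName);
        }
        public void createType(String typeName, String serializedModel) {
            // A real implementation would parse the model and register
            // every type it contains; here we record it against the
            // signature type only.
            types.put(typeName, serializedModel);
        }
    }

    /**
     * Registers the component's model only if its 'signature type' is not
     * already known, mirroring the check the hooks do today.
     * Returns true if this call performed the registration.
     */
    public static boolean registerIfAbsent(TypeRegistry registry,
                                           String signatureType,
                                           String serializedModel) {
        if (registry.typeExists(signatureType)) {
            return false; // model already registered; nothing to do
        }
        registry.createType(signatureType, serializedModel);
        return true;
    }
}
```

Run once per component as a setup step, a repeated invocation becomes a harmless no-op because the signature type is found on the second check.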

*Option 2: Write type creations through Kafka from the hooks, instead of the API*
* Mechanics:
** A hook will send type creations as notifications to the ATLAS_HOOK topic of Atlas.
** The hook will not check if the type is already registered. This implies that client-side hooks
like Storm would write this multiple times (unless they maintain state independently).
** The hook consumer in Atlas should be extended to process this new type of message (like
TYPE_CREATE, which is already there). The consumer should check if the type is already registered
and, if so, not act on the type. A log would be helpful to audit such an event. Otherwise, it calls
the create type API.
* Pros:
** This retains the spirit of hooks auto-registering models. Hence, the chance for errors
is minimized.
** We remove the dependency on the Atlas server for everything the hooks do.
* Cons:
** We cannot give feedback on type registration in this mechanism.
** Since client-side hooks (which don't maintain state) can write types multiple times, there
could be some load on Kafka (for things like the Hive CLI, this could have a non-trivial impact,
though certainly nothing Kafka cannot handle). It just feels wasteful to do so.
** This change is more intrusive, as the Atlas server will need some modifications for this
to work.
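The consumer-side handling described above amounts to idempotent processing of TYPE_CREATE messages. The sketch below illustrates that logic with stand-in message and registry types; none of these names are real Atlas or Kafka classes, and a real consumer would read the messages off the ATLAS_HOOK topic and call the create type API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch of the Option 2 consumer-side handling
// (illustrative stand-ins, not actual Atlas or Kafka classes).
public class TypeCreateConsumer {

    private static final Logger LOG =
            Logger.getLogger(TypeCreateConsumer.class.getName());

    /** Minimal stand-in for a TYPE_CREATE notification sent by a hook. */
    public static class TypeCreateMessage {
        public final String typeName;
        public final String typeDefinition;
        public TypeCreateMessage(String typeName, String typeDefinition) {
            this.typeName = typeName;
            this.typeDefinition = typeDefinition;
        }
    }

    /** In-memory stand-in for the Atlas type store. */
    private final Map<String, String> registeredTypes = new HashMap<>();

    /**
     * Processes a TYPE_CREATE message idempotently: if the type is already
     * registered, log and skip, so repeated writes from stateless
     * client-side hooks (e.g. the Hive CLI) are harmless.
     * Returns true if this message caused a registration.
     */
    public boolean process(TypeCreateMessage msg) {
        if (registeredTypes.containsKey(msg.typeName)) {
            // Audit log for repeated registrations, as suggested above.
            LOG.info("Type already registered, skipping: " + msg.typeName);
            return false;
        }
        // In Atlas this would be the call to the create type API.
        registeredTypes.put(msg.typeName, msg.typeDefinition);
        return true;
    }
}
```

Because the duplicate check lives in the single consumer rather than in each hook, the stateless clients stay simple; the cost is only the extra messages on the topic.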

Personally, moving to Kafka seems like it will work well, chiefly because it retains the current
spirit of auto-registration and does not introduce a setup step that could become error-prone
for users managing Atlas. The only concern is client components writing the type definitions
multiple times. If this turns out to be a real issue, we may need the clients to maintain state
in some persistent store.

Thoughts from others? 

> Decouple currently integrating components from availability of Atlas service for raising
metadata events
> ---------------------------------------------------------------------------------------------------------
>                 Key: ATLAS-512
>                 URL:
>             Project: Atlas
>          Issue Type: Sub-task
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
> The components that currently integrate with Atlas (Hive, Sqoop, Falcon, Storm) all communicate
their metadata events using Kafka as a messaging layer. This effectively decouples these components
from the Atlas server. 
> However, all of these components have some initialization that checks if their respective
models are registered with Atlas. For components that integrate on the server side, like HiveServer2
and Falcon, this initialization is a one-time check and hence is manageable. Others, like
Sqoop, Storm and the Hive CLI, are client-side components, so the initialization happens
for every run or session of these components. Invoking the initialization (and the one-time
check) every time like this effectively means that the Atlas server should always be available.
> This JIRA is to try and remove this dependency and thus truly decouple these components.

This message was sent by Atlassian JIRA
