atlas-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ATLAS-801) Atlas hooks would lose messages if Kafka is down for extended period of time
Date Fri, 03 Jun 2016 07:03:59 GMT

    [ https://issues.apache.org/jira/browse/ATLAS-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313763#comment-15313763
] 

Hemanth Yamijala commented on ATLAS-801:
----------------------------------------

Thanks for sharing your thoughts, Vimal. There are different sorts of distributed failure
scenarios. If Kafka is indeed down, and no application could reach it - then everyone would
potentially end up recording messages to a distributed store. A linear ordering there with
timestamps will help. However, there could also be network partitions such that only one of
the hook instances is unable to reach Kafka. In that case, other instances which could be
activated later will not know about the one failed hook instance. This could result in out
of order messages. 

Given the complexity of distributed coordination, I don't think we should invest in solving
the problem of enforcing an order on the messages. Instead, I think we should invest in solving
the problem of dealing with out of order messages.

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>
>                 Key: ATLAS-801
>                 URL: https://issues.apache.org/jira/browse/ATLAS-801
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>
> All integration hooks in Atlas write messages to Kafka which are picked up by the Atlas
server. If communication to Kafka breaks, then this results in loss of metadata messages.
This can be mitigated to some extent using multiple replicas for Kafka topics (see ATLAS-515).
This JIRA is to see if we can make this even more robust and have some form of store and forward
mechanism for increased fault tolerance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message