atlas-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <>
Subject [jira] [Commented] (ATLAS-801) Atlas hooks would lose messages if Kafka is down for extended period of time
Date Thu, 02 Jun 2016 06:44:59 GMT


Hemanth Yamijala commented on ATLAS-801:

Starting some analysis notes.

First, I will look at what can be done to minimize the probability of this happening.
This is low-hanging fruit to improve the current situation.

* We need to ensure we configure multiple replicas for ATLAS_HOOK in Kafka. This is already
documented as operational guidance [here|]
under the *Notification Server* section. We could potentially automate this as part of
Atlas server setup. This was the topic of ATLAS-515.
* We could add some retries to the producer config of Kafka. Currently, we use the default
value, which is no retries.
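
As a rough illustration of the retry suggestion, the overrides might look like the sketch below. `retries` and `retry.backoff.ms` are standard Kafka producer settings; the helper names and the values 3 and 1000 ms are placeholders chosen for illustration, not tested recommendations, and how they would be wired into the hook's producer config is left open.

```python
def producer_retry_overrides(retries=3, backoff_ms=1000):
    """Return Kafka producer settings that enable retries.

    The default arguments are illustrative assumptions, not tuned values.
    """
    if retries < 0 or backoff_ms < 0:
        raise ValueError("retries and backoff_ms must be non-negative")
    return {
        "retries": retries,              # the current default means no retries
        "retry.backoff.ms": backoff_ms,  # wait between retry attempts
    }


def total_backoff_ms(overrides):
    """Time spent waiting between retries (backoff only, request timeouts ignored)."""
    return overrides["retries"] * overrides["retry.backoff.ms"]


overrides = producer_retry_overrides()
print(overrides, total_backoff_ms(overrides))
```

With these placeholder values the producer would wait about 3 seconds in backoff before giving up, so retries alone would only cover short blips, not an extended Kafka outage.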

I explored other configuration in Kafka producers and feel we are OK there. Specifically:

* *acks* - we use the default value of 1, which is acknowledgement from the leader alone.
This gives us the right balance between reliability and throughput.
* *batch.size* - we use the default value of 16 KB. Empirically, our message size seems to
be about 8 KB, so we probably send about 2 messages per batch. Again, there is not much to
gain by changing this, I guess.
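
The batch estimate above is simple integer arithmetic, sketched here. The function name is hypothetical, and the calculation ignores per-record and batch header overhead, so treat the result as an upper bound rather than an exact count.

```python
DEFAULT_BATCH_SIZE_BYTES = 16 * 1024    # Kafka producer default batch.size
OBSERVED_MESSAGE_SIZE_BYTES = 8 * 1024  # empirical hook message size noted above


def max_messages_per_batch(batch_size_bytes, avg_message_bytes):
    """Upper bound on whole messages that fit in one batch (overhead ignored)."""
    if avg_message_bytes <= 0:
        raise ValueError("avg_message_bytes must be positive")
    return batch_size_bytes // avg_message_bytes


print(max_messages_per_batch(DEFAULT_BATCH_SIZE_BYTES, OBSERVED_MESSAGE_SIZE_BYTES))
```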

> Atlas hooks would lose messages if Kafka is down for extended period of time
> ----------------------------------------------------------------------------
>                 Key: ATLAS-801
>                 URL:
>             Project: Atlas
>          Issue Type: Improvement
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
> All integration hooks in Atlas write messages to Kafka which are picked up by the Atlas
> server. If communication to Kafka breaks, then this results in loss of metadata messages.
> This can be mitigated to some extent using multiple replicas for Kafka topics (see ATLAS-515).
> This JIRA is to see if we can make this even more robust and have some form of store and forward
> mechanism for increased fault tolerance.

This message was sent by Atlassian JIRA
