atlas-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <>
Subject [jira] [Commented] (ATLAS-511) Ability to run multiple instances of Atlas Server with automatic failover to one active server
Date Wed, 09 Mar 2016 03:10:40 GMT


Hemanth Yamijala commented on ATLAS-511:

[~vmadugun] / [~cassiodossantos], from my (possibly incomplete) understanding of the core
backend, a few points stand out regarding the TypeSystem cache:
* Currently Atlas relies on it *completely* for all reads. As Venkat mentioned in his comments,
translating a DSL query into a Gremlin query relies on this information. Since the volume of reads
is expected to be high, I intuitively feel that the cache is of value. Possibly not in the
aggressive manner in which it is currently relied on, but at least as a significant performance
optimization. Turning the cache off completely seems a bit too extreme to me.
If we are modeling this, we could possibly model it as a strategy, of which no caching is one
alternative and read with fall-through is another. I am convinced by Cassio's point
that letting the cached types grow unbounded (the current implementation) feels a little too extreme
as well.
* You mention that types will be relatively unchanging. I am assuming that you are saying
this based on the usage pattern you have seen (or expect to see). I have a question
on this: since trait definitions are also types, are also cached in the TypeSystem,
and all trait lookups happen from there, how frequently are these CRUD'ed in your
case? Of course, this can be solved by a programmatic refresh (or a dirty read mechanism),
as you both have suggested.
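
To make the "strategy" idea from the first point concrete, here is a minimal Java sketch. All names in it (TypeCacheStrategy, NoCacheStrategy, ReadThroughStrategy, the store function) are hypothetical and are not existing Atlas classes; the backing store is assumed to be any lookup that returns null for an unknown type name. It only illustrates the two alternatives mentioned above, plus a size bound and an explicit invalidate hook that a programmatic refresh could call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical strategy interface; not part of the Atlas code base. */
interface TypeCacheStrategy<T> {
    Optional<T> get(String typeName);
    void invalidate(String typeName);
}

/** Alternative 1: no caching, every read goes to the backing store. */
final class NoCacheStrategy<T> implements TypeCacheStrategy<T> {
    private final Function<String, T> store; // assumed to return null when the type is unknown

    NoCacheStrategy(Function<String, T> store) {
        this.store = store;
    }

    @Override
    public Optional<T> get(String typeName) {
        return Optional.ofNullable(store.apply(typeName));
    }

    @Override
    public void invalidate(String typeName) {
        // Nothing is cached, so there is nothing to drop.
    }
}

/** Alternative 2: bounded LRU cache; a miss falls through to the backing store. */
final class ReadThroughStrategy<T> implements TypeCacheStrategy<T> {
    private final Function<String, T> store;
    private final Map<String, T> cache;

    ReadThroughStrategy(Function<String, T> store, final int maxEntries) {
        this.store = store;
        // An access-ordered LinkedHashMap evicts the least recently used entry
        // once the bound is exceeded, so the cache cannot grow without limit.
        this.cache = new LinkedHashMap<String, T>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, T> eldest) {
                return size() > maxEntries;
            }
        };
    }

    @Override
    public synchronized Optional<T> get(String typeName) {
        T cached = cache.get(typeName);
        if (cached != null) {
            return Optional.of(cached);
        }
        T loaded = store.apply(typeName); // fall through on a miss
        if (loaded != null) {
            cache.put(typeName, loaded);
        }
        return Optional.ofNullable(loaded);
    }

    @Override
    public synchronized void invalidate(String typeName) {
        cache.remove(typeName); // the next read reloads from the store
    }
}
```

In this sketch a cached entry stays stale until invalidate is called, which is where a refresh or dirty-read mechanism would hook in. Which strategy is worth using should still come from the measurements discussed below.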

I am happy that we are aligned on basing any of these decisions on concrete measurements.
We have been working to set up some very basic test suites that will help us get started with
performance measurement. I will open JIRAs to spell out more details on this.
Venkat, thanks for your offer to help with this task. At this stage, since you have a specific
interest in improving the cache behavior, it may be good if you can spend some energy on
this and see what you find. Please feel free to open JIRAs and propose your approach / solutions.

Needless to say, there are folks more experienced in this area than I am. I am hoping they
will chime in with thoughts (in particular, if we're going down the wrong track).

> Ability to run multiple instances of Atlas Server with automatic failover to one active server
> ----------------------------------------------------------------------------------------------
>                 Key: ATLAS-511
>                 URL:
>             Project: Atlas
>          Issue Type: Sub-task
>            Reporter: Hemanth Yamijala
>            Assignee: Hemanth Yamijala
>         Attachments: HADesign.pdf
> One of the most important components that only supports active-standby mode currently
is the Atlas server which hosts the API / UI for Atlas. As described in the [HA Documentation|],
we currently are limited to running only one instance of the Atlas server behind a proxy service.
If the running instance goes down, a manual process is required to bring up another instance.
> In this JIRA, we propose to have an ability to run multiple Atlas server instances. However,
as a first step, only one of them will be actively processing requests. To have a consistent
terminology, let us call that server the *master*. Any requests sent to the other servers
will be redirected to the master.
> When the master suffers a partition, one of the other servers must automatically become
the master and start processing requests. What this mode brings us over the current system
is the ability to automatically fail over the Atlas server instance without any manual intervention.
Note that this can arguably be called an [active/passive setup|]
> ATLAS-488 was raised to support multiple active Atlas server instances. While that would
be ideal, we have to learn more about the underlying system behavior before we can get there,
and hopefully we can take smaller steps to improve the system systematically. The method proposed
here is similar to what is adopted in many other Hadoop components including HDFS NameNode,
HBase HMaster etc.
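
The redirect behavior described in the issue can be sketched roughly as follows. This is an illustration only, not the design in HADesign.pdf: MasterRouter, RoutingDecision, and the currentMaster lookup are hypothetical names, the host addresses in the usage note are made up, and how the current master is elected or discovered is deliberately left abstract behind a Supplier.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical result of deciding what a server should do with a request. */
final class RoutingDecision {
    enum Kind { SERVE_LOCALLY, REDIRECT, NO_MASTER }

    final Kind kind;
    final String location; // redirect target, or null when not redirecting

    RoutingDecision(Kind kind, String location) {
        this.kind = kind;
        this.location = location;
    }
}

/** Hypothetical router: only the master serves, every other instance redirects. */
final class MasterRouter {
    private final String selfAddress;
    // Assumption: some election or discovery mechanism reports the current master, if any.
    private final Supplier<Optional<String>> currentMaster;

    MasterRouter(String selfAddress, Supplier<Optional<String>> currentMaster) {
        this.selfAddress = selfAddress;
        this.currentMaster = currentMaster;
    }

    RoutingDecision route(String requestPath) {
        Optional<String> master = currentMaster.get();
        if (!master.isPresent()) {
            // No master is known, for example while a failover is in progress.
            return new RoutingDecision(RoutingDecision.Kind.NO_MASTER, null);
        }
        if (master.get().equals(selfAddress)) {
            return new RoutingDecision(RoutingDecision.Kind.SERVE_LOCALLY, null);
        }
        // Any other instance sends the client on to the master.
        return new RoutingDecision(RoutingDecision.Kind.REDIRECT, master.get() + requestPath);
    }
}
```

For example, with two instances at the made-up addresses http://host-a:21000 and http://host-b:21000 and host-a as master, host-b answers a request for /some/path with a redirect to http://host-a:21000/some/path. Once the lookup reports host-b as master, host-b starts serving locally and host-a redirects instead.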
