airavata-dev mailing list archives

From DImuthu Upeksha <>
Subject Re: Async Agents to handle long running jobs
Date Tue, 05 Dec 2017 14:31:56 GMT
Hi Suresh,

Thanks for the reply. Please find my responses inline to your questions.

On Tue, Dec 5, 2017 at 7:58 AM, Suresh Marru <> wrote:

> Hi Dimuthu,
> This is a neat design. A few questions to understand your implementation:
> * Since the Async Command Monitor needs to be a persistent, highly available
> service, is it advisable to run it as a Helix Participant, or should we run
> it outside of the Helix system, like an API gateway?

This design does not assume the Async Command Monitor is a persistent
service. It reads the status of the Agent and directs the message flow along
the correct path; in the Java world, it is like a switch case. However, we
need to make it highly available. By making it a Helix Participant and
controlling the replication through Kubernetes, we can fulfill that
requirement and still keep it as a generic component in the system.
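
The "switch case" behaviour above can be sketched as follows. The states and
route names here are my own illustration, not taken from the actual
implementation: the monitor inspects the Agent's reported status and decides
where the message flows next.

```java
// Minimal sketch of the Async Command Monitor's routing decision.
// AgentStatus values and route names are illustrative assumptions only.
public class AsyncCommandMonitor {

    public enum AgentStatus { IDLE, BUSY, UNREACHABLE }

    public static String route(AgentStatus status) {
        switch (status) {
            case IDLE:
                return "SUBMIT_COMMAND";   // agent is free: push the command now
            case BUSY:
                return "QUEUE_AND_WAIT";   // agent busy: wait for its callback event
            case UNREACHABLE:
                return "SPAWN_AGENT";      // no live agent: start one on the resource first
            default:
                return "FAIL_TASK";        // unknown state: fail the task safely
        }
    }

    public static void main(String[] args) {
        System.out.println(route(AgentStatus.IDLE)); // prints SUBMIT_COMMAND
    }
}
```

Because the monitor holds no state of its own between invocations, any
replica can make this decision, which is what makes the Helix Participant +
Kubernetes replication combination viable.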

> * On a related note, any thoughts on running database also as part of the
> kubernetes cluster? K8s has a MySQL example [1] but wondering on any other
> pragmatic experiences.

Good suggestion. I have had that idea not only for MySQL, but also for Kafka
and Zookeeper. There are a few challenges when trying to containerize those
applications.

1. Applications like Zookeeper have a static, unique name for each node in
the Zookeeper quorum, and each node must be configured to know about the
other nodes before it starts. For example, each zoo.cfg file should contain
entries like this before the cluster is started.
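
The kind of quorum entries meant above typically look like the following
(host names are illustrative):

```
# zoo.cfg: every node must list all quorum members before the cluster starts.
# Format: server.<id>=<host>:<peer-port>:<leader-election-port>
server.1=zk-node-1:2888:3888
server.2=zk-node-2:2888:3888
server.3=zk-node-3:2888:3888
```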

This is not container friendly. Containers are normally stateless, so it is
challenging to spin up a replacement for a failed container with the same
identity (both the same host name and the same static configuration).
Kubernetes solves this with a concept called Stateful Sets, where the newly
spawned pod gets the same host name and the same persistent volume as the
dead pod.

2. Databases like MySQL need a persistent data directory. So we have to make
sure that newly spawned pods are placed on the same node (physical machine)
as the old ones, because data directories are not replicated among the nodes
of the Kubernetes cluster. Here too, we should be able to use Stateful Sets
to solve the issue. The link you shared provides good evidence for that.

3. The above point about data directories is also valid for Kafka brokers.
However, most of the issues that we come across in containerizing Kafka
brokers are likewise solved using Stateful Sets [1].
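
As a rough sketch (names, image, and sizes here are illustrative, not from
an actual deployment), a Stateful Set for the Zookeeper case above would
look something like this; the stable pod identities (zk-0, zk-1, ...) and the
per-pod volume claims are what address points 1 and 2:

```yaml
# Illustrative only: a StatefulSet gives each pod a stable name and a
# persistent volume that follows that identity across restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless   # headless Service providing stable DNS names per pod
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
      - name: zookeeper
        image: zookeeper:3.4
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:      # one PersistentVolumeClaim per pod, reattached on reschedule
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```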

So, in summary, we can deploy all three applications in Kubernetes in a
highly available manner with auto-healing features. But we have to think
about the following facts as well:

1. These applications were not designed to run in containerized
environments. I would say we are using some "hacks" to make them container
friendly.

2. They are inherently highly available, so why do we need to introduce
another layer of high availability?

3. We can achieve auto healing in a Kubernetes cluster, where a failed pod
is automatically replaced by a new pod. But we cannot let the replacement be
placed on a different node (physical machine) because of the above
constraints. So if a node fails, we cannot use the auto-healing
functionality of Kubernetes in this case.

There are pros and cons to either approach. I think this should be opened
for discussion so we can get the viewpoints of others as well. Personally
I'm +0 for the Kubernetes approach :)

> * We need to write the event listener preferably in Python since these
> typically run on a compute cluster where java is not so well supported and
> python is more ubiquitous.

That is possible. The Event Listener interacts with Kafka and invokes the
API server, and we can port it to Python easily. However, as we are
ultimately bundling these components as Docker containers, the language we
use should not be an issue, because all the libraries required for each
language are bundled into the same container image. We only need the kernel
of the host machine, with Docker and the Kubernetes agents installed on it.
I'm not sure I have completely understood your point about Java not being
supported. Don't those compute machines support Java at the kernel level?

> * What is your suggestion on the job description (the message payload in
> your example) format? Can we send in a thrift binary through Kafka and have
> the listener parse out the required information?

That should be possible, and it is a good suggestion. We can write custom
serializers and deserializers for Kafka message topics [2].
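
A minimal sketch of the encode/decode logic such a serializer/deserializer
pair would wrap. The payload fields here are illustrative assumptions; a
real implementation would put these methods inside classes implementing
Kafka's org.apache.kafka.common.serialization.Serializer<T> and
Deserializer<T> interfaces, and could delegate to Thrift's
TSerializer/TDeserializer instead of hand-rolled field encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical job-description payload; field names are illustrative only.
class JobDescription {
    final String jobId;
    final String command;
    JobDescription(String jobId, String command) {
        this.jobId = jobId;
        this.command = command;
    }
}

// Sketch of the byte-level round trip a Kafka serializer/deserializer
// pair for this payload would perform.
public class JobDescriptionCodec {

    public static byte[] serialize(JobDescription job) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeUTF(job.jobId);     // length-prefixed UTF-8 fields
            out.writeUTF(job.command);
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static JobDescription deserialize(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            return new JobDescription(in.readUTF(), in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The listener side would then parse the required job information out of the
deserialized object rather than out of raw bytes.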

> Suresh
> [1] -
> replicated-stateful-application/
> On Dec 4, 2017, at 1:30 PM, DImuthu Upeksha <>
> wrote:
> Hi folks,
> I have implemented support for Async Job Submission with callback
> workflows on top of the proposed task execution framework. This supports
> both async job submission on remote compute resources using Agents and
> event-driven job monitoring. Using this approach, I'm going to address the
> following issues that we are facing today:
> 1. Resolve false DoS attack detection on compute resources when doing
> multiple SSH command executions in a short period of time.
> 2. Optimize resource utilization and robustness of the Airavata Task
> Execution Framework when executing long-running jobs.
> Design and implementation details can be found from [1].
> Sources for the main components can be found from [2], [3], [4]
> Please share your comments and suggestions
> [1]
> uRD5WO-eB6TLxagAg/edit?usp=sharing
> [2]
> kubernetes/modules/microservices/async-event-listener
> [3]
> kubernetes/modules/microservices/tasks/async-command-monitor
> [4]
> kubernetes/modules/microservices/tasks/async-command-task
> Thanks
> Dimuthu
> [1]

