airavata-dev mailing list archives

From "Shenoy, Gourav Ganesh" <>
Subject Re: Apache Helix as a Task Execution Framework for Airavata
Date Thu, 22 Jun 2017 19:40:56 GMT
Hi dev,

I have pushed the prototype project with Apache Helix to the Airavata Sandbox on GitHub. This
project creates a simple task execution workflow (DAG) with 4 tasks and runs it on 3 workers
(the cluster contains 1 controller node, 3 worker nodes, and 1 manager node). Here's the link:

After discussing with Suresh, Supun, and Apoorv, we have agreed to move forward with Helix for
task execution in Airavata. I will now work on the architectural changes needed to accommodate
Helix, and I shall update this list with proposed designs. In parallel, I shall also start work
on the develop branch to use Helix.

Thanks and Regards,
Gourav Shenoy

From: "Shenoy, Gourav Ganesh" <>
Reply-To: "" <>
Date: Wednesday, June 21, 2017 at 1:24 PM
To: "" <>
Subject: Apache Helix as a Task Execution Framework for Airavata

Hi dev,

Apache Helix is a generic cluster management framework that allows one to build highly scalable
and fault-tolerant distributed systems. It provides a range of functionalities, including:

·         Automatic assignment of resources (task executors) and their partitions (the parallelism
of a resource) to nodes in the cluster.

·         Detecting failure of nodes in the cluster, and taking appropriate actions to recover
them.

·         Cluster management – adding nodes and resources to the cluster dynamically, load
balancing.

·         Ability to define an IDEAL STATE for a node, and to define STATE TRANSITIONS to apply
when the state of a node deviates from the IDEAL one.

Apart from these, Helix also provides out-of-the-box APIs to perform Distributed Task Execution.
Some of the concepts Helix uses are a good fit for our Airavata task execution use-case. These
concepts include:

·         Tasks – the actual runnable logic executors (e.g. job submission, data staging, etc.).
Tasks return a TaskResult object which contains the state of the task once completed; these states
include COMPLETED, FAILED, and FATAL_FAILED. The difference between FAILED and FATAL_FAILED is
that FAILED tasks are re-run by Helix (a retry threshold can be set), whereas FATAL_FAILED tasks
are not. A minimal sketch of such a task follows this list.

·         Jobs – a combination of tasks without dependencies; i.e. if a job contains more than
one task, the tasks are run in parallel across workers.

·         Workflow – a combination of jobs arranged in a DAG. A ONE-TIME workflow ends once all
its jobs are completed, whereas a RECURRING workflow can be scheduled to run periodically.

·         Job Queues – another type of workflow, but one that never ends; it keeps accepting new
incoming jobs and ends only when the user terminates it.

·         Helix also allows users to share data (key-value pairs) across Tasks/Jobs/Workflows.
Content stored at the workflow layer can be shared by the jobs belonging to that workflow;
similarly, content persisted at the job layer can be shared by the tasks nested in that job (this
is also shown in the sketch below).

·         Helix provides APIs to POLL either a JOB or WORKFLOW to reach a particular state.
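
To make these concepts concrete, here is a minimal sketch of what such a task could look like.
The class name, the staging logic, and the key-value shown are placeholders of mine, not code
from the prototype:

    import org.apache.helix.task.Task;
    import org.apache.helix.task.TaskResult;
    import org.apache.helix.task.UserContentStore;

    // Hypothetical data-staging task; the real Airavata tasks (job submission,
    // data staging, etc.) would put their logic inside run(). Extending
    // UserContentStore gives the task access to the shared key-value data
    // mentioned above.
    public class DataStagingTask extends UserContentStore implements Task {

        @Override
        public TaskResult run() {
            try {
                // ... stage input files for the experiment here ...

                // Share a value with other jobs in the same workflow.
                putUserContent("stagedFile", "/tmp/input.dat", Scope.WORKFLOW);
                return new TaskResult(TaskResult.Status.COMPLETED, "staging done");
            } catch (Exception e) {
                // FAILED tasks are re-run by Helix (up to the retry threshold);
                // returning FATAL_FAILED instead would skip the retries.
                return new TaskResult(TaskResult.Status.FAILED, e.getMessage());
            }
        }

        @Override
        public void cancel() {
            // Invoked if Helix cancels the task, e.g. when the workflow is terminated.
        }
    }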

Some core concepts used in Helix that are important to know:

·         Participant – a node in a Helix cluster (a.k.a. an instance or worker), which hosts
resources (a.k.a. tasks). A sketch of a participant's setup follows this list.

·         Controller – a node in a Helix cluster that monitors and controls the Participant
nodes. The controller is responsible for checking whether the state of a participant node matches
the IDEAL state and, if not, performing STATE TRANSITIONS in order to bring that node back to the
IDEAL state.

·         State Model & State Transitions – Helix allows developers to define what state a
participant node needs to be in to be declared healthy. For example, in an ONLINE-OFFLINE state
model, a node is healthy if it is in the ONLINE state; if it goes OFFLINE (for any reason), we
can define TRANSITION actions to bring it back ONLINE.

·         Cluster – contains the participant and controller nodes. One can define the state
model for a cluster.
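
For illustration, a participant (worker) node could be set up roughly as below. The ZooKeeper
address, cluster name, and instance name are assumptions, and DataStagingTask is the hypothetical
task sketched earlier:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.participant.StateMachineEngine;
    import org.apache.helix.task.TaskFactory;
    import org.apache.helix.task.TaskStateModelFactory;

    public class Worker {
        public static void main(String[] args) throws Exception {
            String zkAddress = "localhost:2181";        // assumption: local ZooKeeper
            String clusterName = "airavata-cluster";    // hypothetical cluster name

            // Join the cluster as a PARTICIPANT (worker) node.
            HelixManager manager = HelixManagerFactory.getZKHelixManager(
                    clusterName, "worker-1", InstanceType.PARTICIPANT, zkAddress);

            // Map command names to task factories; "DataStaging" must match the
            // command set on the TaskConfig when jobs are built.
            Map<String, TaskFactory> taskFactories = new HashMap<>();
            taskFactories.put("DataStaging", context -> new DataStagingTask());

            // Register the built-in "Task" state model so the controller can drive
            // task state transitions on this node.
            StateMachineEngine engine = manager.getStateMachineEngine();
            engine.registerStateModelFactory("Task",
                    new TaskStateModelFactory(manager, taskFactories));

            manager.connect();
        }
    }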

How can Helix be used in Airavata?
Assuming we use Helix just to perform distributed task execution, I have the following in mind:

·         Create Helix Tasks (by implementing the Task interface) for each of our job-submission,
data-staging, etc. steps. These tasks are called resources.

·         Create Participant nodes (a.k.a. workers) to hold these resources. Helix allows us
to create resource partitions, such that if we need a Task to run in parallel across workers,
we can set num_partitions > 1 for that resource.

·         Define a StateModel, either an OnlineOffline or MasterSlave, and necessary state
transitions. With state transitions we can control the behavior of the participant nodes.

·         Create a WORKFLOW to execute a single experiment. This workflow will contain the DAG
necessary to run that experiment.

·         Create a long-running QUEUE to keep accepting incoming experiment requests. Each new
experiment request will result in the creation of a new JOB added to this queue – this job will
contain one task, which is to create and run the workflow mentioned in the previous bullet. A
sketch of this submission flow follows this list.
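
Putting the above together, a rough sketch of this submission flow using Helix's TaskDriver is
below; the workflow, job, and command names are hypothetical:

    import java.util.Collections;

    import org.apache.helix.HelixManager;
    import org.apache.helix.task.JobConfig;
    import org.apache.helix.task.JobQueue;
    import org.apache.helix.task.TaskConfig;
    import org.apache.helix.task.TaskDriver;
    import org.apache.helix.task.TaskState;
    import org.apache.helix.task.Workflow;

    public class ExperimentLauncher {

        // Assumes 'manager' is an already-connected HelixManager.
        public static void launch(HelixManager manager) throws Exception {
            TaskDriver driver = new TaskDriver(manager);

            // One job per step; "DataStaging" / "JobSubmission" are the command
            // names registered in the workers' TaskFactory map.
            JobConfig.Builder staging = new JobConfig.Builder().addTaskConfigs(
                    Collections.singletonList(new TaskConfig.Builder()
                            .setTaskId("stage-in").setCommand("DataStaging").build()));
            JobConfig.Builder submission = new JobConfig.Builder().addTaskConfigs(
                    Collections.singletonList(new TaskConfig.Builder()
                            .setTaskId("submit").setCommand("JobSubmission").build()));

            // A ONE-TIME workflow: staging must finish before submission starts.
            Workflow.Builder experiment = new Workflow.Builder("experiment-001")
                    .addJob("staging", staging)
                    .addJob("submission", submission)
                    .addParentChildDependency("staging", "submission");
            driver.start(experiment.build());

            // Block until the workflow reaches a terminal state.
            driver.pollForWorkflowState("experiment-001",
                    TaskState.COMPLETED, TaskState.FAILED);

            // Alternatively, a long-running queue that keeps accepting experiments:
            driver.createQueue(new JobQueue.Builder("experiment-queue").build());
            driver.enqueueJob("experiment-queue", "experiment-002", staging);
        }
    }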

I have managed to get a working task execution framework prototype with Helix (Java). I am
improving it to accommodate mock Airavata services as tasks, and mock experiment DAGs as workflows.
Before we can finalize whether or not to use Helix, I would like to demonstrate this prototype
and then take it ahead from there.

I would love to hear thoughts, suggestions, or comments about this proposal. If anyone is
familiar with Helix, please do share your inputs.

Thanks and Regards,
Gourav Shenoy
