airavata-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shameera Rathnayaka <>
Subject Re: Work Stealing is not a good solution for Airavata.
Date Wed, 05 Oct 2016 15:14:32 GMT
Hi Amila,

Yeah, I agreed those people who are not much familiar with Airavata current
internal architecture (as it getting change)  above summary is not enough
to understand the issue. Let me explain some of your concern with more
details. Please see my comment inline.

On Tue, Oct 4, 2016 at 10:51 PM Amila Jayasekara <>

First of all, what do you mean by "work stealing design pattern"? I have
not heard such a design pattern and probably what you are trying to explain
is not "work stealing" (to my understanding). Work stealing is a
multi-threaded scheduling mechanism mostly use in systems that implement
lightweight threads. The basic idea is to have a queue per each lightweight
thread, and when a particular thread is idle, it can get work (steal) from
a busy lightweight thread. The architecture scenario you are explaining
does not go into that category of work stealing as per my understanding
(See [2] for more info). It is a more likely distribution of work.

Yes, this is not an actually a design pattern, even though I have used it
in that way. let me take it back and rephrase it, we have used work
stealing strategy to accomplish and solve distributes system nature and its
issues. I am well aware what is work stealing is, but Airavata uses that
concept to distribute work among workers( This is like load balance with
fixed configurable load per Worker) and also use this to address fault
tolerance of the components. That means, this work queues are not just a
tool, it is part of the architecture. In Short, instead of messaging ,
Airavata have used this AMQP to handle distributes system problems too.

Regarding 1 => To me, this looks like a limitation in the tools that we
use. As per what I understood the M, N issue is because of use of AMQP
prefetch. However, I did not understand, what AMQP prefetch is doing. Just
because the underlying tools has limitations, I don't think we can
criticise the design. Always try to keep the conceptual design and tools

Yeah, you are correct, prefetch is something comes with AMQP, and the bad
thing is Airavata depends on that limit, which decides the maximum load can
be handled by a worker at a given time. Refer this [1] to get an idea about
prefetch. If the architecture solely based on AMQP(this is not a tool) to
distribute works and address FT of Workers then we can't just ignore it.

Regarding 2 => For cancel experiments keeping priority queue make sense but
how long a request reside in the worker queue?
I have no idea what you are trying to explain using zookeeper and rabbitmq
in this context (also I don't know how those function). If you conceptually
describe the problem, i may able to give more feedback.

The worker doesn't ack the message until it finishes the work. The time
needs for a worker to finish the work is depend on the application run time
(applications runs on HPC machine). Theoretically, this can be from few sec
to days or even more. That means, If a worker can read N(=prefetch count)
number of messages from a queue without sending acknowledgment then that is
the limit one worker can handle for given time. But most of this
long-running jobs are asynchronous. Worker resources are free to handle
more works than N. Hence Airavata underutilized worker resources. In the
case of small jobs (small in runtime), this won't be a big problem.

Apache Zookeeper[2] provide a way to manage distributed system components
and most of  the latest distributed systems have been used Zookeeper to
address all the common distributed system problems like HA, FT, Leader
Election etc ... But in Airavata is trying to replace Rabbitmq with
Zookeeper to achieve the same outcomes, I haven't seen any framework have
done it. How Airavata tries to do is using Work Stealing Queues. Anyway,
Airavata hasn't move zookeeper out of its architecture yet as it uses
zookeeper to handle cancel requests.

Regarding 3 => Well.. email monitoring was alway problematic. More
precisely monitoring is problematic IMO. To handle monitoring I think we
should be able to get better feedback from job schedulers but as per my
experience, even those job schedulers are unreliable. Until we have a
better way to get feedback from job scheduler, monitoring is going to be
challenging. However, I don't understand why you have "serious scalability"
issues in GFac because of this.

Let me explain more about this, I have used some terms comes with latest
AMQP spec here. Let's say we have two GFac Workers in the system. and both
submit jobs to Computer Resources and waiting for status update via emails
(We finally decided to depend on emails for a while come up with more
robust monitoring solution) when emails come, there is email monitoring
server which reads this emails and put it to a rabbitmq exchange[3].  Then
each gfac worker has subscribed to this exchange to get all email updates.
Because Framework doesn't know which worker handle particular jobs, it
sends this email content to all workers who subscribe to that exchange.
There are few issues in this way.

1. what if one of the workers goes offline for a moment? It should receive
the messages from where he left the queue. To do that we need to use
persistence queue which doesn't remove when consumer disconnect. And this
queue should receive all email updates messages as well during the consumer
down time. To create a persistent queue and reconnect again to the same
queue, the consumer should know about the queue name. Ok, let's say worker
create this queue  with a name in the very first time it joins to email
monitoring exchange. Now this problem is solved see the second issue.

2. What if one worker node goes down and we start a different node ( runs
in different Machine/VM) . Now there is no way this new worker knows the
queue name created by the previous worker unless we configure it which is
not a very good solution where we have a pool of workers and this pool
getting changes time to time. Now the real problem is all the jobs handled
by down node is getting to this new node but there is no way it gets
previous email monitoring messages. which make these jobs hanging on it
previous state forever. Even the previously down worker comes up this might
not get previous jobs instead it retrieves a new set of jobs. This means we
can't scale Gfac workers independently. Hope this will explain the issue.

In summary, to me, none of these are concerns for the architecture at the
moment. Also, you cannot just naively complain an architecture is "not
good". Architecture has to be compared with another design and evaluate and
pros/cons for both. I suggest we first try to improve the existing design
to handle issues you are pointing.

It would be great to get some solution for above issues. IMHO, we have
overcomplicated Airavata design with Work Stealing approach, which is not
suitable for Airavata use cases and requirement.

Thank you for your feedbacks, hope I answered to all of your concerns,





On Tue, Oct 4, 2016 at 1:07 PM, Shameera Rathnayaka <>

Hi Devs,

Airavata has adopted to work stealing design pattern lately and use work
queue approach to distributing works among consumers. There are two work
queues in current Airavata architecture. One in middle of API Server and
Orchestrator and the second one in between Orchestrator and Gfac, Following
is very high-level Airavata architecture.

[image: Highleve Arachitecture- Airavata.png]
Here are the issues we have with above architecture.

1. Low resource utilization in Workers (Gfac/Orchestrator).
We have used AMPQ prefetch count to limit the number of requests served by
a Worker, which is not a good way to load balance in the
heterogeneous environment where different workers have different level of
resources. And it is recommended to keep this prefetch count minimum [1]
and this is valid for work stealing too. If we only have one worker and we
have M ( > N) number of long running jobs, and our prefetch count is N
then, only N jobs will in active mode. As we are run this long-running job
in the async way, we can handle more long running jobs than N. Therefore
workers resources are underutilized.

2. Even though we can easily deal with recovery requirement with work
stealing, it is not easy to handle cancel experiments. When this cancel
experiment comes the worker who works on this experiment should act
immediately. To add this behavior we need to introduce priority queues and
no need say this will add extra layer of complexity. Currently, we use
zookeeper to trigger cancel requests ( Another downside, we are using both
zookeeper and rabbitmq to solve different parts of Distributed systems
issues. Almost all latest Distributed system frameworks have being used
zookeeper to manage distributed system problems, we need to strongly
consider using zookeeper  as a way of managing our components and share the
load according to the resource available in workers)

3. Putting email to a queue is not a good solution with commodity servers
where system failures are expected. This email queue is critical, if we
missed one of the statuses of a job then this job can go to the unknown
state or hang in the old status forever. Due to this, we have serious
scalability issue with GFac at the moment due to a bottleneck of email

I think we need to re-evaluate Airavata architecture and find a good yet
simple solution based on requirements. The new architecture should handle
all existing issues and able to extend future requirement.


Shameera Rathnayaka

Shameera Rathnayaka

View raw message