airavata-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shameera Rathnayaka <shameerai...@gmail.com>
Subject Re: Work Stealing is not a good solution for Airavata.
Date Thu, 06 Oct 2016 20:24:38 GMT
[image: Work Queue Message Life time.png]Hi Amila,

Please find work queue message execution sequence diagram below. Hope this
will help to understand how it works in Airavata.



On Thu, Oct 6, 2016 at 4:05 PM Suresh Marru <smarru@apache.org> wrote:

> Just a quick top post. This is informative discussion, please continue :)
>
> I agree on that Airavata does not do Work Stealing but it implements "Work
> Queues”. Conceptually they are similar to the OS Kernel level work queens,
> but more in a distributed context -
> https://www.kernel.org/doc/Documentation/workqueue.txt
>
> Suresh
>
>
> On Oct 6, 2016, at 3:52 PM, Amila Jayasekara <thejaka.amila@gmail.com>
> wrote:
>
> On Thu, Oct 6, 2016 at 3:17 PM, Shameera Rathnayaka <
> shameerainfo@gmail.com> wrote:
>
>
>
> On Thu, Oct 6, 2016 at 2:50 PM Amila Jayasekara <thejaka.amila@gmail.com>
> wrote:
>
> On Thu, Oct 6, 2016 at 11:07 AM, Shameera Rathnayaka <
> shameerainfo@gmail.com> wrote:
>
> Hi Amila,
>
> -- Please explain how you used "work stealing" in distributed system. That
> would be interesting.
>
>
> Airavata depends on work stealing + amqp for followings,
> Fault Tolerance - This is one of major distributed system problem which
> critical in Airavata, What ever the reason experiment request processing
> shouldn't get any effect  from internal node failure. Even with the node
> failures, Airavata should be capable enough to continue experiment request
> processing or hold it until at least one node appear and then continue. How
> this is handled in Ariavata is, worker only ack for messages only after it
> completely processed it. If the node goes down without sendings  ack for
> the messages it was processing,then rabbitmq put all these un-ack messages
> back to the queue and available to consume again.
>
> Resource Utilization- Another important goal of distributed system to
> effectively use available resources in the system, namely the memory and
> processors of components.  In Airavata this will decide the throughput and
> response time of experiments. Currently, at a given time workers only get
> messages up to a preconfigured limit (the limit is prefetch count) But most
> of these jobs are async jobs. That means after worker gets fixed amount of
> jobs, it won't get any other jobs even worker capable or handling more
> jobs, waste of worker resources.
>
>
>
> You still did not answer my question. I want to know how you used "work
> stealing" in your implementation. In other words how distributed work
> stealing works in your implementation. The details  you gave above is
> unrelated and does not answer my question.
>
>
> I think I have explained, how we use work stealing (work queues). If you
> are finding a more analog solution to parallel computing work strealing
> then that is hard to explain.
>
>
> No, you have not. :-).
> Work stealing != work queues. In a distributed setting I would image
> following kind of a work stealing implementation; Every worker
> (orchestrator) maintains a request queue locally and it serve requests
> coming to the local queue. Whenever one worker runs out of more requests to
> serve it will query other distributed workers local queues to see whether
> there are requests that it can serve. If there are it can steal requests
> from other workers local queues and process. However, this model of
> computation is in efficient to do in a distributed environment. I guess
> that is the same reason we dont find much distributed work stealing
> implementations.
>
> Anyhow lets stop the discussion about work stealing now. :-)
>
>
>
>
>
>
>
>
> -- I dont see AMQP in the architecture diagram you attached above and I
> dont understand why Airavata has to depend on it. One way to figure this
> out is think about the architecture without AMQP and figure out what
> actually should happend and look for a way to do that using AMQP.
>
>
> Worker Queue is AMQP queue.
>
>
> Does the worker queue needs to be an AMQP queue ? Sorry, I dont know much
> about AMQP but it sounds like limitations you are explaining are because of
> AMQP.
>
>
> It is not, but good to use well-defined protocol instead of custom one.
> Almost all messaging systems have implemented AMQP protocol.
>
>
> Can we figure out whether others have also encountered the same/similar
> problem and how they tackled those with AMQP ? Cos the design we have is
> pretty straightforward and I believe there are systems analogous to our
> design that uses AMQP.
>
>
>
>
>
>
>
> -- Does this mean that you have a waiting thread or process within
> Airavata after submitting the job (for each work) ?
>
>
> No, once the job is submitted to the remote resource, thread goes back to
> the thread pool.
>
>
> Then, your previous explanation, (i.e., "The time needs for a worker to
> finish the work is depend on the application run time (applications runs on
> HPC machine). Theoretically, this can be from few sec to days or even
> more."), invalidates. Correct ?
>
>
> No, it is still valid, thread goes to thread pool doesn't say worker is
> complete that request, it is waiting until actual hpc job runs on target
> computer resoruces. After this hpc jobs completed then outptu data staging
> happens. After output stage to storage then it ack to the work queue
> message.
>
>
> This is confusing to me.
> Does this mean once you return thread to thread pool, it is not reusable
> for another request ? Also, how do you wait on a thread after returning it
> to the thread pool ?
> Also, why do you have to wait for HPC job to complete ? I was under the
> impression the communication is asynchronous. i.e. after job completes you
> get an email confirmation and then you start output data staging in a
> separate thread.
>
> We should probably meet and verbally discuss this.
>
> -AJ
>
>
>
> Thanks,
> Shameera.
>
>
>
>
>
> Thanks,
> Shameera.
>
>
>
> It takes more time for me to digest following right now. I will try to
> give more feedback when I properly understand them.
>
> Thanks
> -Amila
>
> That means, If a worker can read N(=prefetch count) number of messages
> from a queue without sending acknowledgment then that is the limit one
> worker can handle for given time. But most of this long-running jobs are
> asynchronous. Worker resources are free to handle more works than N. Hence
> Airavata underutilized worker resources. In the case of small jobs (small
> in runtime), this won't be a big problem.
>
> Apache Zookeeper[2] provide a way to manage distributed system components
> and most of  the latest distributed systems have been used Zookeeper to
> address all the common distributed system problems like HA, FT, Leader
> Election etc ... But in Airavata is trying to replace Rabbitmq with
> Zookeeper to achieve the same outcomes, I haven't seen any framework have
> done it. How Airavata tries to do is using Work Stealing Queues. Anyway,
> Airavata hasn't move zookeeper out of its architecture yet as it uses
> zookeeper to handle cancel requests.
>
>
>
>
> Regarding 3 => Well.. email monitoring was alway problematic. More
> precisely monitoring is problematic IMO. To handle monitoring I think we
> should be able to get better feedback from job schedulers but as per my
> experience, even those job schedulers are unreliable. Until we have a
> better way to get feedback from job scheduler, monitoring is going to be
> challenging. However, I don't understand why you have "serious scalability"
> issues in GFac because of this.
>
>
> Let me explain more about this, I have used some terms comes with latest
> AMQP spec here. Let's say we have two GFac Workers in the system. and both
> submit jobs to Computer Resources and waiting for status update via emails
> (We finally decided to depend on emails for a while come up with more
> robust monitoring solution) when emails come, there is email monitoring
> server which reads this emails and put it to a rabbitmq exchange[3].  Then
> each gfac worker has subscribed to this exchange to get all email updates.
> Because Framework doesn't know which worker handle particular jobs, it
> sends this email content to all workers who subscribe to that exchange.
> There are few issues in this way.
>
> 1. what if one of the workers goes offline for a moment? It should receive
> the messages from where he left the queue. To do that we need to use
> persistence queue which doesn't remove when consumer disconnect. And this
> queue should receive all email updates messages as well during the consumer
> down time. To create a persistent queue and reconnect again to the same
> queue, the consumer should know about the queue name. Ok, let's say worker
> create this queue  with a name in the very first time it joins to email
> monitoring exchange. Now this problem is solved see the second issue.
>
> 2. What if one worker node goes down and we start a different node ( runs
> in different Machine/VM) . Now there is no way this new worker knows the
> queue name created by the previous worker unless we configure it which is
> not a very good solution where we have a pool of workers and this pool
> getting changes time to time. Now the real problem is all the jobs handled
> by down node is getting to this new node but there is no way it gets
> previous email monitoring messages. which make these jobs hanging on it
> previous state forever. Even the previously down worker comes up this might
> not get previous jobs instead it retrieves a new set of jobs. This means we
> can't scale Gfac workers independently. Hope this will explain the issue.
>
>
>
> In summary, to me, none of these are concerns for the architecture at the
> moment. Also, you cannot just naively complain an architecture is "not
> good". Architecture has to be compared with another design and evaluate and
> pros/cons for both. I suggest we first try to improve the existing design
> to handle issues you are pointing.
>
>
> It would be great to get some solution for above issues. IMHO, we have
> overcomplicated Airavata design with Work Stealing approach, which is not
> suitable for Airavata use cases and requirement.
>
> Thank you for your feedbacks, hope I answered to all of your concerns,
>
>
> [1] http://www.rabbitmq.com/consumer-prefetch.html
> [2] https://zookeeper.apache.org
> [3] https://www.rabbitmq.com/tutorials/tutorial-two-python.html
>
> Thanks,
> Shameera.
>
> [2] https://en.wikipedia.org/wiki/Work_stealing
>
> Thanks
> -Amila
>
> On Tue, Oct 4, 2016 at 1:07 PM, Shameera Rathnayaka <
> shameerainfo@gmail.com> wrote:
>
> Hi Devs,
>
> Airavata has adopted to work stealing design pattern lately and use work
> queue approach to distributing works among consumers. There are two work
> queues in current Airavata architecture. One in middle of API Server and
> Orchestrator and the second one in between Orchestrator and Gfac, Following
> is very high-level Airavata architecture.
>
>
> <Highleve Arachitecture- Airavata.png>
> Here are the issues we have with above architecture.
>
> 1. Low resource utilization in Workers (Gfac/Orchestrator).
> We have used AMPQ prefetch count to limit the number of requests served by
> a Worker, which is not a good way to load balance in the
> heterogeneous environment where different workers have different level of
> resources. And it is recommended to keep this prefetch count minimum [1]
> and this is valid for work stealing too. If we only have one worker and we
> have M ( > N) number of long running jobs, and our prefetch count is N
> then, only N jobs will in active mode. As we are run this long-running job
> in the async way, we can handle more long running jobs than N. Therefore
> workers resources are underutilized.
>
> 2. Even though we can easily deal with recovery requirement with work
> stealing, it is not easy to handle cancel experiments. When this cancel
> experiment comes the worker who works on this experiment should act
> immediately. To add this behavior we need to introduce priority queues and
> no need say this will add extra layer of complexity. Currently, we use
> zookeeper to trigger cancel requests ( Another downside, we are using both
> zookeeper and rabbitmq to solve different parts of Distributed systems
> issues. Almost all latest Distributed system frameworks have being used
> zookeeper to manage distributed system problems, we need to strongly
> consider using zookeeper  as a way of managing our components and share the
> load according to the resource available in workers)
>
> 3. Putting email to a queue is not a good solution with commodity servers
> where system failures are expected. This email queue is critical, if we
> missed one of the statuses of a job then this job can go to the unknown
> state or hang in the old status forever. Due to this, we have serious
> scalability issue with GFac at the moment due to a bottleneck of email
> monitoring.
>
> I think we need to re-evaluate Airavata architecture and find a good yet
> simple solution based on requirements. The new architecture should handle
> all existing issues and able to extend future requirement.
>
>
> [1]
> http://www.mariuszwojcik.com/blog/How-to-choose-prefetch-count-value-for-RabbitMQ
>
>
> --
> Shameera Rathnayaka
>
>
> --
> Shameera Rathnayaka
>
> --
> Shameera Rathnayaka
>
> --
> Shameera Rathnayaka
>
>
>
> --
Shameera Rathnayaka

Mime
View raw message