airavata-dev mailing list archives

From Shameera Rathnayaka <shameerai...@gmail.com>
Subject Re: Work Stealing is not a good solution for Airavata.
Date Thu, 06 Oct 2016 20:26:03 GMT
The previous attachment doesn't work.

On Thu, Oct 6, 2016 at 4:24 PM, Shameera Rathnayaka <shameerainfo@gmail.com>
wrote:

> [image: Work Queue Message Life time.png]
>
> Hi Amila,
>
> Please find the work queue message execution sequence diagram below. I hope
> this helps to explain how it works in Airavata.
>
>
>
> On Thu, Oct 6, 2016 at 4:05 PM Suresh Marru <smarru@apache.org> wrote:
>
>> Just a quick top post. This is informative discussion, please continue :)
>>
>> I agree that Airavata does not do work stealing but rather implements
>> "work queues". Conceptually they are similar to the OS kernel-level work
>> queues, but more in a distributed context -
>> https://www.kernel.org/doc/Documentation/workqueue.txt
>>
>> Suresh
>>
>>
>> On Oct 6, 2016, at 3:52 PM, Amila Jayasekara <thejaka.amila@gmail.com>
>> wrote:
>>
>> On Thu, Oct 6, 2016 at 3:17 PM, Shameera Rathnayaka <
>> shameerainfo@gmail.com> wrote:
>>
>>
>>
>> On Thu, Oct 6, 2016 at 2:50 PM Amila Jayasekara <thejaka.amila@gmail.com>
>> wrote:
>>
>> On Thu, Oct 6, 2016 at 11:07 AM, Shameera Rathnayaka <
>> shameerainfo@gmail.com> wrote:
>>
>> Hi Amila,
>>
>> -- Please explain how you used "work stealing" in distributed system.
>> That would be interesting.
>>
>>
>> Airavata depends on work stealing + AMQP for the following:
>> Fault tolerance - This is one of the major distributed-system problems,
>> and it is critical in Airavata. Whatever the reason, experiment request
>> processing shouldn't be affected by an internal node failure. Even with
>> node failures, Airavata should be capable of continuing experiment request
>> processing, or of holding it until at least one node appears and then
>> continuing. The way this is handled in Airavata is that a worker acks a
>> message only after it has completely processed it. If a node goes down
>> without sending acks for the messages it was processing, then RabbitMQ
>> puts all those un-acked messages back into the queue, where they are
>> available to consume again.
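The ack-and-requeue behavior described above can be sketched with a small in-memory model (a pure-Python toy standing in for the broker, not RabbitMQ itself; all names here are illustrative):

```python
from collections import deque

class WorkQueue:
    """Toy model of the RabbitMQ behavior described above: a message
    stays 'unacked' until the consumer acks it; if the consumer dies
    first, the broker puts the message back on the queue."""

    def __init__(self, messages):
        self.ready = deque(messages)   # messages waiting for a consumer
        self.unacked = {}              # delivery_tag -> message in flight
        self._tag = 0

    def deliver(self):
        """Hand the next message to a consumer without removing it for good."""
        msg = self.ready.popleft()
        self._tag += 1
        self.unacked[self._tag] = msg
        return self._tag, msg

    def ack(self, tag):
        """Consumer finished processing: now the message is really gone."""
        del self.unacked[tag]

    def requeue_unacked(self):
        """Broker notices the consumer connection died: requeue every
        unacked message, preserving the original order."""
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(tag))

q = WorkQueue(["exp-1", "exp-2"])
tag, msg = q.deliver()      # a worker picks up "exp-1"
# ... the worker node crashes before acking ...
q.requeue_unacked()         # broker redelivers: "exp-1" is back in front
print(list(q.ready))        # ['exp-1', 'exp-2']
```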
>>
>> Resource utilization - Another important goal of a distributed system is
>> to effectively use the available resources in the system, namely the
>> memory and processors of its components. In Airavata this decides the
>> throughput and response time of experiments. Currently, at a given time,
>> workers only get messages up to a preconfigured limit (the prefetch
>> count). But most of these jobs are async jobs. That means that after a
>> worker gets its fixed number of jobs, it won't get any more even if it is
>> capable of handling more, which wastes worker resources.
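The underutilization argument reduces to simple arithmetic, sketched below (an illustrative model of the broker's per-consumer delivery limit, not actual AMQP code):

```python
# Toy model of the prefetch-count limit: the broker delivers at most
# `prefetch` unacked messages to a worker, regardless of how much spare
# capacity the worker has for async (in-flight) jobs.

def deliverable(total_jobs, prefetch, unacked):
    """How many more messages the broker will hand this worker right now."""
    return min(total_jobs, max(0, prefetch - unacked))

# 20 long-running async jobs queued, prefetch count N = 5:
first_batch = deliverable(total_jobs=20, prefetch=5, unacked=0)
print(first_batch)  # 5 jobs go active

# All 5 are submitted to the HPC resource and merely wait (async), but
# none is acked yet, so the worker receives nothing more:
print(deliverable(total_jobs=15, prefetch=5, unacked=5))  # 0
```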
>>
>>
>>
>> You still did not answer my question. I want to know how you used "work
>> stealing" in your implementation. In other words, how does distributed
>> work stealing work in your implementation? The details you gave above are
>> unrelated and do not answer my question.
>>
>>
>> I think I have explained how we use work stealing (work queues). If you
>> are looking for a closer analogue of parallel-computing work stealing,
>> then that is hard to explain.
>>
>>
>> No, you have not. :-)
>> Work stealing != work queues. In a distributed setting I would imagine
>> the following kind of work-stealing implementation: every worker
>> (orchestrator) maintains a request queue locally and serves requests
>> coming into its local queue. Whenever one worker runs out of requests to
>> serve, it queries the other distributed workers' local queues to see
>> whether there are requests it can serve. If there are, it can steal
>> requests from the other workers' local queues and process them. However,
>> this model of computation is inefficient in a distributed environment. I
>> guess that is the same reason we don't find many distributed work-stealing
>> implementations.
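The scheme described in that paragraph, with per-worker local queues and stealing from idle peers, can be sketched as a single-process simulation (a classic shared-memory work-stealing toy; worker names and tasks are made up for illustration):

```python
import random
from collections import deque

# Each worker owns a local deque; it pops work from its own end and,
# when empty, "steals" from the opposite end of a random victim's
# deque. Doing these steals across machines is what the thread argues
# is expensive in a distributed setting.

def run(workers):
    done = []
    while any(workers.values()):            # some deque still has work
        for name, local in workers.items():
            if local:
                done.append((name, local.pop()))   # own work: LIFO end
            else:
                victims = [w for w, q in workers.items() if q and w != name]
                if victims:
                    victim = random.choice(victims)
                    # steal from the FIFO end of the victim's deque
                    done.append((name, workers[victim].popleft()))
    return done

workers = {"w1": deque(["a", "b", "c", "d"]), "w2": deque()}
log = run(workers)
# w2 starts idle, so it steals from w1 rather than sitting empty.
print(sorted(task for _, task in log))  # ['a', 'b', 'c', 'd']
```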
>>
>> Anyhow, let's stop the discussion about work stealing now. :-)
>>
>>
>>
>>
>>
>>
>>
>>
>> -- I don't see AMQP in the architecture diagram you attached above, and I
>> don't understand why Airavata has to depend on it. One way to figure this
>> out is to think about the architecture without AMQP, work out what
>> actually should happen, and then look for a way to do that using AMQP.
>>
>>
>> The worker queue is an AMQP queue.
>>
>>
>> Does the worker queue need to be an AMQP queue? Sorry, I don't know much
>> about AMQP, but it sounds like the limitations you are explaining are
>> because of AMQP.
>>
>>
>> It does not, but it is good to use a well-defined protocol instead of a
>> custom one. Almost all messaging systems have implemented the AMQP
>> protocol.
>>
>>
>> Can we figure out whether others have encountered the same or a similar
>> problem and how they tackled it with AMQP? Because the design we have is
>> pretty straightforward, and I believe there are systems analogous to our
>> design that use AMQP.
>>
>>
>>
>>
>>
>>
>>
>> -- Does this mean that you have a waiting thread or process within
>> Airavata after submitting the job (for each work item)?
>>
>>
>> No, once the job is submitted to the remote resource, the thread goes
>> back to the thread pool.
>>
>>
>> Then your previous explanation (i.e., "The time needed for a worker to
>> finish the work depends on the application run time (applications run on
>> the HPC machine). Theoretically, this can be from a few seconds to days or
>> even more.") is invalidated. Correct?
>>
>>
>> No, it is still valid. The thread going back to the thread pool doesn't
>> mean the worker has completed that request; it is waiting until the actual
>> HPC job runs on the target compute resources. After the HPC job completes,
>> output data staging happens. After the output is staged to storage, the
>> worker acks the work queue message.
>>
>>
>> This is confusing to me.
>> Does this mean that once you return the thread to the thread pool, it is
>> not reusable for another request? Also, how do you wait on a thread after
>> returning it to the thread pool?
>> Also, why do you have to wait for the HPC job to complete? I was under
>> the impression the communication is asynchronous, i.e., after the job
>> completes you get an email confirmation and then start output data
>> staging in a separate thread.
>>
>> We should probably meet and verbally discuss this.
>>
>> -AJ
>>
>>
>>
>> Thanks,
>> Shameera.
>>
>>
>>
>>
>>
>> Thanks,
>> Shameera.
>>
>>
>>
>> It will take me more time to digest the following right now. I will try
>> to give more feedback when I properly understand it.
>>
>> Thanks
>> -Amila
>>
>> That means that if a worker can read N (= prefetch count) messages from a
>> queue without sending acknowledgments, then that is the limit one worker
>> can handle at a given time. But most of these long-running jobs are
>> asynchronous, so worker resources are free to handle more work than N.
>> Hence Airavata underutilizes worker resources. In the case of small jobs
>> (small in runtime), this won't be a big problem.
>>
>> Apache ZooKeeper [2] provides a way to manage distributed system
>> components, and most of the latest distributed systems have used ZooKeeper
>> to address all the common distributed-system problems like HA, FT, leader
>> election, etc. But Airavata is trying to replace ZooKeeper with RabbitMQ
>> to achieve the same outcomes, and I haven't seen any framework that has
>> done that. The way Airavata tries to do it is by using work-stealing
>> queues. Anyway, Airavata hasn't moved ZooKeeper out of its architecture
>> yet, as it uses ZooKeeper to handle cancel requests.
>>
>>
>>
>>
>> Regarding 3 => Well, email monitoring was always problematic. More
>> precisely, monitoring is problematic, IMO. To handle monitoring, I think
>> we should be able to get better feedback from job schedulers, but in my
>> experience even those job schedulers are unreliable. Until we have a
>> better way to get feedback from the job scheduler, monitoring is going to
>> be challenging. However, I don't understand why you have "serious
>> scalability" issues in GFac because of this.
>>
>>
>> Let me explain more about this; I have used some terms that come with the
>> latest AMQP spec here. Let's say we have two GFac workers in the system,
>> and both submit jobs to compute resources and wait for status updates via
>> email (we finally decided to depend on email for a while until we come up
>> with a more robust monitoring solution). When emails come in, there is an
>> email monitoring server which reads these emails and puts them into a
>> RabbitMQ exchange [3]. Then each GFac worker subscribes to this exchange
>> to get all email updates. Because the framework doesn't know which worker
>> handles a particular job, it sends the email content to all workers
>> subscribed to that exchange. There are a few issues with this approach.
>>
>> 1. What if one of the workers goes offline for a moment? It should
>> receive the messages from where it left the queue. To do that, we need to
>> use a persistent queue which isn't removed when the consumer disconnects.
>> And this queue should also receive all email update messages during the
>> consumer's downtime. To create a persistent queue and reconnect to the
>> same queue again, the consumer needs to know the queue name. OK, let's say
>> the worker creates this queue with a name the very first time it joins
>> the email monitoring exchange. Now this problem is solved; see the second
>> issue.
>>
>> 2. What if one worker node goes down and we start a different node
>> (running on a different machine/VM)? Now there is no way for this new
>> worker to know the queue name created by the previous worker unless we
>> configure it, which is not a very good solution when we have a pool of
>> workers and this pool changes from time to time. Now the real problem: all
>> the jobs handled by the downed node are handed to this new node, but there
>> is no way it gets the previous email monitoring messages, which leaves
>> those jobs hanging in their previous state forever. Even if the previously
>> downed worker comes back up, it might not get its previous jobs; instead
>> it retrieves a new set of jobs. This means we can't scale GFac workers
>> independently. I hope this explains the issue.
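The two scenarios above can be condensed into one toy model of a fanout exchange (pure Python, not RabbitMQ; queue and message names are invented for illustration). The exchange copies each status email to every bound queue, so a replacement worker with a freshly named queue never sees updates published while its predecessor's queue was gone:

```python
class FanoutExchange:
    """Toy fanout exchange: publish() copies a message into every
    currently bound queue; unbound queues receive nothing."""

    def __init__(self):
        self.queues = {}                 # queue name -> list of messages

    def bind(self, name):
        self.queues.setdefault(name, [])

    def unbind(self, name):
        self.queues.pop(name)

    def publish(self, msg):
        for q in self.queues.values():
            q.append(msg)

ex = FanoutExchange()
ex.bind("worker-A-queue")
ex.publish("job-42: RUNNING")

ex.unbind("worker-A-queue")          # worker A's VM dies, its queue is gone
ex.publish("job-42: COMPLETED")      # the critical update is lost

ex.bind("worker-B-queue")            # replacement worker, new queue name
ex.publish("job-43: RUNNING")
print(ex.queues["worker-B-queue"])   # ['job-43: RUNNING'] only;
                                     # job-42's completion never arrives
```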
>>
>>
>>
>> In summary, to me, none of these are concerns for the architecture at the
>> moment. Also, you cannot just naively complain that an architecture is
>> "not good". An architecture has to be compared with another design,
>> evaluating the pros and cons of both. I suggest we first try to improve
>> the existing design to handle the issues you are pointing out.
>>
>>
>> It would be great to get some solutions for the above issues. IMHO, we
>> have overcomplicated the Airavata design with the work-stealing approach,
>> which is not suitable for Airavata's use cases and requirements.
>>
>> Thank you for your feedback; I hope I have answered all of your concerns.
>>
>>
>> [1] http://www.rabbitmq.com/consumer-prefetch.html
>> [2] https://zookeeper.apache.org
>> [3] https://www.rabbitmq.com/tutorials/tutorial-two-python.html
>>
>> Thanks,
>> Shameera.
>>
>> [2] https://en.wikipedia.org/wiki/Work_stealing
>>
>> Thanks
>> -Amila
>>
>> On Tue, Oct 4, 2016 at 1:07 PM, Shameera Rathnayaka <
>> shameerainfo@gmail.com> wrote:
>>
>> Hi Devs,
>>
>> Airavata has lately adopted the work-stealing design pattern and uses a
>> work queue approach to distribute work among consumers. There are two work
>> queues in the current Airavata architecture: one between the API Server
>> and the Orchestrator, and a second between the Orchestrator and GFac.
>> Following is the very high-level Airavata architecture.
>>
>>
>> <Highleve Arachitecture- Airavata.png>
>> Here are the issues we have with above architecture.
>>
>> 1. Low resource utilization in workers (GFac/Orchestrator).
>> We have used the AMQP prefetch count to limit the number of requests
>> served by a worker, which is not a good way to load balance in a
>> heterogeneous environment where different workers have different levels
>> of resources. It is also recommended to keep this prefetch count to a
>> minimum [1], and this is valid for work stealing too. If we have only one
>> worker and M (> N) long-running jobs, and our prefetch count is N, then
>> only N jobs will be active. As we run these long-running jobs in an async
>> way, we could handle more than N of them. Therefore worker resources are
>> underutilized.
>>
>> 2. Even though we can easily deal with the recovery requirement with work
>> stealing, it is not easy to handle cancelling experiments. When a cancel
>> request comes in, the worker that is working on that experiment should act
>> immediately. To add this behavior we need to introduce priority queues,
>> and needless to say this adds an extra layer of complexity. Currently, we
>> use ZooKeeper to trigger cancel requests. (Another downside: we are using
>> both ZooKeeper and RabbitMQ to solve different parts of our
>> distributed-system issues. Almost all the latest distributed-system
>> frameworks have used ZooKeeper to manage distributed-system problems; we
>> need to strongly consider using ZooKeeper as the way of managing our
>> components and sharing the load according to the resources available in
>> workers.)
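The priority-queue idea mentioned in point 2 can be sketched as follows (a stdlib-heap toy, not Airavata's actual cancel path; message texts and priority values are made up). Cancel messages carry a higher priority, so a worker sees them before ordinary launch messages that arrived earlier:

```python
import heapq

# Lower number = higher priority, served first.
CANCEL, LAUNCH = 0, 1

q = []
seq = 0          # tie-breaker that preserves FIFO within one priority level
for prio, msg in [(LAUNCH, "launch exp-1"),
                  (LAUNCH, "launch exp-2"),
                  (CANCEL, "cancel exp-1")]:
    heapq.heappush(q, (prio, seq, msg))
    seq += 1

# The cancel jumps ahead of both launches despite arriving last:
order = [heapq.heappop(q)[2] for _ in range(len(q))]
print(order)  # ['cancel exp-1', 'launch exp-1', 'launch exp-2']
```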
>>
>> 3. Putting emails into a queue is not a good solution with commodity
>> servers where system failures are expected. This email queue is critical:
>> if we miss one of a job's statuses, that job can go into an unknown state
>> or hang in its old status forever. Because of this, we have a serious
>> scalability issue with GFac at the moment, due to the bottleneck of email
>> monitoring.
>>
>> I think we need to re-evaluate the Airavata architecture and find a good
>> yet simple solution based on the requirements. The new architecture should
>> handle all existing issues and be able to accommodate future requirements.
>>
>>
>> [1] http://www.mariuszwojcik.com/blog/How-to-choose-prefetch-count-value-for-RabbitMQ
>>
>> --
>> Shameera Rathnayaka
>>
>>
>>
>>
>>
>> --
> Shameera Rathnayaka
>



-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blogs : https://shameerarathnayaka.wordpress.com ,
http://shameerarathnayaka.blogspot.com/
