airavata-dev mailing list archives

From Lahiru Gunathilake <glah...@gmail.com>
Subject Re: Airavata Orchestrator component
Date Fri, 06 Dec 2013 03:12:25 GMT
Hi Amila,

I have answered the questions you raised, except for some of the "how to"
questions (for those we need to figure out solutions, and before that we
need to come up with a good design).


On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:

>
>
>
> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <glahiru@gmail.com> wrote:
>
>> Hi All,
>>
>> We are thinking of implementing an Airavata Orchestrator component to
>> replace WorkflowInterpreter, so that gateway developers do not have to
>> deal with workflows when they simply have a single independent job to run
>> in their gateways. This component mainly focuses on how to invoke GFAC and
>> accept requests from the client API.
>>
>> I have the following features in mind for this component.
>>
>> 1. It provides a web service or REST interface against which we can
>> implement a client to submit jobs.
>>
>> 2. It accepts a job request and parses the input types; if the input types
>> are correct, it creates an Airavata experiment ID.
>>
>> 3. The Orchestrator then stores the job information in the registry against
>> the generated experiment ID (all the other components identify the job
>> using this experiment ID).
>>
>> 4. After that, the Orchestrator pulls up all the descriptors related to
>> this request, does some scheduling to decide where to run the job, and
>> submits the job to a GFAC node (handling multiple GFAC nodes is going to be
>> a future improvement in the Orchestrator).
>>
>> Pull-based job submission might be a good way to handle errors: if we
>> store jobs in the Registry and GFAC pulls jobs and executes them, the
>> Orchestrator component really doesn't have to worry about error
>> handling.
>>
>
> I did not quite understand what you meant by "pull based job submission".
> I believe it means saving the job in the registry, with GFAC periodically
> looking up new jobs and submitting them.
>
Yes.

> Further, why are you saying you don't need to worry about error handling?
> What sort of errors are you considering?
>
I am considering GFAC failures, or the connection between the Orchestrator
and GFAC going down.

>
>
>>
>> This works because we can implement logic in GFAC so that, if a particular
>> job has not updated its status for a given time, GFAC assumes that either
>> the job has hung or the GFAC node handling that job has failed, so GFAC
>> pulls that job (we definitely need a locking mechanism here, so that two
>> instances do not both execute the hung job) and starts executing it. (Even
>> if GFAC is handling a long-running job, it still has to update the job
>> status frequently with the same status to show that the GFAC node is
>> running.)
>>
>
> I have some comments/questions in this regard:
>
> 1. How are you going to detect that a job has hung?
>
> 2. We clearly need to distinguish between faulty jobs and faulty GFAC
> instances, because a replica GFAC instance should not pick up the job if
> the job's own logic is what leads to the hang.
>
I haven't seen a situation where the job logic itself hangs, but maybe
there are some.

> A replica GFAC instance should pick up the job only if the primary GFAC
> instance is down. I believe you proposed a locking mechanism to handle this
> scenario, but I don't see how a locking mechanism is going to resolve this
> situation. Can you explain more?
>
For example, if GFAC has logic for picking up a job which didn't respond
within a given time, there could be a scenario where two GFAC instances try
to pick up the same job. E.g., there are 3 GFAC nodes working and one goes
down with a given job, and the two other nodes recognize this at the same
time and try to launch the same job. I was talking about locks to fix this
issue.
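
A minimal sketch of that locking idea, assuming the registry is a SQL
database. The table layout, the column names, the 60-second staleness
threshold, and the function names are all hypothetical and not part of
Airavata. The point is that the takeover is a single conditional UPDATE, so
only one of two racing nodes can win:

```python
import sqlite3

# Assumed threshold: seconds without a status update before a job counts as abandoned.
STALE_AFTER = 60


def make_registry():
    """Create an in-memory stand-in for the registry (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        "experiment_id TEXT PRIMARY KEY, owner TEXT, last_update REAL)"
    )
    return db


def try_claim(db, experiment_id, node, now):
    """Try to take over a job whose owner has stopped updating its status.

    The WHERE clause re-checks staleness inside the UPDATE itself, so once
    one node has claimed the job (refreshing last_update), the same UPDATE
    from a second node matches no row and that node backs off.
    """
    cur = db.execute(
        "UPDATE jobs SET owner = ?, last_update = ? "
        "WHERE experiment_id = ? AND last_update < ?",
        (node, now, experiment_id, now - STALE_AFTER),
    )
    db.commit()
    return cur.rowcount == 1


def current_owner(db, experiment_id):
    """Return the node currently recorded as owner of the job."""
    row = db.execute(
        "SELECT owner FROM jobs WHERE experiment_id = ?", (experiment_id,)
    ).fetchone()
    return row[0]
```

With three nodes, where gfac-1 went down holding exp-1, gfac-2 and gfac-3
would both call try_claim; whichever UPDATE runs first succeeds and the
other gets False. This sketch does not tell a hung job apart from a failed
node, which is the separate question raised above.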

>
> 2. According to your description, it seems there is no communication
> between the GFAC instance and the Orchestrator. So GFAC and the
> Orchestrator exchange data through the registry (database). Performance
> might drop since we are going through a persistent medium.
>
Yes, you are correct. I am assuming we are mostly focusing on implementing
a more reliable system, and most of these jobs run for hours, so we don't
need to implement a high-performance system for long-running jobs.

>
> 3. What is the strategy to divide jobs among GFAC instances ?
>
Not sure, we have to discuss it.

>
> 4. How do we identify that a GFAC instance has failed?
>
> 5. How should GFAC instances be registered with the Orchestrator?
>
> 6. How are job cancellations handled?
>
> 7. What happens if the Orchestrator goes down?
>
This is under the assumption that the Orchestrator doesn't go down (e.g.,
like the head node in MapReduce).

>
> 8. Do monitoring execution paths go through the Orchestrator?
>
I intentionally didn't mention monitoring; how about we discuss it
separately?

>
> 9. How does failover work?
>

What do you mean, and whose failover?

>
>
>>
>> 5. GFAC creates its execution chain and stores it back in the registry
>> with the experiment ID, and GFAC updates its state using checkpointing.
>>
>>
>> 6. If we are not doing pull-based submission, then during a GFAC failure
>> the Orchestrator has to identify it and submit the active jobs from the
>> failed GFAC node to other nodes.
>>
>
> I think more communication needs to happen here.
> 1. When the Orchestrator first deposits the job, it should be in the
> unsubmitted state.
> 2. GFAC should only update the state to active after actually submitting
> it to the resource.
>
I agree; there could be a few important states, like
input transferred, job submitted, job finished, output transferred.
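
As a rough sketch only, those states could be modelled as an ordered
enumeration. The names below, the strictly linear ordering, and the advance
helper are assumptions for illustration, not an agreed design; the "active"
state mentioned above is not mapped here:

```python
from enum import Enum


class JobState(Enum):
    """Hypothetical job states, in the order a job is assumed to pass through them."""
    UNSUBMITTED = 0
    INPUT_TRANSFERRED = 1
    JOB_SUBMITTED = 2
    JOB_FINISHED = 3
    OUTPUT_TRANSFERRED = 4


def advance(current, new):
    """Allow only a move to the next state in the assumed order.

    Rejecting skipped or backward transitions means the state stored in the
    registry always says how far a job really got.
    """
    if new.value != current.value + 1:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

For example, advance(JobState.UNSUBMITTED, JobState.INPUT_TRANSFERRED) is
accepted, while jumping straight to JOB_SUBMITTED raises ValueError.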

>
> In case of a GFAC instance failure, the secondary GFAC should go through
> all unfinished jobs belonging to the failed instance and get their state by
> consulting the resource. If those jobs are still in the active state, a
> monitoring mechanism should be established. We only need to re-submit jobs
> if they are in the unsubmitted state.
>
+1.
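
The recovery rule above could look roughly like this. It is a sketch: the
state strings, the plan_recovery name, and the query_resource callback
(standing in for asking the compute resource about a job) are all
hypothetical:

```python
def plan_recovery(registry_states, query_resource):
    """Decide what a secondary GFAC does with a failed instance's unfinished jobs.

    registry_states maps experiment ID -> state recorded in the registry.
    query_resource(experiment_id) returns the state reported by the resource.
    Returns (resubmit, monitor, other) as sorted lists of experiment IDs.
    """
    resubmit, monitor, other = [], [], []
    for exp_id in sorted(registry_states):
        if registry_states[exp_id] == "unsubmitted":
            # Never reached the resource, so re-submitting cannot duplicate it.
            resubmit.append(exp_id)
        elif query_resource(exp_id) == "active":
            # Still running on the resource: attach monitoring, do not re-submit.
            monitor.append(exp_id)
        else:
            # Handling of any other case is not specified in this thread.
            other.append(exp_id)
    return resubmit, monitor, other
```

For instance, with exp-1 recorded as unsubmitted and exp-2 recorded as
active and still running on the resource, exp-1 is re-submitted and exp-2
is only monitored.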

>
> To implement this precisely, we need a two-phase-commit-like mechanism.
> Then we can make sure jobs are not duplicated.
>
+1.


Thanks, Amila, for compiling the email carefully.

Regards
Lahiru

>
>
>
>> This might cause job duplication in case the Orchestrator raises a false
>> alarm about a GFAC failure (so we have to handle this carefully).
>>
>> We have a lot more to discuss about GFAC, but I will limit our discussion
>> to the Orchestrator component for now.
>>
>> WDYT about this design ?
>>
>> Lahiru
>>
>> --
>> System Analyst Programmer
>> PTI Lab
>> Indiana University
>>
>
>


-- 
System Analyst Programmer
PTI Lab
Indiana University
