airavata-dev mailing list archives

From Lahiru Gunathilake <glah...@gmail.com>
Subject Re: Airavata Orchestrator component
Date Mon, 09 Dec 2013 14:12:47 GMT
Hi Raman,


On Fri, Dec 6, 2013 at 12:34 PM, Raminder Singh <rsandhu1@gmail.com> wrote:

> Lahiru: Can you please start a document to record this conversation? There
> are very valuable points to record, and we don't want to lose anything in
> email threads.
>
> My comments are inline with prefix RS>>:
>
> On Dec 5, 2013, at 10:12 PM, Lahiru Gunathilake <glahiru@gmail.com> wrote:
>
> Hi Amila,
>
> I have answered the questions you raised, except for some of the how-to
> questions (for those we need to figure out solutions, and before that we need
> to come up with a good design).
>
>
> On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:
>
>>
>>
>>
>> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <glahiru@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We are thinking of implementing an Airavata Orchestrator component to
>>> replace WorkflowInterpreter, so that gateway developers don't have to deal with
>>> workflows when they simply have a single independent job to run in their
>>> gateways. This component mainly focuses on how to invoke GFAC and
>>> accept requests from the client API.
>>>
>>> I have the following features in mind for this component.
>>>
>>> 1. It exposes a web services or REST interface against which we can implement a
>>> client to submit jobs.
>>>
>> RS >> We need an API method to handle this, and the protocol interfacing of the API
> can be handled separately using Thrift or web services.
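Just to make that concrete, a rough sketch of what such a method could look like (plain Java here; the real interface would be generated from Thrift or a WSDL, and all names below are made up, not the actual Airavata API):

import java.util.Map;

// Hypothetical client-facing interface; the concrete protocol (Thrift, SOAP,
// REST) would just be a binding in front of this method.
public interface OrchestratorService {

    /**
     * Accepts a job request, persists it, and returns the generated Airavata
     * experiment ID that every other component uses to refer to the job.
     */
    String submitExperiment(String applicationName, Map<String, String> inputs);
}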
>
>
>>> 2. Accepts a job request and parses the input types; if the input types
>>> are correct, it creates an Airavata experiment ID.
>>>
>> RS >> In my view, we need to save every request to the registry before
> verification and record an input configuration error if the inputs are not
> correct. That will help us find any API invocation errors.
>
+1, we need to save the request to the registry right away.
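Roughly what I have in mind is below (a sketch only; Registry and the state names are made-up placeholders, not the real data model):

import java.util.Map;
import java.util.UUID;

// Sketch of "persist first, validate second". Registry and its methods are
// hypothetical stand-ins for whatever the real registry API ends up being.
public class ExperimentIntake {

    private final Registry registry;

    public ExperimentIntake(Registry registry) {
        this.registry = registry;
    }

    public String accept(String applicationName, Map<String, String> inputs) {
        // Save every request immediately so API invocation errors stay traceable.
        String experimentId = "exp-" + UUID.randomUUID();
        registry.saveRequest(experimentId, applicationName, inputs);

        // Only then validate; a bad request stays in the registry with an
        // explicit configuration-error state instead of being silently dropped.
        if (validateInputs(applicationName, inputs)) {
            registry.updateState(experimentId, "CREATED");
        } else {
            registry.updateState(experimentId, "INPUT_CONFIGURATION_ERROR");
        }
        return experimentId;
    }

    private boolean validateInputs(String applicationName, Map<String, String> inputs) {
        // Placeholder: check the declared input types against the application descriptor.
        return inputs != null && !inputs.isEmpty();
    }

    /** Minimal registry abstraction used only for this sketch. */
    public interface Registry {
        void saveRequest(String experimentId, String applicationName, Map<String, String> inputs);
        void updateState(String experimentId, String state);
    }
}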

>
>
>>> 3. The Orchestrator then stores the job information in the registry against the
>>> generated experiment ID (all the other components identify the job using
>>> this experiment ID).
>>>
>>> 4. After that, the Orchestrator pulls up all the descriptors related to this
>>> request, does some scheduling to decide where to run the job, and submits
>>> the job to a GFAC node (handling multiple GFAC nodes is going to be a
>>> future improvement in the Orchestrator).
>>>
>>> If we are trying to do pull-based job submission, it might be a good way
>>> to handle errors: if we store jobs in the registry and GFAC pulls jobs and
>>> executes them, the Orchestrator component really doesn't have to worry about
>>> error handling.
>>>
>>
>> I did not quite understand what you meant by "pull-based job
>> submission". I believe it means saving the job in the registry, with GFAC
>> periodically looking for new jobs and submitting them.
>>
> Yes.
>
> RS >> I think the orchestrator should call GFAC to invoke the job rather than have GFAC
> poll for jobs. The orchestrator should decide which instance of GFAC it
> submits the job to, and if there is a system error, bring up or communicate
> with another instance. I think a pull-based model for GFAC will add overhead.
> We will add another point of failure.
>
Can you please explain a bit more what you meant by "another point of
failure" and "add an overhead"?

>
> Further, why are you saying you don't need to worry about error handling?
>> What sort of errors are you considering?
>>
> I am considering GFAC failures, or the connection between the Orchestrator and GFAC
> going down.
>
>>
>>
>>>
>>> Because we can implement logic in GFAC so that if a particular job is not
>>> updating its status for a given time, it assumes the job is hung or the GFAC
>>> node which handles that job has failed, so another GFAC instance pulls that job (we
>>> definitely need a locking mechanism here, to avoid two instances
>>> executing the hung job) and starts executing it. (If GFAC is handling a
>>> long-running job it still has to update the job status frequently with the
>>> same status, to make sure the GFAC node is running.)
>>>
>>
>> I have some comments/questions on this regard;
>>
>> 1. How are you going to detect that a job is hung?
>>
>> 2. We clearly need to distinguish between faulty jobs and faulty GFAC
>> instances, because GFAC replication should not pick up a job whose own logic is
>> leading to the hang.
>>
> I haven't seen a situation where the job's own logic hangs, but maybe there are such cases.
>
>> GFAC replication should pick up the job only if the primary GFAC instance is
>> down. I believe you proposed the locking mechanism to handle this scenario, but
>> I don't see how the locking mechanism is going to resolve this situation. Can you
>> explain more?
>>
> For example, if GFAC has logic for picking up a job which didn't respond within
> a given time, there could be a scenario where two GFAC instances try to pick up
> the same job. Ex: there are 3 GFAC nodes working and one goes down with a
> given job, and the two other nodes recognize this at the same time and try to
> launch the same job. I was talking about locks to fix this issue.
>
> RS >> One way to handle this is to look at the job walltime. If the walltime for a
> running job has expired and we still don't have the status of the job, then
> we can go ahead and check the status and start cleaning up the job.
>
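To make the locking idea a bit more concrete: the claim could be a single atomic, conditional update in the registry so that only one instance wins. A rough sketch (JobRegistry and all the names here are hypothetical, not an existing API):

// The important property is that claimJob() is atomic, e.g. a conditional
// "UPDATE ... WHERE last_update < ? AND owner <> ?" in the registry database,
// so if two GFAC nodes notice the same hung job, only one claim succeeds.
public class HungJobRecovery {

    private final JobRegistry registry;
    private final long staleAfterMillis;
    private final String myInstanceId;

    public HungJobRecovery(JobRegistry registry, long staleAfterMillis, String myInstanceId) {
        this.registry = registry;
        this.staleAfterMillis = staleAfterMillis;
        this.myInstanceId = myInstanceId;
    }

    /** Returns true only if this instance won the claim and should re-execute the job. */
    public boolean tryTakeOver(String experimentId) {
        long staleBefore = System.currentTimeMillis() - staleAfterMillis;
        // Atomic in the registry: succeeds only if the job's last status update
        // is older than the threshold and nobody else has claimed it meanwhile.
        return registry.claimJob(experimentId, myInstanceId, staleBefore);
    }

    /** Minimal abstraction used only for this sketch. */
    public interface JobRegistry {
        boolean claimJob(String experimentId, String newOwner, long staleBeforeMillis);
    }
}

Raman's walltime point fits the same shape: staleAfterMillis could be derived from the job's requested walltime instead of being a fixed window.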

>
>> 2. According to your description, it seems there is no communication
>> between the GFAC instances and the Orchestrator, so GFAC and the Orchestrator exchange
>> data through the registry (database). Performance might drop since we are going
>> through a persistent medium.
>>
> Yes, you are correct. I am assuming we are mostly focused on implementing a
> more reliable system; most of these jobs run for hours, and we don't
> need a high-performance design for a system with long-running
> jobs.
>
> RS >> We need to discuss this. I think the orchestrator should only maintain the
> state of the request, not GFAC.
>
>
>> 3. What is the strategy to divide jobs among GFAC instances?
>>
> Not sure, we have to discuss it.
>
>
>> 4. How do we identify that a GFAC instance has failed?
>>
>
>> 5. How should GFAC instances be registered with the orchestrator?
>>
> RS >> We need to have a mechanism which records how many GFAC instances are
> running and how many jobs per instance.
>
If we are going to do the pull-based model this is going to be a hassle;
otherwise the orchestrator can keep track of that.
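If the orchestrator is the one calling GFAC, that bookkeeping could live in the orchestrator itself. A rough sketch (made-up names, and a real implementation would also need heartbeats/health checks to notice a dead instance):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of in-orchestrator bookkeeping of GFAC instances and their load.
// Assumes a single scheduling thread; a real version would need stronger
// coordination and failure detection.
public class GfacInstancePool {

    private final Map<String, AtomicInteger> runningJobs = new ConcurrentHashMap<>();

    /** Called when a GFAC instance registers itself with the orchestrator. */
    public void register(String gfacUrl) {
        runningJobs.putIfAbsent(gfacUrl, new AtomicInteger(0));
    }

    /** Pick the least-loaded instance and account for the new job. */
    public String pickInstanceForNewJob() {
        String best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (Map.Entry<String, AtomicInteger> entry : runningJobs.entrySet()) {
            int load = entry.getValue().get();
            if (load < bestLoad) {
                best = entry.getKey();
                bestLoad = load;
            }
        }
        if (best == null) {
            throw new IllegalStateException("No GFAC instances registered");
        }
        runningJobs.get(best).incrementAndGet();
        return best;
    }

    /** Called when GFAC reports that the job finished or failed. */
    public void jobFinished(String gfacUrl) {
        AtomicInteger counter = runningJobs.get(gfacUrl);
        if (counter != null) {
            counter.decrementAndGet();
        }
    }
}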

>
>
>> 6. How are job cancellations handled?
>>
> RS >> Cancelling a single job is simple, and we should have an API function to
> cancel based on the experiment ID and/or the local job ID.
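Agreed. If we end up with a service interface like the submitExperiment() sketch earlier in this thread, cancellation could be one more method on it, keyed by the experiment ID (again, made-up names only):

// Hypothetical addition to the orchestrator API sketch above.
public interface OrchestratorCancellation {

    /**
     * Cancel the experiment identified by its Airavata experiment ID; the
     * orchestrator would look up the local job ID in the registry and ask
     * the owning GFAC instance to clean the job up.
     */
    void cancelExperiment(String experimentId);
}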
>
>
>> 7. What happens if the Orchestrator goes down?
>>
> This is under the assumption that the Orchestrator doesn't go down (e.g., like the head node
> in MapReduce).
>
> RS >> I think registration of the job happens outside the orchestrator, and the
> orchestrator/GFAC progress the states.
>
>
>
Regards
Lahiru

-- 
System Analyst Programmer
PTI Lab
Indiana University
