airavata-architecture mailing list archives

From Lahiru Gunathilake <glah...@gmail.com>
Subject Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata
Date Wed, 02 Apr 2014 02:24:15 GMT
Actually, I am planning to draw a state diagram and a sequence diagram for
the Airavata backend. I will post them soon.


On Tue, Apr 1, 2014 at 8:55 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:

> Thanks Amila and Terri for your valuable insights.
>
> Combining Terri's and Amila's input, do you think the actions carried out
> should be managed by internal action states or through states relating to
> various stages of an experiment? Do you have any thoughts on which design
> would be more flexible to follow?
>
> One other thing I saw in CIPRES is that you have reduced the risk of the
> whole system going down because of a failed operation in one part of the
> system by separating the main activities into different processes, i.e., the
> CIPRES portal handles only user requests, and 3 independent daemons handle
> different aspects of job management. Terri, are there any other advantages
> you expected from this design?
>
> Thanks,
> Saminda
>
> On Tue, Apr 1, 2014 at 4:59 PM, Schwartz, Terri <terri@sdsc.edu> wrote:
>
> > I struggled with this in CIPRES and looked at it much like Amila is
> > saying.  Anywhere I was storing state, I would ask myself, "What happens
> > if CIPRES (or its database) crashes right before this or right after
> > this?"  What will happen when CIPRES starts up again?  Will it assume the
> > operation didn't run and retry it, and is that safe to do?  I generally
> > update state after initiating operations, not before, so I don't have to
> > deal with the possibility that we said we did something we didn't
> > actually do; I just have to deal with the possibility that we kicked
> > something off and didn't manage to record it.
> >
> > I tried to make operations idempotent as much as possible, sometimes by
> > wrapping them in code that looks for signs of a prior attempt and cleans
> > things up before proceeding.
> >
> > Terri
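A minimal Python sketch of the pattern Terri describes above: initiate the operation first, record state afterwards, and make the retry safe by looking for signs of a prior attempt. Every name here (JobStore, Cluster, submit_idempotent, the "SUBMITTED" label) is hypothetical and comes from neither CIPRES nor Airavata. In this variation the retry adopts a job left behind by an earlier attempt rather than cleaning it up.

```python
class JobStore:
    """Hypothetical persistent store; a dict stands in for the database."""

    def __init__(self):
        self.state = {}


class Cluster:
    """Hypothetical compute resource that runs jobs."""

    def __init__(self):
        self.running = set()
        self.submit_calls = 0

    def submit(self, job_id):
        self.submit_calls += 1
        self.running.add(job_id)

    def is_running(self, job_id):
        return job_id in self.running


def submit_idempotent(job_id, cluster, store, crash_before_record=False):
    """Submit a job so that retrying after a crash is safe."""
    # State was recorded earlier, so there is nothing left to do.
    if store.state.get(job_id) == "SUBMITTED":
        return "already-recorded"
    # Sign of a prior attempt: the job was kicked off but never recorded.
    if cluster.is_running(job_id):
        store.state[job_id] = "SUBMITTED"
        return "recovered"
    # Initiate the operation first ...
    cluster.submit(job_id)
    if crash_before_record:
        raise RuntimeError("simulated crash after submit, before record")
    # ... and update the persisted state only afterwards.
    store.state[job_id] = "SUBMITTED"
    return "submitted"
```

If the process dies between submit() and the state update, the next call finds the running job and records it instead of submitting a duplicate.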
> > ________________________________________
> > From: Amila Jayasekara [thejaka.amila@gmail.com]
> > Sent: Tuesday, April 01, 2014 1:29 PM
> > To: architecture@airavata.apache.org
> > Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
> > Airavata
> >
> > Hmm... If I explain this in PL concepts, a state basically refers to an
> > environment (a mapping of variables to their values) :-).
> >
> > But in general applications (like Airavata), the state is represented by
> > what you persist (provided you persist the right information).
> >
> > E.g., consider the getExperiments() API call. No matter how many times we
> > call it, it doesn't change the persisted data in the system. Therefore
> > getExperiments() doesn't change the state, and we can safely exclude
> > this method call when analyzing FT. Now consider addExperiment(). This
> > adds an experiment to persistent storage, so it changes the state. If
> > you are doing multiple transactions within addExperiment(), you need to
> > consider the resulting state if the program fails between any two
> > transactions. If that state is inconsistent, then you need to come up
> > with a solution.
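To make the addExperiment() point concrete, here is a small Python sketch using the standard sqlite3 module. The two-table schema, the function names, and the consistency rule (every experiment must have at least one input row) are assumptions made up for illustration; they are not Airavata's actual registry schema or API. It compares a two-transaction version, which can leave inconsistent state if the program fails in between, with a single-transaction version, which is one possible solution.

```python
import sqlite3


def make_db():
    # Hypothetical schema: an experiment plus its input rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE experiment (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE experiment_input (exp_id TEXT, value TEXT)")
    conn.commit()
    return conn


def add_experiment_two_tx(conn, exp_id, value, fail_between=False):
    # First transaction: commits on leaving the with-block.
    with conn:
        conn.execute("INSERT INTO experiment VALUES (?)", (exp_id,))
    if fail_between:
        raise RuntimeError("simulated crash between transactions")
    # Second transaction: never runs if the crash happened above.
    with conn:
        conn.execute(
            "INSERT INTO experiment_input VALUES (?, ?)", (exp_id, value)
        )


def add_experiment_one_tx(conn, exp_id, value, fail_between=False):
    # Single transaction: an exception rolls back both inserts.
    with conn:
        conn.execute("INSERT INTO experiment VALUES (?)", (exp_id,))
        if fail_between:
            raise RuntimeError("simulated crash inside the transaction")
        conn.execute(
            "INSERT INTO experiment_input VALUES (?, ?)", (exp_id, value)
        )


def is_consistent(conn):
    # Assumed rule: no experiment may exist without an input row.
    orphans = conn.execute(
        "SELECT COUNT(*) FROM experiment e WHERE NOT EXISTS "
        "(SELECT 1 FROM experiment_input i WHERE i.exp_id = e.id)"
    ).fetchone()[0]
    return orphans == 0
```

Calling add_experiment_two_tx(conn, "exp-1", "x", fail_between=True) leaves an experiment row without inputs, so is_consistent(conn) is false. The single-transaction variant rolls back and leaves the tables unchanged.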
> >
> >
> >
> >
> > On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> >
> > > Are you talking about modeling it similar to a state machine? If not,
> > > can you elaborate on what you meant by states in the system?
> > >
> > >
> > > On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:
> > >
> > > > One suggestion is to first identify the states in the system. Then
> > > > identify the actions (operations / method invocations) that change the
> > > > state of the system. Then model the FT cases by analyzing the system
> > > > state before and after a failure (during those operation invocations).
> > > >
> > > > Thanks
> > > > Amila
> > > >
> > > >
> > > > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We are trying to identify scenarios in job management for which it
> > > > > is critical to provide fault-tolerant solutions. The spreadsheet[1]
> > > > > contains a list of such use cases that I have compiled to the best
> > > > > of my knowledge (it is in no way complete). Thoughts are welcome
> > > > > (reply/comment or edit the spreadsheet).
> > > > >
> > > > > I think it is particularly useful to learn how gateways like
> > > > > CIPRES/NSG/Ultrascan (which have large user bases) already handle
> > > > > these situations. The spreadsheet has been updated to record those
> > > > > as well.
> > > > >
> > > > > (if you don't have edit privileges just drop me a mail/reply)
> > > > >
> > > > > Thanks and Regards,
> > > > > Saminda
> > > > >
> > > > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
> > > > >
> > > >
> > >
> >
>



-- 
System Analyst Programmer
PTI Lab
Indiana University
