airavata-architecture mailing list archives

From "Schwartz, Terri" <te...@sdsc.edu>
Subject RE: Fault Tolerant Use cases & Solutions for Job Management in Airavata
Date Tue, 01 Apr 2014 20:59:35 GMT
I struggled with this in CIPRES and looked at it much like Amila is saying. Anywhere I was
storing state, I would ask myself, "What happens if CIPRES (or its database) crashes right
before this or right after this? What will happen when CIPRES starts up again? Will it
assume the operation didn't run and retry it, and is that safe to do?" I generally update state
after initiating operations, not before, so I don't have to deal with the possibility that we
said we did something we didn't actually do; I only have to deal with the possibility that we
kicked something off and didn't manage to record it.

I tried to make operations idempotent as much as possible, sometimes by wrapping them in code
that looks for signs of a prior attempt and cleans things up before proceeding. 
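
As a rough illustration of the two habits described above (record state only after initiating
the operation, and make a retry clean up after an unrecorded earlier attempt), here is a
minimal Python sketch. Everything in it is hypothetical: submit_job, the per-job working
directory, and the in-memory state dict are stand-ins, not CIPRES or Airavata code.

```python
import shutil
import tempfile
from pathlib import Path


def submit_job(job_id, work_root, state, crash_before_record=False):
    """Hypothetical idempotent submit: safe to retry after a crash at any point."""
    if state.get(job_id) == "SUBMITTED":
        return  # already recorded, so a retry is a no-op

    work_dir = work_root / job_id
    # Signs of a prior attempt that was never recorded: clean up before proceeding.
    if work_dir.exists():
        shutil.rmtree(work_dir)

    # Initiate the operation (stand-in for staging files and submitting the job).
    work_dir.mkdir(parents=True)
    (work_dir / "input.txt").write_text("payload")

    if crash_before_record:
        raise RuntimeError("simulated crash before state was recorded")

    # Record state only after the operation has been initiated.
    state[job_id] = "SUBMITTED"


root = Path(tempfile.mkdtemp())
state = {}  # stand-in for the persistent state store

try:
    submit_job("job-1", root, state, crash_before_record=True)
except RuntimeError:
    pass  # the work directory exists, but nothing was recorded

submit_job("job-1", root, state)  # retry on restart: cleans up, redoes, records
submit_job("job-1", root, state)  # a further retry changes nothing
```

The crash leaves the "kicked something off and didn't manage to record it" case, which the
retry handles by removing the leftover directory before starting over.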

Terri
________________________________________
From: Amila Jayasekara [thejaka.amila@gmail.com]
Sent: Tuesday, April 01, 2014 1:29 PM
To: architecture@airavata.apache.org
Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata

Hmm... If I explain this in PL concepts, a state basically refers to an
environment (a mapping of variables to their values) :-).

But in general applications (like Airavata) the state is represented by
what you persist (provided you persist the right information).

E.g.: Consider the getExperiments() API call. No matter how many times we call
this, it doesn't change the persisted data in the system. Therefore
getExperiments() doesn't change the state, and we can safely
exclude this method call when analyzing FT. Now consider addExperiment().
This adds an experiment to persistent storage, so it changes the state. If
you are doing multiple transactions within addExperiment(), you need to
consider the resulting state if the program fails in between each transaction.
If the state is inconsistent then you need to come up with a solution.
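
To make the addExperiment() case concrete, here is a minimal Python sketch using the standard
sqlite3 module. The table names, the add_experiment signature, and the fail_between flag are
assumptions made up for illustration; this is not Airavata's schema or API. It shows one
possible solution: put both writes in a single database transaction so that a failure in
between leaves no partial state.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE experiment_inputs (experiment_id TEXT, value TEXT)")
conn.commit()


def add_experiment(conn, exp_id, inputs, fail_between=False):
    """Hypothetical addExperiment with two writes in one transaction."""
    # The with-block commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute("INSERT INTO experiments (id) VALUES (?)", (exp_id,))
        if fail_between:
            raise RuntimeError("simulated failure between the two writes")
        conn.executemany(
            "INSERT INTO experiment_inputs (experiment_id, value) VALUES (?, ?)",
            [(exp_id, v) for v in inputs],
        )


def count(conn, table):
    # Table names here are fixed literals from this sketch, not user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


try:
    add_experiment(conn, "exp-1", ["a", "b"], fail_between=True)
except RuntimeError:
    pass
# State after the failure: the first insert was rolled back as well.
after_failure = (count(conn, "experiments"), count(conn, "experiment_inputs"))

# Because no partial state was left behind, retrying the call is safe.
add_experiment(conn, "exp-1", ["a", "b"])
after_retry = (count(conn, "experiments"), count(conn, "experiment_inputs"))
```

If the two writes were committed separately instead, the failure would leave an experiment row
with no inputs, which is the kind of inconsistent state the analysis is meant to find.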




On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:

> Are you talking about modeling it similar to a state machine? If not, can
> you elaborate on what you meant by states in the system?
>
>
> On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:
>
> > One suggestion is to first identify states in the system. Then identify
> > actions (operation / method invocations) which change the state of the
> > system. Then model FT cases by analyzing system state after and before a
> > failure (during those operation invocations).
> >
> > Thanks
> > Amila
> >
> >
> > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > We are trying to identify scenarios in job management for which it is
> > > critical to provide fault tolerant solutions. The spreadsheet[1] contains
> > > a list of such use cases I have compiled to the best of my knowledge
> > > (which is in no way complete). Thoughts are welcome (reply/comment or
> > > edit the spreadsheet).
> > >
> > > I think it is particularly useful to learn how gateways like
> > > CIPRES/NSG/Ultrascan (which have large user bases) already handle these
> > > situations. The spreadsheet has been updated to record those as well.
> > >
> > > (if you don't have edit privileges just drop me a mail/reply)
> > >
> > > Thanks and Regards,
> > > Saminda
> > >
> > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
