airavata-architecture mailing list archives

From Amila Jayasekara <thejaka.am...@gmail.com>
Subject Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata
Date Wed, 02 Apr 2014 14:30:35 GMT
On Tue, Apr 1, 2014 at 8:55 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:

> Thanks Amila and Terri for your valuable insights.
>
> Combining Terri's and Amila's input, do you think the actions carried out
> should be managed by internal action states or through states relating to
> various stages of an experiment? Do you have any thoughts on which design
> would be more flexible to follow?
>

I am sorry, I didn't quite understand what you meant by "internal action
states".
I think Terri pointed out some of the questions you should ask when
designing FT. Also, making operations idempotent as much as possible is
better. But we need to be careful about the intermediate actions we perform
when making operations idempotent. In other words, we need to be concerned
that failures could occur in between those intermediate actions.
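
To illustrate the point about intermediate actions, here is a minimal Python
sketch of an idempotent operation that can be re-run after a crash at any
step. The function name stage_input and the ".partial" suffix are
hypothetical, made up for this example; nothing here is taken from Airavata
or CIPRES code.

```python
import os


def stage_input(work_dir: str, name: str, data: bytes) -> str:
    """Write an input file so that re-running after a crash is safe.

    Hypothetical helper for illustration only.
    """
    final_path = os.path.join(work_dir, name)
    partial_path = final_path + ".partial"

    # A prior attempt already finished: nothing to do.
    # (This sketch assumes the same name always carries the same data.)
    if os.path.exists(final_path):
        return final_path

    # A prior attempt crashed between the intermediate steps: clean up.
    if os.path.exists(partial_path):
        os.remove(partial_path)

    # Intermediate action: write to a temporary name first.
    with open(partial_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Final action: a single rename, so the file is either there or not.
    os.replace(partial_path, final_path)
    return final_path
```

A crash before the rename leaves only the ".partial" file, which the next
attempt removes; a crash after the rename leaves the finished file, which the
next attempt simply returns.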

Another way to look at this is as "atomic transactions" (we either write
the state or don't write at all). "2-phase commit" kinds of protocols will
also be useful when implementing proper FT.
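
As a small sketch of the "write everything or nothing" idea, the Python
example below wraps several inserts in one SQLite transaction. The table
layout, the add_experiment function and the fail_after parameter are
assumptions for illustration; they are not Airavata's registry schema or API,
and this shows a single-database transaction, not 2-phase commit.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal schema for the example.
    conn.execute("CREATE TABLE experiment (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE job (id TEXT PRIMARY KEY, experiment_id TEXT NOT NULL)"
    )
    conn.commit()


def add_experiment(conn, exp_id, job_ids, fail_after=None):
    """Persist an experiment and its jobs atomically.

    fail_after simulates a crash after that many job rows were written.
    """
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute("INSERT INTO experiment (id) VALUES (?)", (exp_id,))
        for i, job_id in enumerate(job_ids):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated failure between writes")
            conn.execute(
                "INSERT INTO job (id, experiment_id) VALUES (?, ?)",
                (job_id, exp_id),
            )
```

If the simulated failure fires, neither the experiment row nor any job row
remains, so a retry starts from a consistent state.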

Some time back I wrote a document [1] related to FT in Airavata.
But it is more focused on workflow-based Airavata. You may refer to it, but
I am not sure whether you will be able to get anything out of it.

[1] https://drive.google.com/#folders/0B8luRDeqz22gYUdTVEdJS1ZMSkU

Thanks
Amila

>
> One other thing I saw in CIPRES is that you have reduced the risk of the
> whole system going down because of a failure in one part of the system
> by separating the main activities into different processes, i.e. the CIPRES
> portal handles only user requests and 3 independent daemons handle
> different aspects of job management. Terri, are there any other advantages
> you expected from this design?
>
> Thanks,
> Saminda
>
> On Tue, Apr 1, 2014 at 4:59 PM, Schwartz, Terri <terri@sdsc.edu> wrote:
>
> > I struggled with this in cipres and looked at it much like Amila is
> > saying.  Anywhere I was storing state, I would ask myself, "what happens
> > if cipres (or its database) crashes right before this or right after
> > this?"  What will happen when cipres starts up again?  Will it assume the
> > operation didn't run and retry it, and is that safe to do?  I generally
> > update state after initiating operations, not before, so I don't have to
> > deal with the possibility that we said we did something we didn't actually
> > do; I just have to deal with the possibility that we kicked something off
> > and didn't manage to record it.
> >
> > I tried to make operations idempotent as much as possible, sometimes by
> > wrapping them in code that looks for signs of a prior attempt and cleans
> > things up before proceeding.
> >
> > Terri
> > ________________________________________
> > From: Amila Jayasekara [thejaka.amila@gmail.com]
> > Sent: Tuesday, April 01, 2014 1:29 PM
> > To: architecture@airavata.apache.org
> > Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
> > Airavata
> >
> > Hmm... If I explain this in PL concepts a state basically refers to an
> > environment (mapping of variables to their values) :-).
> >
> > But in general applications (like Airavata) the state is represented by
> > what you persist (provided you persist the right information).
> >
> > E.g.: Consider the getExperiments() API call. No matter how many times we
> > call this, it doesn't change the persisted data in the system. Therefore
> > getExperiments() doesn't change the state, and we can safely exclude this
> > method call when analyzing FT. Now consider addExperiment(). This adds an
> > experiment to persistent storage, so it changes the state. If you are
> > doing multiple transactions within addExperiment(), you need to consider
> > the resulting state if the program fails in between each transaction. If
> > the state is inconsistent then you need to come up with a solution.
> >
> >
> >
> >
> > On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> >
> > > Are you talking about modeling it similar to a state machine? If not, can
> > > you elaborate what you meant by states in the system?
> > >
> > >
> > > On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:
> > >
> > > > One suggestion is to first identify states in the system. Then identify
> > > > actions (operation / method invocations) which change the state of the
> > > > system. Then model FT cases by analyzing system state after and before a
> > > > failure (during those operation invocations).
> > > >
> > > > Thanks
> > > > Amila
> > > >
> > > >
> > > > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We are trying to identify scenarios in job management where it is
> > > > > critical to provide fault tolerant solutions. The spreadsheet [1]
> > > > > contains a list of such use cases I have compiled to the best of my
> > > > > knowledge (which is in no way complete). Thoughts are welcome
> > > > > (reply/comment or edit the spreadsheet).
> > > > >
> > > > > I think it is particularly useful to learn how gateways like
> > > > > CIPRES/NSG/Ultrascan (which have a large user base) already handle
> > > > > these situations. The spreadsheet has been updated to record those
> > > > > as well.
> > > > >
> > > > > (If you don't have edit privileges just drop me a mail/reply.)
> > > > >
> > > > > Thanks and Regards,
> > > > > Saminda
> > > > >
> > > > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
> > > > >
> > > >
> > >
> >
>
