airavata-architecture mailing list archives

From Saminda Wijeratne <samin...@gmail.com>
Subject Re: Fault Tolerant Use cases & Solutions for Job Management in Airavata
Date Thu, 03 Apr 2014 18:18:58 GMT
Thanks everyone for the input. I think we now have good starting notes for
working on an FT design for Airavata job management. The next step is to
list the steps for each use case mentioned in the spreadsheet, in terms of
how Airavata performs it, in order to highlight where FT is necessary.

Will reply to this thread with any more questions we may have.
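
One way the step listing could look, as a rough sketch only: it uses the
getExperiments()/addExperiment() example quoted further down in this thread
and simulates a failure after each step to see where the persisted state ends
up inconsistent. The three tables, the three steps inside add_experiment, and
all helper names are assumptions made for illustration; they do not describe
Airavata's actual schema or code.

```python
def new_db():
    # Three hypothetical tables; the real schema is not described in this thread.
    return {"experiments": set(), "inputs": set(), "status": {}}


def get_experiments(db):
    # Read-only call: it never changes persisted data, so it can be
    # excluded from the FT analysis.
    return sorted(db["experiments"])


def add_experiment(db, exp_id, fail_after=None):
    # Each inner function stands for one separate transaction (assumed steps).
    def save_experiment():
        db["experiments"].add(exp_id)

    def save_inputs():
        db["inputs"].add(exp_id)

    def save_status():
        db["status"][exp_id] = "CREATED"

    steps = [save_experiment, save_inputs, save_status]
    for number, step in enumerate(steps, start=1):
        step()
        if fail_after == number:
            # Simulate the program dying right after this transaction.
            raise RuntimeError(f"simulated crash after step {number}")


def is_consistent(db):
    # Consistent here means every table knows about the same experiments.
    return db["experiments"] == db["inputs"] == set(db["status"])


def analyze_failure_points(total_steps=3):
    # Crash after each step in turn and record whether the state is consistent.
    report = {}
    for point in range(1, total_steps + 1):
        db = new_db()
        try:
            add_experiment(db, "exp-1", fail_after=point)
        except RuntimeError:
            pass
        report[point] = is_consistent(db)
    return report


if __name__ == "__main__":
    print(analyze_failure_points())  # {1: False, 2: False, 3: True}
```

In this sketch a crash after step 1 or step 2 leaves the tables disagreeing,
so those are the points where an FT solution (for example a single enclosing
transaction or a cleanup on restart) would be needed.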


On Wed, Apr 2, 2014 at 7:34 AM, Schwartz, Terri <terri@sdsc.edu> wrote:

> Hi Saminda,
>
> Not sure I understand your question, but regarding the 2nd paragraph, like
> you said, I wanted to keep problems like memory leaks or remote operations
> not timing out promptly from impacting anything else.  Also, the separate
> processes can easily be run on different machines if we need to scale that
> way.
>
> Terri
> ________________________________________
> From: Saminda Wijeratne [samindaw@gmail.com]
> Sent: Tuesday, April 01, 2014 5:55 PM
> To: architecture
> Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
> Airavata
>
> Thanks Amila and Terri for your valuable insights.
>
> Combining Terri's and Amila's input, do you think the actions carried out
> should be managed by internal action states or through states relating to
> various stages of an experiment? Do you have any thoughts on which design
> would be more flexible to follow?
>
> One other thing I saw in CIPRES is that you have reduced the risk of the
> whole system going down because of a failed operation in one part of the
> system by separating the main activities into different processes, i.e. the
> CIPRES portal handles only user requests and 3 independent daemons handle
> different aspects of job management. Terri, were there any other advantages
> you expected from this design?
>
> Thanks,
> Saminda
>
> On Tue, Apr 1, 2014 at 4:59 PM, Schwartz, Terri <terri@sdsc.edu> wrote:
>
> > I struggled with this in CIPRES and looked at it much like Amila is
> > saying.  Anywhere I was storing state, I would ask myself, "what happens
> > if CIPRES (or its database) crashes right before this or right after
> > this?"  What will happen when CIPRES starts up again?  Will it assume the
> > operation didn't run and retry it, and is that safe to do?  I generally
> > update state after initiating operations, not before, so I don't have to
> > deal with the possibility that we said we did something we didn't actually
> > do; I just have to deal with the possibility that we kicked something off
> > and didn't manage to record it.
> >
> > I tried to make operations idempotent as much as possible, sometimes by
> > wrapping them in code that looks for signs of a prior attempt and cleans
> > things up before proceeding.
> >
> > Terri
> > ________________________________________
> > From: Amila Jayasekara [thejaka.amila@gmail.com]
> > Sent: Tuesday, April 01, 2014 1:29 PM
> > To: architecture@airavata.apache.org
> > Subject: Re: Fault Tolerant Use cases & Solutions for Job Management in
> > Airavata
> >
> > Hmm... If I explain this in PL terms, a state basically refers to an
> > environment (a mapping of variables to their values) :-).
> >
> > But in general applications (like Airavata) the state is represented by
> > what you persist (provided you persist the right information).
> >
> > E.g.: Consider the getExperiments() API call. No matter how many times we
> > call this, it doesn't change the persisted data in the system. Therefore
> > the function getExperiments() doesn't change the state, and we can safely
> > exclude this method call when analyzing FT. Now consider addExperiment().
> > This adds an experiment to persistent storage, so it changes the state. If
> > you are doing multiple transactions within addExperiment(), you need to
> > consider the resulting state if the program fails between any two
> > transactions. If the state is inconsistent then you need to come up with
> > a solution.
> >
> >
> >
> >
> > On Tue, Apr 1, 2014 at 4:13 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> >
> > > Are you talking about modeling it similar to a state machine? If not,
> > > can you elaborate on what you meant by states in the system?
> > >
> > >
> > > On Tue, Apr 1, 2014 at 4:00 PM, Amila Jayasekara <thejaka.amila@gmail.com> wrote:
> > >
> > > > One suggestion is to first identify states in the system. Then
> > > > identify actions (operation / method invocations) which change the
> > > > state of the system. Then model FT cases by analyzing system state
> > > > after and before a failure (during those operation invocations).
> > > >
> > > > Thanks
> > > > Amila
> > > >
> > > >
> > > > On Tue, Apr 1, 2014 at 3:49 PM, Saminda Wijeratne <samindaw@gmail.com> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We are trying to identify scenarios in job management where it is
> > > > > critical to provide fault-tolerant solutions. The spreadsheet[1]
> > > > > contains a list of such use cases I have compiled to the best of my
> > > > > knowledge (which is in no way complete). Thoughts are welcome
> > > > > (reply/comment or edit the spreadsheet).
> > > > >
> > > > > I think it is particularly useful to learn how gateways like
> > > > > CIPRES/NSG/Ultrascan (which have large user bases) already handle
> > > > > these situations. Spreadsheet updated to record those as well.
> > > > >
> > > > > (if you don't have edit privileges just drop me a mail/reply)
> > > > >
> > > > > Thanks and Regards,
> > > > > Saminda
> > > > >
> > > > > 1. https://docs.google.com/spreadsheets/d/1eukcg2nXIoMzXa0GakNQVIICMd8y0UYGGjQs32232Hs/edit#gid=1448745788
> > > > >
> > > >
> > >
> >
>
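
The approach Terri describes in the quoted thread (record state only after
initiating an operation, and make the operation idempotent by looking for
signs of a prior attempt and cleaning up first) could be sketched roughly as
below. This is a minimal in-memory illustration, not CIPRES or Airavata code:
the Store and Cluster classes, submit_job, and the crash_before_record flag
are all hypothetical names introduced for the example.

```python
class Cluster:
    """Hypothetical stand-in for the remote compute resource."""

    def __init__(self):
        self.active = []  # ids of running jobs; duplicates would be a bug

    def start(self, job_id):
        self.active.append(job_id)

    def cancel_all(self, job_id):
        self.active = [j for j in self.active if j != job_id]


class Store:
    """Hypothetical stand-in for the gateway's persistent state."""

    def __init__(self):
        self.submitted = set()


def submit_job(store, cluster, job_id, crash_before_record=False):
    # Nothing to do if a previous run already recorded the submission.
    if job_id in store.submitted:
        return "already recorded"
    # Sign of a prior attempt that was never recorded: clean it up first,
    # so that retrying cannot leave two copies of the job running.
    if job_id in cluster.active:
        cluster.cancel_all(job_id)
    # Initiate the operation first...
    cluster.start(job_id)
    if crash_before_record:
        raise RuntimeError("simulated crash before recording state")
    # ...and update the persisted state only afterwards.
    store.submitted.add(job_id)
    return "submitted"


def demo():
    store, cluster = Store(), Cluster()
    try:
        submit_job(store, cluster, "job-1", crash_before_record=True)
    except RuntimeError:
        pass  # the job was kicked off but never recorded
    after_crash = (list(cluster.active), set(store.submitted))
    # On restart the operation is simply retried.
    result = submit_job(store, cluster, "job-1")
    return after_crash, result, cluster.active, store.submitted


if __name__ == "__main__":
    print(demo())
```

Because state is written after the operation starts, the only failure case to
handle is "kicked off but not recorded", and the cleanup step makes the retry
safe in that case.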
