aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: mesos-log health check HTTP endpoint
Date Wed, 22 Jun 2016 17:46:10 GMT
>
> Once it's started and can open the log, it won't crash and starts mesos-log
> recovery


My memory is fuzzy here, but I was under the impression that holes in the
log were filled before open() returned.  Have you observed otherwise?

On Wed, Jun 22, 2016 at 8:02 AM, Martin Hrabovčin <
martin.hrabovcin@gmail.com> wrote:

> If there is some obvious issue with the replicated log, then the open() call
> would fail and cause aurora to exit or restart itself. I am looking at a
> different issue: if there are 3 aurora instances that need the update, it's
> hard to tell right now at which point it's safe to move from one instance to
> another. Let's say there is a rolling update going on, applying the update to
> one aurora instance at a time. One instance is down and out of rotation. Once
> it's started and can open the log, it won't crash and starts mesos-log
> recovery. But if you start upgrading the 2nd instance before mesos-log
> is replicated to the first one, it's easy to lose quorum and data. I'd like to
> have some deterministic check that would make it possible to ensure that it's
> safe to consider the log replicated.
>
> 2016-06-17 16:05 GMT+02:00 Bill Farner <wfarner@apache.org>:
>
> > If I recall correctly, the current implementation of the mesos log requires
> > that the callers handle mutually-exclusive access for reads and writes.
> > This means that non-leading schedulers may not read or write to perform the
> > check you describe.
> >
> > What's the behavior of the scheduler when it starts and the log replica is
> > non-VOTING?  I thought the log open() call would fail, and the scheduler
> > process would exit (giving a strong signal that the scheduler is not
> > healthy).
> >
> > On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <
> > martin.hrabovcin@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I asked the same question in the #aurora channel and still haven't found
> > > an answer, so I am bringing it to the mailing list with a proposal.
> > >
> > > Is there a way to check the state of mesos-log (whether it's writable in
> > > VOTING state) through some HTTP check outside of the aurora process on a
> > > non-leading aurora instance? We are trying to create an external check
> > > that would determine whether the mesos-log is ready during an aurora
> > > rolling update. When adding a new instance to an existing aurora cluster,
> > > we want to make sure that mesos-log is replicated and the replica is ready
> > > to serve reads and writes. Currently we’re grep-ing the java process log
> > > and looking for “Persisted replica status to VOTING”.
> > >
> > > I was pointed to the /vars endpoint, but I haven't found an obvious answer
> > > there.
> > >
> > > I'd like to propose creating a new HTTP endpoint "/loghealth" that would,
> > > similarly to "/leaderhealth", return 200 when mesos-log is ready and 503
> > > when mesos-log throws an exception. As for implementation, I was thinking
> > > about doing a simple read from the log or writing a no-op to the log
> > > directly.
> > >
> > > Thanks!
> > >
> >
>
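
The external check described in the original message (grep-ing the scheduler's process log for "Persisted replica status to VOTING" and reporting 200 or 503) could be sketched roughly as below. This is only an illustration of a stand-alone sidecar, not part of Aurora and not the "/loghealth" endpoint proposed for the scheduler itself. The function names, the port, and the log path are hypothetical; only the marker string and the 200/503 convention come from the thread.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

# Log line quoted in the thread; everything else here is an assumption.
VOTING_MARKER = "Persisted replica status to VOTING"


def replica_reported_voting(log_text: str) -> bool:
    """Return True if any line of the scheduler log contains the marker."""
    return any(VOTING_MARKER in line for line in log_text.splitlines())


def health_status(log_path: str) -> int:
    """Map the log scan to an HTTP status: 200 if the marker is present, else 503."""
    try:
        text = Path(log_path).read_text(errors="replace")
    except OSError:
        # An unreadable or missing log is treated as "not ready".
        return 503
    return 200 if replica_reported_voting(text) else 503


def make_handler(log_path: str):
    """Build a request handler bound to one (hypothetical) scheduler log path."""

    class LogHealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/loghealth":
                self.send_error(404)
                return
            status = health_status(log_path)
            body = b"OK\n" if status == 200 else b"NOT READY\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return LogHealthHandler


def serve(log_path: str, port: int = 8089) -> None:
    """Serve the check on localhost until interrupted (port is arbitrary)."""
    HTTPServer(("127.0.0.1", port), make_handler(log_path)).serve_forever()
```

Calling `serve("/path/to/scheduler.log")` with the real log location would start the check. A scan like this has the same weaknesses as the grep it mirrors: a marker left over from an earlier run of the process still matches, and the line only shows that the replica status was persisted, so it does not settle the question raised above about whether the log has been fully replicated.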
