aurora-dev mailing list archives

From Martin Hrabovčin <martin.hrabov...@gmail.com>
Subject Re: mesos-log health check HTTP endpoint
Date Mon, 27 Jun 2016 11:21:44 GMT
I ran the following test:

I had a cluster of 5 aurora instances; each instance had a synced mesos-log
replica and was working correctly. All instances had the mesos-log ZK path
configured as /aurora/native-log. Then I started simulating a rolling
update by stopping one of the instances. I removed its replicated log data
(but I didn't run the mesos-log initialize command) and pointed that
single instance's mesos-log configuration to /aurora/native-log-nonexisting.
I used a different mesos-log ZK path to simulate a deployment problem and
mesos-log's inability to sync. I started aurora and, watching the logs, saw
that the mesos-log replica got to the EMPTY state. I let it run for 15 minutes
and there were no complaints in the logs; /health was returning 200 and
/leaderhealth 503, as expected. Aurora eventually crashed only when I
forced it to become leader by restarting the other instances (which kept
the correct configuration).

I am still pretty new to mesos-log, so I can't say 100% that this test proves
anything, but I'd expect aurora to crash sooner rather than wait until it
takes leadership. Since all health checks are passing, it's hard to guess the
state of mesos-log. I was hoping it would be possible to try to read
from the log, where an exception would mean it's not ready.
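
For reference, the log-grepping workaround mentioned earlier in this thread
could be sketched roughly as below. This is only an illustration, not
anything aurora ships: the function names are made up, and it assumes that
replica states other than VOTING are logged with the same "Persisted replica
status to ..." prefix, which I have not verified.

```python
import re
from typing import Iterable, Optional

# Message prefix quoted earlier in this thread. ASSUMPTION: other replica
# states (e.g. EMPTY) are logged with the same prefix.
STATUS_RE = re.compile(r"Persisted replica status to (\w+)")


def last_replica_status(lines: Iterable[str]) -> Optional[str]:
    """Return the most recent replica status found in the log lines, if any."""
    status = None
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            status = match.group(1)  # later lines override earlier ones
    return status


def replica_is_voting(lines: Iterable[str]) -> bool:
    """True only if the last status transition seen in the log is VOTING."""
    return last_replica_status(lines) == "VOTING"


# Hypothetical usage; the path is a placeholder, not a real aurora default:
# with open("/path/to/scheduler.log") as log_file:
#     safe_to_continue = replica_is_voting(log_file)
```

Looking at the last transition rather than any occurrence matters for a case
like the test above, where a replica could have been VOTING before its data
was removed.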

2016-06-22 19:46 GMT+02:00 Bill Farner <wfarner@apache.org>:

> >
> > Once it's started and it can open the log, it won't crash and starts
> > mesos-log recovery
>
>
> My memory is fuzzy here, but I was under the impression that holes in the
> log were filled before open() returned.  Have you observed otherwise?
>
> On Wed, Jun 22, 2016 at 8:02 AM, Martin Hrabovčin <
> martin.hrabovcin@gmail.com> wrote:
>
> > If there is some obvious issue with the replicated log, then the open() call
> > would fail and cause aurora to exit or restart itself. I am looking at a
> > different issue: if there are 3 aurora instances that need the update, it's
> > hard to tell right now at which point it's safe to move from one instance to
> > another. Let's say there is a rolling update going on, applying the update to
> > one aurora instance at a time. One instance is down and out of rotation. Once
> > it's started and it can open the log, it won't crash and starts mesos-log
> > recovery. But if you start upgrading the 2nd instance before the mesos-log
> > is replicated to the first one, it's easy to lose quorum and data. I'd like
> > to have some deterministic check that would make it possible to ensure that
> > it's safe to consider the log replicated.
> >
> > 2016-06-17 16:05 GMT+02:00 Bill Farner <wfarner@apache.org>:
> >
> > > If I recall correctly, the current implementation of the mesos log requires
> > > that the callers handle mutually-exclusive access for reads and writes.
> > > This means that non-leading schedulers may not read or write to perform the
> > > check you describe.
> > >
> > > What's the behavior of the scheduler when it starts and the log replica is
> > > non-VOTING?  I thought the log open() call would fail, and the scheduler
> > > process would exit (giving a strong signal that the scheduler is not
> > > healthy).
> > >
> > > On Fri, Jun 17, 2016 at 2:44 AM, Martin Hrabovčin <
> > > martin.hrabovcin@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I asked the same question in the #aurora channel and still haven't found
> > > > an answer, so I am bringing it to the mailing list with a proposal.
> > > >
> > > > Is there a way to check the state of mesos-log (whether it's writable in
> > > > VOTING state) through some HTTP check outside of the aurora process on a
> > > > non-leading aurora instance? We are trying to create an external check
> > > > that would determine whether the mesos-log is ready during an aurora
> > > > rolling update. When adding a new instance to an existing aurora cluster,
> > > > we want to make sure that mesos-log is replicated and the replica is ready
> > > > to serve reads and writes. Currently we’re grepping the java process log
> > > > and looking for “Persisted replica status to VOTING”.
> > > >
> > > > I was pointed to the /vars endpoint but I haven't found an obvious answer
> > > > there.
> > > >
> > > > I'd like to propose creating a new HTTP endpoint "/loghealth" that would,
> > > > similarly to "/leaderhealth", return 200 when mesos-log is ready and 503
> > > > when mesos-log throws an exception. As for implementation, I was thinking
> > > > about doing a simple read from the log or writing a noop to the log
> > > > directly.
> > > >
> > > > Thanks!
> > > >
> > >
> >
>
