mesos-dev mailing list archives

From haosdent <haosd...@gmail.com>
Subject Re: MESOS-4581 Docker containers getting stuck in staging.
Date Wed, 03 Feb 2016 16:14:26 GMT
>Should Docker::inspect return to the caller on timeout, or follow a retry
>system like it does with a non-zero exit code?
Yes, this depends on how you implement your timeout logic in `after`. If
you discard the future and return a Failure in `after`, the future chain
stops there. But you could also retry in `after`. For example, since the
code already uses a `promise` (a concept closely related to Future), you
could retry with
```
      Clock::timer(retryInterval.get(),
                   [=]() { _inspect(cmd, promise, retryInterval); } );
```
mirroring the retry logic that already exists in `Docker::__inspect`.
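
Putting the two together, the `after` body could discard the hung
subprocess and schedule a fresh attempt through the same promise. A rough
sketch, assuming a retry interval is set (`timeoutSeconds` and the exact
wiring are illustrative, not an agreed design):
```
  s.get().status()
    .after(timeoutSeconds,
           [=](Future<Option<int>> status) -> Future<Option<int>> {
      // The inspect subprocess appears hung: stop waiting on it and
      // schedule a fresh attempt that reuses the same promise.
      status.discard();
      Clock::timer(retryInterval.get(),
                   [=]() { _inspect(cmd, promise, retryInterval); });
      // Fail this chain; the retry above will eventually satisfy the
      // promise, so this failure must not be propagated to it.
      return Failure("docker inspect timed out");
    })
    .onAny([=](const Future<Option<int>>& status) {
      if (status.isFailed()) {
        return; // A retry is already scheduled; leave the promise alone.
      }
      __inspect(cmd, promise, retryInterval, output, s.get());
    });
```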

>I still believe the initial delay is important
I think keeping the delay is OK, given that launches consistently fail
without it. Once Docker fixes this problem, we could remove the delay
again.

One more suggestion: I think we should continue the implementation
discussion in JIRA or on ReviewBoard.


On Wed, Feb 3, 2016 at 11:40 PM, Hegner, Travis <THegner@trilliumit.com>
wrote:

> Thanks haosdent, it makes sense to allow the inspect command to time out
> on its own. Should Docker::inspect return to the caller on timeout, or
> follow a retry system like it does with a non-zero exit code? If the
> latter, should it retry forever if it times out every time?
>
> Are you suggesting this in addition to the patch I have introducing the
> delay so far? I still believe the initial delay is important, as without it
> we would encounter the docker inspect timeout very frequently in our
> environment.
>
> Thanks,
>
> Travis
>
> -----Original Message-----
> From: haosdent [mailto:haosdent@gmail.com]
> Sent: Wednesday, February 3, 2016 10:05 AM
> To: dev <dev@mesos.apache.org>
> Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
>
> Hi, Hegner. I think Jojy means we change Docker::inspect from
> ```
>   s.get().status()
>     .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
> ```
> to
> ```
>   s.get().status()
>     .after(timeoutSeconds, []() {...})
>     .onAny([=]() { __inspect(cmd, promise, retryInterval, output, s.get()); });
> ```
>
> Suppose the 500ms delay doesn't work in a different environment; we would
> still run into the same problem. I think adding a timeout limit to inspect
> would be better.
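>
> A minimal sketch of the `{...}` body above, for the simple fail-on-timeout
> variant (`timeoutSeconds` is a placeholder name, not the final API):
> ```
>     .after(timeoutSeconds,
>            [](Future<Option<int>> status) -> Future<Option<int>> {
>       // Give up on the hung `docker inspect` subprocess; __inspect
>       // will then see a failed status and fail the promise.
>       status.discard();
>       return Failure("docker inspect timed out");
>     })
> ```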
>
> On Wed, Feb 3, 2016 at 10:05 PM, Hegner, Travis <THegner@trilliumit.com>
> wrote:
>
> > Thanks Jojy,
> >
> > I agree that my current patch is not the most ideal solution, as it
> > was intended as a proof of concept to ensure that it was in fact the
> > timing that was causing the problems I've been experiencing. Perhaps I
> > didn't make that clear enough in the JIRA. That is also the reason I
> > haven't attempted to submit that patch back.
> >
> > I was hoping to open a dialog to discuss the most appropriate fix for
> > that error, as I am relatively new to the future/promise paradigm, and
> > I haven't done much in C++ in 10 years.
> >
> > I'm assuming that when you say "using 'after' on the future" you are
> > referring to the initial inspect call's future, correct? In that case,
> > I'm not convinced that is the right solution either. If you dig into
> > the inspect call, you'll see that it has a fairly sophisticated retry
> > system itself. However, that retry system is designed around whether
> > the actual inspect command completes its execution. If the inspect
> > command runs forever, then that thread blocks waiting for it forever.
> >
> > I am leaning towards a combination of both. I believe that it is
> > appropriate to wait a small amount of time before the initial inspect
> > call, to give docker a chance to create and start the container. I see
> > your point that there is still the possibility of a race condition if
> > the run command itself is delayed by just the right amount of time; I
> > understand that the run thread could be executed any time after the
> > future is created. When running vanilla mesos (without my current
> > patch), remember that the containers that launch successfully actually
> > do so because the initial inspect fails (returns non-zero, as the
> > container doesn't exist yet), and the inspect function's built-in retry
> > system delays 500ms and runs again, this time successfully. (See these
> > logs for what I mean:
> > https://gist.github.com/travishegner/dcb5797f5a0919e45618)
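> >
> > Roughly, that built-in retry path looks like this (a simplified sketch
> > of the logic in src/docker/docker.cpp, not the verbatim code):
> > ```
> >   // Inside Docker::__inspect (simplified): a non-zero exit status
> >   // means the container may not exist yet, so schedule a retry.
> >   if (status.isSome() && status.get() != 0) {
> >     if (retryInterval.isSome()) {
> >       // DOCKER_INSPECT_DELAY (500ms) is only consumed here, as the
> >       // retry interval; the very first inspect runs immediately.
> >       Clock::timer(retryInterval.get(),
> >                    [=]() { _inspect(cmd, promise, retryInterval); });
> >       return;
> >     }
> >     promise->fail("Failed to run '" + cmd + "'");
> >     return;
> >   }
> > ```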
> >
> > With that in mind, we wouldn't actually know whether the inspect is
> > just delayed in its own retry pattern, or is actually hung, unless we
> > wait a large enough amount of time. Given how frequently we encounter
> > this issue in our environment (more than half of the attempted
> > container launches), waiting a large amount of time to indicate
> > whether the container is running would be undesirable.
> >
> > How do you feel about an initial delay (how long: 500ms?), in concert
> > with an ".after()" (how long: 3000ms?) on the inspect future that will
> > discard the original inspect and try again? The initial delay would
> > give docker time to create and start our container, and, in my
> > environment, would drastically reduce the frequency with which we are
> > met with this condition. Ironically, the name of the retry delay
> > variable (DOCKER_INSPECT_DELAY) seems to indicate that the author
> > expected a delay to be inserted before the inspect call anyway, but
> > reading through the code, that variable is only used as the retry
> > interval, and the initial docker inspect is executed immediately. A
> > timeout/retry on the inspect call also helps guard against this type
> > of issue, should the run and inspect commands be launched
> > simultaneously for whatever reason.
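> >
> > In code, that combination would be something like this (a hypothetical
> > sketch; `initialDelay` and `inspectTimeout` are placeholder names for
> > the two knobs in question, not existing parameters):
> > ```
> >   Owned<Promise<Docker::Container>> promise(
> >       new Promise<Docker::Container>());
> >
> >   // 1. Initial delay (~500ms?): give docker time to create and start
> >   //    the container before the first inspect.
> >   Clock::timer(initialDelay, [=]() {
> >     docker->inspect(containerName, DOCKER_INSPECT_DELAY)
> >       // 2. Timeout (~3000ms?): if inspect hangs, discard it and fail
> >       //    so the caller can decide to retry.
> >       .after(inspectTimeout,
> >              [](Future<Docker::Container> inspect)
> >                  -> Future<Docker::Container> {
> >         inspect.discard();
> >         return Failure("docker inspect timed out");
> >       })
> >       .onAny([=](const Future<Docker::Container>& inspect) {
> >         promise->associate(inspect);
> >       });
> >   });
> >
> >   return promise->future();
> > ```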
> >
> > A final thought: are we looking at this from the wrong perspective?
> > Should the inspect call itself be modified to handle the case when the
> > inspect command never returns?
> >
> > Thanks for your thoughts,
> >
> > Travis
> >
> > -----Original Message-----
> > From: Jojy Varghese [mailto:jojy@mesosphere.io]
> > Sent: Tuesday, February 2, 2016 10:34 PM
> > To: dev@mesos.apache.org
> > Subject: Re: MESOS-4581 Docker containers getting stuck in staging.
> >
> > Hi Travis
> >  Thanks for narrowing down the issue. I had a brief look at your patch
> > and it looks like it relies on adding a delay before inspect is called.
> > Although that might mostly work, I am wondering if it is the right
> > solution. It would be better if we could have a timeout (using ‘after’
> > on the future) and retry inspect after the timeout. We would have to
> > discard the inspect future that's in flight.
> >
> > -Jojy
> >
> >
> > > On Feb 2, 2016, at 1:12 PM, Hegner, Travis <THegner@trilliumit.com>
> > > wrote:
> > >
> > > I'd like to initiate a discussion on the following issue:
> > >
> > > https://issues.apache.org/jira/browse/MESOS-4581
> > >
> > > I've included a lot of detail in the JIRA, and would rather not
> > > reiterate _all_ of it here on the list, but in short:
> > >
> > > We are experiencing an issue when launching docker containers from
> > > marathon on mesos, where the container actually starts on the slave
> > > node to which it's assigned, but mesos/marathon get stuck in
> > > staging/staged respectively until the task launch times out and the
> > > system tries to launch it elsewhere. The issue is random in nature:
> > > tasks start successfully about 40-50% of the time and get stuck the
> > > rest of the time.
> > >
> > > We've been able to narrow this down to a possible race condition,
> > > likely in docker itself but triggered by the mesos-docker-executor. I
> > > have written and tested a patch in our environment which seems to
> > > eliminate the issue; however, I feel the patch could be made more
> > > robust, and it is currently just a work-around.
> > >
> > > Thanks for your time and consideration of the issue.
> > >
> > > Travis
> >
> >
>
>
> --
> Best Regards,
> Haosdent Huang
>



-- 
Best Regards,
Haosdent Huang
