mesos-dev mailing list archives

From Jojy Varghese <j...@mesosphere.io>
Subject Re: MESOS-4581 Docker containers getting stuck in staging.
Date Wed, 03 Feb 2016 03:34:12 GMT
Hi Travis
 Thanks for narrowing down the issue. I had a brief look at your patch, and it looks like it
relies on adding a delay before inspect is called. Although that might work most of the time, I am
wondering if that is the right solution. It would be better to have a timeout (using ‘after’
on the future) and retry inspect after the timeout. We would have to discard the inspect future
that's in flight.
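
To make the idea concrete, here is a minimal sketch of the timeout-and-retry pattern. It is written in Python with asyncio rather than libprocess, so it only illustrates the shape of the approach and is not the actual executor code. The names `inspect_with_retry`, `inspect`, `timeout` and `max_attempts` are hypothetical. `asyncio.wait_for` cancels the pending call when the timeout expires, which stands in for discarding the in-flight inspect future.

```python
import asyncio


async def inspect_with_retry(inspect, timeout, max_attempts):
    """Await `inspect()` until one attempt finishes within `timeout` seconds.

    `inspect` is a hypothetical zero-argument coroutine function standing in
    for the docker inspect call; it is an assumption made for illustration.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the in-flight call when the timeout expires,
            # so a stuck attempt is abandoned before the next one starts.
            return await asyncio.wait_for(inspect(), timeout)
        except asyncio.TimeoutError:
            # Give up only after the last allowed attempt has timed out.
            if attempt == max_attempts:
                raise
```

Compared with a fixed delay before the first inspect, this bounds how long a stuck call can block the launch and does not depend on guessing how long docker needs.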

-Jojy


> On Feb 2, 2016, at 1:12 PM, Hegner, Travis <THegner@trilliumit.com> wrote:
> 
> I'd like to initiate a discussion on the following issue:
> 
> https://issues.apache.org/jira/browse/MESOS-4581
> 
> I've included a lot of detail in the JIRA, and would rather not reiterate _all_ of it
> here on the list, but in short:
> 
> We are experiencing an issue when launching docker containers from marathon on mesos,
> where the container actually starts on the slave node to which it's assigned, but mesos/marathon
> get stuck in staging/staged respectively until the task launch times out and the system tries
> again to launch it elsewhere. This issue is random in nature: tasks start successfully
> about 40-50% of the time, and get stuck the rest of the time.
> 
> We've been able to narrow this down to a possible race condition, likely in docker itself
> but triggered by the mesos-docker-executor. I have written and tested a patch in our
> environment which seems to have eliminated the issue; however, I feel that the patch could
> be made more robust, and is currently just a work-around.
> 
> Thanks for your time and consideration of the issue.
> 
> Travis

