Date: Thu, 3 Sep 2015 01:44:24 +0800
Subject: Re: mesos-slave crashing with CHECK_SOME
From: haosdent
To: user@mesos.apache.org

If you could show the content of `path` in CHECK_SOME, it would be easier to
debug here. According to the log in
https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and the
0.22.1 code:

    const string& path = paths::getExecutorSentinelPath(
        metaDir, info.id(), framework->id, executor->id, executor->containerId);

framework->id ==> 20141209-011108-1378273290-5050-23221-0001
executor->id ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be derived from your slave work_dir, and info.id() is your slave
id. Could you find the executor->containerId in the complete slave log? And if
you can reproduce this problem every time, it would be very helpful to add a
trace log to the slave and recompile it.
On Thu, Sep 3, 2015 at 12:49 AM, Tim Chen wrote:
> Hi Scott,
>
> I wonder if you can try the latest Mesos and see if you can repro this?
>
> And if it is, can you put down the example task and steps? I couldn't see
> disk full in your slave log, so I'm not sure if it's exactly the same
> problem as MESOS-2684.
>
> Tim
>
> On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin wrote:
>
>> Hi Marco,
>>
>> I certainly don't want to start a flame war, and I actually realized
>> after I added my comment to MESOS-2684 that it's not quite the same thing.
>>
>> As far as I can tell, in our situation, there's no underlying disk
>> issue. It seems like this is some sort of race condition (maybe?) with
>> Docker containers and executors shutting down. I'm perfectly happy with
>> Mesos choosing to shut down in the case of a failure or unexpected
>> situation – that's a methodology that we adopt ourselves. I'm just trying
>> to get a little more information about what the underlying issue is so that
>> we can resolve it. I don't know enough about Mesos internals to be able to
>> answer that question just yet.
>>
>> It's also inconvenient because, while Mesos is well-behaved and restarts
>> gracefully, as of 0.22.1, it's not recovering the Docker executors – so a
>> mesos-slave crash also brings down applications.
>>
>> Thanks,
>> Scott
>>
>> From: Marco Massenzio
>> Reply-To: "user@mesos.apache.org"
>> Date: Tuesday, September 1, 2015 at 7:33 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: mesos-slave crashing with CHECK_SOME
>>
>> That's one of those areas for discussion that is so likely to generate a
>> flame war that I'm hesitant to wade in :)
>>
>> In general, I would agree with the sentiment expressed there:
>>
>> > If the task fails, that is unfortunate, but not the end of the world.
>> Other tasks should not be affected.
>>
>> which is, in fact, to a large extent exactly what Mesos does; the example
>> given in MESOS-2684, as it happens, is for a "disk full failure" – carrying
>> on as if nothing had happened is only likely to lead to further (and
>> worse) disappointment.
>>
>> The general philosophy back at Google (and which certainly informs the
>> design of Borg[0]) was "fail early, fail hard", so that either (a) the
>> service is restarted and hopefully the root cause cleared, or (b) someone
>> (who can hopefully do something) will be alerted about it.
>>
>> I think it's ultimately a matter of scale: up to a few tens of servers,
>> you can assume there is some sort of 'log-monitor' that looks out for
>> errors and other anomalies and alerts humans who will then take a look and
>> possibly apply some corrective action – when you're up to hundreds or
>> thousands (definitely Mesos territory) that's not practical: the system
>> should either self-heal or crash-and-restart.
>>
>> All this to say that it's difficult to come up with a general
>> *automated* approach to unequivocally decide if a failure is "fatal" or
>> could just be safely "ignored" (after appropriate error logging) – in
>> general, when in doubt it's probably safer to "noisily crash & restart" and
>> rely on the overall system's HA architecture to take care of replication
>> and consistency
>> (and an intelligent monitoring system that only alerts when some failure
>> threshold is exceeded).
>>
>> From what I've seen so far (granted, still a novice here) it seems that
>> Mesos subscribes to this notion, assuming that Agent Nodes will come and
>> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
>> restart (obviously, if the physical h/w is the ultimate cause of failure,
>> well, then all bets are off).
>>
>> Having said all that – if there are areas where we have been over-eager
>> with our CHECKs, we should definitely revisit that and make it more
>> crash-resistant, absolutely.
>>
>> [0] http://research.google.com/pubs/pub43438.html
>>
>> *Marco Massenzio*
>> *Distributed Systems Engineer http://codetrips.com *
>>
>> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
>> sschlansker@opentable.com> wrote:
>>
>>> On Aug 31, 2015, at 11:54 AM, Scott Rankin wrote:
>>> >
>>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>>> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>>
>>> I reported a similar bug a while back:
>>>
>>> https://issues.apache.org/jira/browse/MESOS-2684
>>>
>>> This seems to be a class of bugs where some filesystem operations which
>>> may fail for unforeseen reasons are written as assertions which crash the
>>> process, rather than failing only the task and communicating back the error
>>> reason.
>>>
>> This email message contains information that Motus, LLC considers
>> confidential and/or proprietary, or may later designate as confidential and
>> proprietary. It is intended only for use of the individual or entity named
>> above and should not be forwarded to any other persons or entities without
>> the express consent of Motus, LLC, nor should it be used for any purpose
>> other than in the course of any potential or actual business relationship
>> with Motus, LLC. If the reader of this message is not the intended
>> recipient, or the employee or agent responsible to deliver it to the
>> intended recipient, you are hereby notified that any dissemination,
>> distribution, or copying of this communication is strictly prohibited. If
>> you have received this communication in error, please notify sender
>> immediately and destroy the original message.
>>
>> Internal Revenue Service regulations require that certain types of
>> written advice include a disclaimer.
>> To the extent the preceding message
>> contains advice relating to a Federal tax issue, unless expressly stated
>> otherwise the advice is not intended or written to be used, and it cannot
>> be used by the recipient or any other taxpayer, for the purpose of avoiding
>> Federal tax penalties, and was not written to support the promotion or
>> marketing of any transaction or matter discussed herein.

-- 
Best Regards,
Haosdent Huang