Date: Thu, 3 Sep 2015 01:44:24 +0800
Subject: Re: mesos-slave crashing with CHECK_SOME
From: haosdent
To: user@mesos.apache.org

If you could show the content of `path` in CHECK_SOME, it would be easier to
debug here. According to the log in
https://groups.google.com/forum/#!topic/marathon-framework/oKXhfQUcoMQ and the
0.22.1 code:

    const string& path = paths::getExecutorSentinelPath(
        metaDir, info.id(), framework->id, executor->id, executor->containerId);

framework->id ==> 20141209-011108-1378273290-5050-23221-0001
executor->id ==> tools.1d52eed8-062c-11e5-90d3-f2a3161ca8ab

metaDir can be derived from your slave work_dir, and info.id() is your slave
id. Could you find the executor->containerId in the complete slave log? And if
you can reproduce this problem every time, it would be very helpful to add a
trace log to the slave and recompile it.
On Thu, Sep 3, 2015 at 12:49 AM, Tim Chen wrote:
> Hi Scott,
>
> I wonder if you can try the latest Mesos and see if you can repro this?
>
> And if it is, can you put down the example task and steps? I couldn't see
> disk full in your slave log, so I'm not sure if it's exactly the same
> problem as MESOS-2684.
>
> Tim
>
> On Wed, Sep 2, 2015 at 5:15 AM, Scott Rankin wrote:
>
>> Hi Marco,
>>
>> I certainly don't want to start a flame war, and I actually realized
>> after I added my comment to MESOS-2684 that it's not quite the same thing.
>>
>> As far as I can tell, in our situation, there's no underlying disk
>> issue. It seems like this is some sort of race condition (maybe?) with
>> Docker containers and executors shutting down. I'm perfectly happy with
>> Mesos choosing to shut down in the case of a failure or unexpected
>> situation – that's a methodology that we adopt ourselves. I'm just trying
>> to get a little more information about what the underlying issue is so that
>> we can resolve it. I don't know enough about Mesos internals to be able to
>> answer that question just yet.
>>
>> It's also inconvenient because, while Mesos is well-behaved and restarts
>> gracefully, as of 0.22.1, it's not recovering the Docker executors – so a
>> mesos-slave crash also brings down applications.
>>
>> Thanks,
>> Scott
>>
>> From: Marco Massenzio
>> Reply-To: "user@mesos.apache.org"
>> Date: Tuesday, September 1, 2015 at 7:33 PM
>> To: "user@mesos.apache.org"
>> Subject: Re: mesos-slave crashing with CHECK_SOME
>>
>> That's one of those areas for discussion that is so likely to generate a
>> flame war that I'm hesitant to wade in :)
>>
>> In general, I would agree with the sentiment expressed there:
>>
>> > If the task fails, that is unfortunate, but not the end of the world.
>> Other tasks should not be affected.
>>
>> which is, in fact, to a large extent exactly what Mesos does; the example
>> given in MESOS-2684, as it happens, is for a "disk full failure" – carrying
>> on as if nothing had happened is only likely to lead to further (and
>> worse) disappointment.
>>
>> The general philosophy back at Google (and which certainly informs the
>> design of Borg[0]) was "fail early, fail hard", so that either (a) the
>> service is restarted and hopefully the root cause cleared, or (b) someone
>> (who can hopefully do something) will be alerted about it.
>>
>> I think it's ultimately a matter of scale: up to a few tens of servers,
>> you can assume there is some sort of 'log-monitor' that looks out for
>> errors and other anomalies and alerts humans who will then take a look and
>> possibly apply some corrective action – when you're up to hundreds or
>> thousands (definitely Mesos territory) that's not practical: the system
>> should either self-heal or crash-and-restart.
>>
>> All this to say that it's difficult to come up with a general
>> *automated* approach to unequivocally decide if a failure is "fatal" or
>> could just be safely "ignored" (after appropriate error logging) – in
>> general, when in doubt it's probably safer to "noisily crash & restart" and
>> rely on the overall system's HA architecture to take care of replication
>> and consistency
>> (and an intelligent monitoring system that only alerts when some failure
>> threshold is exceeded).
>>
>> From what I've seen so far (granted, still a novice here) it seems that
>> Mesos subscribes to this notion, assuming that Agent Nodes will come and
>> go, and usually Tasks survive (for a certain amount of time anyway) a Slave
>> restart (obviously, if the physical h/w is the ultimate cause of failure,
>> well, then all bets are off).
>>
>> Having said all that – if there are areas where we have been over-eager
>> with our CHECKs, we should definitely revisit that and make it more
>> crash-resistant, absolutely.
>>
>> [0] http://research.google.com/pubs/pub43438.html
>>
>> *Marco Massenzio*
>> *Distributed Systems Engineer http://codetrips.com *
>>
>> On Mon, Aug 31, 2015 at 12:47 PM, Steven Schlansker <
>> sschlansker@opentable.com> wrote:
>>
>>> On Aug 31, 2015, at 11:54 AM, Scott Rankin wrote:
>>> >
>>> > tag=mesos-slave[12858]: F0831 09:37:29.838184 12898 slave.cpp:3354]
>>> CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
>>>
>>> I reported a similar bug a while back:
>>>
>>> https://issues.apache.org/jira/browse/MESOS-2684
>>>
>>> This seems to be a class of bugs where some filesystem operations which
>>> may fail for unforeseen reasons are written as assertions which crash the
>>> process, rather than failing only the task and communicating back the error
>>> reason.
>>>
>> This email message contains information that Motus, LLC considers
>> confidential and/or proprietary, or may later designate as confidential and
>> proprietary. It is intended only for use of the individual or entity named
>> above and should not be forwarded to any other persons or entities without
>> the express consent of Motus, LLC, nor should it be used for any purpose
>> other than in the course of any potential or actual business relationship
>> with Motus, LLC. If the reader of this message is not the intended
>> recipient, or the employee or agent responsible to deliver it to the
>> intended recipient, you are hereby notified that any dissemination,
>> distribution, or copying of this communication is strictly prohibited. If
>> you have received this communication in error, please notify sender
>> immediately and destroy the original message.
>>
>> Internal Revenue Service regulations require that certain types of
>> written advice include a disclaimer.
>> To the extent the preceding message
>> contains advice relating to a Federal tax issue, unless expressly stated
>> otherwise the advice is not intended or written to be used, and it cannot
>> be used by the recipient or any other taxpayer, for the purpose of avoiding
>> Federal tax penalties, and was not written to support the promotion or
>> marketing of any transaction or matter discussed herein.

-- 
Best Regards,
Haosdent Huang