mesos-issues mailing list archives

From "Andrei Budnik (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-7742) ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
Date Fri, 12 Jan 2018 12:22:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318666#comment-16318666 ]

Andrei Budnik edited comment on MESOS-7742 at 1/12/18 12:21 PM:
----------------------------------------------------------------

The io switchboard [terminates|https://github.com/apache/mesos/blob/3d8ef23c0ecec028641d7beee4c85233495a030b/src/slave/containerizer/mesos/io/switchboard.cpp#L1218]
itself once io redirection has finished.
If the io switchboard terminates before it receives {{\r\n\r\n}}, or before the agent receives
the {{200 OK}} response from the io switchboard, the connection to the agent (over a unix socket)
is closed. The agent's {{ConnectionProcess}} then treats this as an unexpected
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
while [reading|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
the response. As a result, the agent answers the {{ATTACH_CONTAINER_INPUT}} request with
{{500 Internal Server Error}}.
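
The failure mode described above can be reproduced in miniature. The following is only a sketch, not Mesos code: it models the agent and the io switchboard as the two ends of a unix socket pair, and the names {{switchboard}} and {{agent_attach_input}}, the request line, and the mapping of EOF to a 500 status are all illustrative assumptions.

```python
import socket
import threading


def switchboard(conn, respond):
    """Hypothetical stand-in for the io switchboard end of the unix socket."""
    with conn:
        conn.recv(4096)  # consume the (small) request sent by the agent side
        if respond:
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
        # Otherwise return without writing anything: the peer only sees
        # the socket being closed, i.e. EOF.


def agent_attach_input(respond):
    """Hypothetical stand-in for the agent waiting for the response."""
    agent_end, switchboard_end = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_STREAM)
    peer = threading.Thread(target=switchboard, args=(switchboard_end, respond))
    peer.start()
    try:
        with agent_end:
            # Illustrative request head only; not the real agent request.
            agent_end.sendall(
                b"POST / HTTP/1.1\r\nTransfer-Encoding: chunked\r\n\r\n")
            data = b""
            while b"\r\n\r\n" not in data:
                chunk = agent_end.recv(4096)
                if not chunk:
                    # EOF before a complete response head arrived.
                    return "500 Internal Server Error"
                data += chunk
            # Status line is "HTTP/1.1 <code> <reason>"; keep code and reason.
            return data.split(b"\r\n", 1)[0].split(b" ", 1)[1].decode()
    finally:
        peer.join()
```

Calling {{agent_attach_input(False)}} models the switchboard exiting before it has answered, so the reader hits EOF; {{agent_attach_input(True)}} models the expected path.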


was (Author: abudnik):
Since we launch the [`cat`|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/tests/api_tests.cpp#L6529]
command as a nested container, the related ioswitchboard process is in the same process group.
Whenever the process group leader ({{cat}}) terminates, all processes in the process group are
killed, including ioswitchboard.
ioswitchboard handles HTTP requests from the slave, e.g. the {{ATTACH_CONTAINER_INPUT}} request
in this test.
Usually, after reading all of the client's data, {{Http::_attachContainerInput()}} invokes a callback
which calls [writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/http.cpp#L3223].
[writer.close()|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L561]
implies sending [\r\n\r\n|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1045]
to the ioswitchboard process.
ioswitchboard returns a [200 OK|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/src/slave/containerizer/mesos/io/switchboard.cpp#L1572]
response, hence the agent returns {{200 OK}} for the {{ATTACH_CONTAINER_INPUT}} request as expected.
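
To illustrate the end-of-body marker in the normal path: assuming the request body is streamed with HTTP/1.1 chunked transfer encoding (an assumption, not something stated above), the {{\r\n\r\n}} would be the tail of the final zero-length chunk. A minimal sketch of such an encoder follows; {{encode_chunk}} and {{encode_body}} are hypothetical helper names, not libprocess functions.

```python
# Final zero-length chunk of a chunked body; note that it ends in \r\n\r\n.
LAST_CHUNK = b"0\r\n\r\n"


def encode_chunk(data: bytes) -> bytes:
    """Frame one chunk as: size in hex, CRLF, payload, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)


def encode_body(pieces) -> bytes:
    """Encode the non-empty pieces, then append the terminating chunk."""
    return b"".join(encode_chunk(p) for p in pieces if p) + LAST_CHUNK
```

Under this assumption, a receiver that has not yet seen the terminating chunk still considers the request body incomplete, which is the window in which an early exit of the receiving process matters.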

However, if ioswitchboard terminates before it receives {{\r\n\r\n}}, or before the agent receives
the {{200 OK}} response from ioswitchboard, the connection (over a unix socket) might be closed,
so the corresponding {{ConnectionProcess}} will handle this case as an unexpected
[EOF|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1293]
during the [read|https://github.com/apache/mesos/blob/3290b401d20f2db2933294470ea8a2356a47c305/3rdparty/libprocess/src/http.cpp#L1216]
of the response. That leads to a {{500 Internal Server Error}} response from the agent.

> ContentType/AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky
> ------------------------------------------------------------------------------
>
>                 Key: MESOS-7742
>                 URL: https://issues.apache.org/jira/browse/MESOS-7742
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>            Assignee: Andrei Budnik
>              Labels: flaky-test, mesosphere-oncall
>         Attachments: AgentAPITest.LaunchNestedContainerSession-badrun.txt, LaunchNestedContainerSessionDisconnected-badrun.txt
>
>
> Observed this on ASF CI and internal Mesosphere CI. Affected tests:
> {noformat}
> AgentAPIStreamingTest.AttachInputToNestedContainerSession
> AgentAPITest.LaunchNestedContainerSession
> AgentAPITest.AttachContainerInputAuthorization/0
> AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> AgentAPITest.LaunchNestedContainerSessionDisconnected/1
> {noformat}
> This issue comes in at least three different flavours. Take {{AgentAPIStreamingTest.AttachInputToNestedContainerSession}} as an example.
> h5. Flavour 1
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "503 Service Unavailable"
> Expected: http::OK().status
> Which is: "200 OK"
>     Body: ""
> {noformat}
> h5. Flavour 2
> {noformat}
> ../../src/tests/api_tests.cpp:6473
> Value of: (response).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
>     Body: "Disconnected"
> {noformat}
> h5. Flavour 3
> {noformat}
> /home/ubuntu/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-ubuntu-16.04/mesos/src/tests/api_tests.cpp:6367
> Value of: (sessionResponse).get().status
>   Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
>     Body: ""
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
