mesos-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zameer Manji <zma...@apache.org>
Subject Re: Debugging Scheduler HTTP API Failures
Date Mon, 15 Aug 2016 18:31:02 GMT
Dario was right.

I filed MESOS-6041 <https://issues.apache.org/jira/browse/MESOS-6041> to
enhance the error message.

On Sun, Aug 14, 2016 at 10:35 PM, Dario Rexin <drexin@apple.com> wrote:

> Zameer,
>
> the header value is enclosed in []. This is because headers can have
> multiple values and the library you use pus them into a list. You have to
> take the first item from that list and then it should work.
>
> On Aug 14, 2016, at 10:19 PM, Zameer Manji <zmanji@apache.org> wrote:
>
> Here is a MWE: https://github.com/zmanji/mesos-mwe
>
> Follow the instructions in the README to reproduce.
>
> On Sun, Aug 14, 2016 at 9:04 PM, Dario Rexin <drexin@apple.com> wrote:
>
>> Can you post the code somewhere?
>>
>>
>> On Aug 14, 2016, at 8:54 PM, Zameer Manji <zmanji@apache.org> wrote:
>>
>> Dario,
>>
>> The logs show that no disconnections occur until after the second POST
>> request. I would expect a log entry indicating a disconnect between the two
>> POST requests if the stream id changed.
>>
>> On Sun, Aug 14, 2016 at 8:28 PM, Dario Rexin <drexin@apple.com> wrote:
>>
>>> You’re absolutely right. I just tried the exact same steps and it worked
>>> fine for me. I also don’t see the log message. Do you have any reconnection
>>> logic in place? Is it possible, that your framework reconnected before you
>>> send the call? The Stream Id would change in that case.
>>>
>>>
>>> On Aug 14, 2016, at 6:48 PM, Zameer Manji <zmanji@apache.org> wrote:
>>>
>>> Dario,
>>>
>>> I do not think the case sensitivity matters here. If the master was
>>> expecting a header that was exactly 'Mesos-Stream-Id' and did not see it, I
>>> would expect to get the error response: `All non-subscribe calls should
>>> include the 'Mesos-Stream-Id' header`. That is the error response that you
>>> get when you do not set the header.
>>>
>>> Possibly related, I expected to see the stream id in the mesos logs. I
>>> see this log message
>>> <https://github.com/apache/mesos/blob/c9b70582e9fccab8f6863b0bd3a812b5969a8c24/src/master/master.cpp#L7473-L7474>
in
>>> the code, but I do not see it in the logs.
>>>
>>>
>>> On Sun, Aug 14, 2016 at 6:12 PM, Dario Rexin <drexin@apple.com> wrote:
>>>
>>>> Oh, sorry, I didn't see you actually set the header (wall of text ;) ).
>>>> That's an interesting issue, do you set the header case sensitive? I know
>>>> headers shouldn't be case sensitive, but maybe there's a bug in the Mesos
>>>> code. I have not seen this issue before.
>>>>
>>>> On Aug 14, 2016, at 5:58 PM, Zameer Manji <zmanji@apache.org> wrote:
>>>>
>>>> Hey,
>>>>
>>>> I'm using the Mesos HTTP API for the first time. I am currently
>>>> encountering an issue where after a successful SUBSCRIBE call and receiving
>>>> a SUBSCRIBED and HEARTBEAT event, a subsequent TEARDOWN call fails with
>>>> HTTP 400 with a message of "The stream ID included in this request didn't
>>>> match the stream ID currently associated with framework ID".
>>>>
>>>> Here is a detailed breakdown of what happens with logs:
>>>>
>>>> A new framework sends an SUBSCRIBE call with the following body:
>>>>
>>>> ````
>>>> framework_id {
>>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>> }
>>>> type: SUBSCRIBE
>>>> subscribe {
>>>>   framework_info {
>>>>     user: "user"
>>>>     name: "name"
>>>>     id {
>>>>       value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>>     }
>>>>   }
>>>> }
>>>> ````
>>>>
>>>> It then receives a 200 OK response with the following headers:
>>>> `{content-type=[application/x-protobuf], date=[Sat, 13 Aug 2016
>>>> 02:42:48 GMT], transfer-encoding=[chunked], mesos-stream-id=[71a0294f-e9c4
>>>> -4efe-b237-fb120836aaf8]}`
>>>>
>>>> Over this connection it receives a successful subscribed event:
>>>> ````
>>>> type: SUBSCRIBED
>>>> subscribed {
>>>>   framework_id {
>>>>     value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>>   }
>>>>   heartbeat_interval_seconds: 15.0
>>>> }
>>>> ````
>>>>
>>>> It also receives a single heart beat event.
>>>>
>>>> Then it tries to send the following request:
>>>> ````
>>>> Sending: framework_id {
>>>>   value: "0dffbee9-a514-4ffa-87e1-2850dd4dcf00"
>>>> }
>>>> type: TEARDOWN
>>>> ````
>>>> with the following headers:
>>>> `{accept=[application/x-protobuf], accept-encoding=[gzip],
>>>> mesos-stream-id=[71a0294f-e9c4-4efe-b237-fb120836aaf8]}`
>>>>
>>>> The response is a 400 with the body: `The stream ID included in this
>>>> request didn't match the stream ID currently associated with framework ID
>>>> '0dffbee9-a514-4ffa-87e1-2850dd4dcf00'`.
>>>>
>>>>
>>>> The master logs contains:
>>>> ````
>>>> I0813 02:42:48.376819 13934 http.cpp:381] HTTP POST for
>>>> /master/api/v1/scheduler from 192.168.33.1:60780 with
>>>> User-Agent='Google-HTTP-Java-Client/1.20.0 (gzip)'
>>>> I0813 02:42:48.376998 13934 master.cpp:2146] Received subscription
>>>> request for HTTP framework 'name'
>>>> I0813 02:42:48.377104 13934 master.cpp:2244] Subscribing framework
>>>> 'name' with checkpointing disabled and capabilities [  ]
>>>> I0813 02:42:48.377378 13934 hierarchical.cpp:271] Added framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00
>>>> I0813 02:42:49.475163 13929 http.cpp:381] HTTP POST for
>>>> /master/api/v1/scheduler from 192.168.33.1:60782 with
>>>> User-Agent='Google-HTTP-Java-Client/1.20.0 (gzip)'
>>>> I0813 02:42:51.133513 13930 master.cpp:1284] Framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name) disconnected
>>>> I0813 02:42:51.133597 13930 master.cpp:2725] Disconnecting framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
>>>> I0813 02:42:51.133618 13930 master.cpp:2749] Deactivating framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
>>>> I0813 02:42:51.133644 13930 master.cpp:1297] Giving framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name) 0ns to failover
>>>> I0813 02:42:51.133692 13932 hierarchical.cpp:382] Deactivated framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00
>>>> I0813 02:42:51.137265 13931 master.cpp:5561] Framework failover
>>>> timeout, removing framework 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
>>>> I0813 02:42:51.137339 13931 master.cpp:6296] Removing framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00 (name)
>>>> I0813 02:42:51.137464 13931 hierarchical.cpp:333] Removed framework
>>>> 0dffbee9-a514-4ffa-87e1-2850dd4dcf00
>>>> ````
>>>> Note the immediate disconnection after the second POST is intentional.
>>>>
>>>> This is with Mesos 1.0.0 on Ubuntu Trusty.
>>>>
>>>> What can I do to debug this issue? The logs do not provide a lot of
>>>> information to act on. The stream id generated by mesos is not in the logs,
>>>> nor anything indicating that an HTTP 400 was sent.
>>>>
>>>> --
>>>> Zameer Manji
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Mime
View raw message