drill-dev mailing list archives

From Parth Chandra <par...@apache.org>
Subject Re: Native C++ Drill client handshake recovery
Date Tue, 13 Jun 2017 16:45:12 GMT
Looks like that code should not be compiled out just because
WIN32_SHUTDOWN_ON_TIMEOUT is undefined. There is no reason for that macro to
be defined on a non-Windows platform.
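
For illustration, here is a minimal sketch of the pattern under discussion. It is not the actual client source: ClientError, ConnectStatus, and both function names are made up for this example. It only shows how an error check wrapped in the macro disappears when the macro is undefined, so the caller always sees success.

```cpp
#include <string>

// Hypothetical stand-in for the client's error object (not the real Drill type).
struct ClientError { std::string msg; };

// Hypothetical status codes for this sketch.
enum ConnectStatus { CONN_SUCCESS, CONN_HANDSHAKE_FAILED };

// Guarded variant: the error check exists only when the macro is defined.
// Without the macro, a failed handshake is silently reported as success.
inline ConnectStatus checkHandshakeGuarded(const ClientError* pError) {
#ifdef WIN32_SHUTDOWN_ON_TIMEOUT
    if (pError != nullptr) return CONN_HANDSHAKE_FAILED;
#else
    (void)pError;  // check compiled out; parameter intentionally unused
#endif
    return CONN_SUCCESS;
}

// Unguarded variant: the error object is checked on every platform.
// pError is assumed to be non-null iff an error occurred.
inline ConnectStatus checkHandshakeUnguarded(const ClientError* pError) {
    return pError != nullptr ? CONN_HANDSHAKE_FAILED : CONN_SUCCESS;
}
```

With the unguarded form, a non-null error object is reported to the caller regardless of platform.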

Query timeout is an artifact of the time when there was no heartbeat
between Drill clients and the server, so it is possible there is some
unexpected behavior. In this case, if the client gets a heartbeat, it
assumes that the query is still in progress since the server is still up.
It is hard to determine whether the client should time out or keep waiting
if there is no indication of whether the query is still in progress. One
way to check is to send a message to the server and ask if the query is in
progress, but that might be a change in the RPC protocol so we didn't quite
do it that way. If the VM is paused, I would expect the heartbeat to fail,
though.
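
To make the two behaviors concrete, here is a rough sketch, assuming a simple deadline that is either pushed out on every heartbeat (the behavior reported in this thread) or left fixed from the time of submission. QueryTimer and its members are hypothetical names, not part of the client, and time is passed in explicitly so the logic can be followed without real waiting.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper for this sketch; not a class from the Drill client.
class QueryTimer {
public:
    QueryTimer(Clock::time_point start, std::chrono::seconds timeout,
               bool resetOnHeartbeat)
        : m_deadline(start + timeout),
          m_timeout(timeout),
          m_resetOnHeartbeat(resetOnHeartbeat) {}

    // Called whenever a heartbeat reply arrives from the server.
    void onHeartbeat(Clock::time_point now) {
        if (m_resetOnHeartbeat) {
            // Heartbeat treated as "query still in progress": extend the deadline.
            m_deadline = now + m_timeout;
        }
        // Otherwise the deadline stays where it was set at submission.
    }

    // True once the deadline has been reached.
    bool expired(Clock::time_point now) const { return now >= m_deadline; }

private:
    Clock::time_point m_deadline;
    std::chrono::seconds m_timeout;
    bool m_resetOnHeartbeat;
};
```

With a 10 second timeout and a heartbeat every 5 seconds, the resetting timer never expires while heartbeats keep arriving, whereas the fixed timer expires 10 seconds after submission. Which one is right depends on whether a live server implies a live query, which is the open question above.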

BTW, I would recommend submitting multiple small patches. It makes it
easier to review and merge.

Most importantly, thank you for taking the time to help improve the client!

Parth



On Mon, Jun 12, 2017 at 1:02 PM, Ralph Little <rlittle@inetco.com> wrote:

> Hi,
>
> Thanks for your response:
>
>> > The original caller to DrillClient::connect() thinks everything is
>> > hunky-dory.
>> >
>>
>> Yes, that would be a problem. From what I remember, the recvHandshake call
>> blocks in m_ioservice.run. On return from run, the recvHandshake checks if
>> the error object m_pError is not null. m_pError is not null iff there has
>> been an error. Do you see this not working correctly?
>>
> Ah yes, I see that this code is compiled out by default unless
> WIN32_SHUTDOWN_ON_TIMEOUT is defined.
> I enabled that and it works as you say.
>
>> > Currently, if you attempt a submitQuery() call when the connection is
>> > down, it just hangs because m_io_service is not running and
>> > m_deadlineTimer never triggers as a fallback.
>> >
>> > Opinions?
>> >
>>
>> It is a good idea to check connection status before sending any message to
>> the server. LMK if you want to submit a patch :), I can review and merge
>> it in.
>>
>
> I have added something and will send a patch shortly.
>
> As an aside, I'm trying to shore up the client's resilience to query
> failures from the back-end.
> If I set a query timeout then pause the Hadoop backend (in a VM) so that
> it is unresponsive, the application still hangs.
> This seems to be because the query timeout is reset every time a
> heartbeat (PONG) is received by the Native Client DLL.
> So again we get no application-side timeout.
>
> I still suspect that there may be a number of boundary scenarios that
> could cause the Native Client to lock up, so I'm looking into a way to
> add a "cancel" application API so that the application can time out
> by itself and cancel the pending query.
>
> When I'm happy with what we have, I'll submit a patch for your perusal.
>
> Cheers,
> Ralph
>
>

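On the quoted point about submitQuery() hanging when the connection is down, a connection check before sending might look roughly like the sketch below. Everything here is an assumption for illustration: SketchClient, SubmitStatus, and setConnected() are invented names, and the real client would derive the connection state from its socket rather than a flag.

```cpp
#include <string>

// Hypothetical status codes for this sketch.
enum SubmitStatus { QRY_SUBMITTED, QRY_NO_CONNECTION };

// Hypothetical client; stands in for the real connection handling.
class SketchClient {
public:
    // In this sketch the connection state is just a flag set by the caller.
    void setConnected(bool up) { m_connected = up; }

    // Fail fast instead of queuing work on a connection that is down.
    SubmitStatus submitQuery(const std::string& sql) {
        if (!m_connected) {
            return QRY_NO_CONNECTION;  // report immediately; nothing is sent
        }
        m_lastQuery = sql;  // placeholder for actually sending the request
        return QRY_SUBMITTED;
    }

    const std::string& lastQuery() const { return m_lastQuery; }

private:
    bool m_connected = false;
    std::string m_lastQuery;
};
```

The point of the early return is that the application gets an error it can act on, rather than waiting on a timer that never fires.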