drill-dev mailing list archives

From Sudheesh Katkam <sudhe...@apache.org>
Subject Re: Can this scenario cause a query to hang ?
Date Thu, 07 Apr 2016 21:36:25 GMT
I can answer one question myself. See inline.

As you mentioned elsewhere, this issue will rarely happen (and will be even
harder to reproduce) once DRILL-3714 is committed.

On Thu, Apr 7, 2016 at 11:38 AM, Sudheesh Katkam <sudheesh@apache.org>
wrote:

> Hakim,
>
> Can you point me to where [3] happens?
>
> Two questions:
>
> + Why is the root fragment blocked? If the user channel is closed, the
> query is cancelled [1], which should cancel and interrupt all running
> fragments. This interruption happens regardless of the fragment failure
> you pointed out, which occurs when the user channel is closed [2]. Unless
> there is a blocking call when the failure is handled through the channel
> closed listener, I don't see why cancellation is not triggered.
>

It is possible for fragment failure to be fully processed before Foreman
cancels all running fragments, in which case the root fragment will not be
interrupted (because it is not cancelled, see
QueryManager#cancelExecutingFragments).
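
To make the ordering concrete, here is a minimal sketch of that race. It is
not Drill's code: the Fragment class, its states, and cancelIfRunning() are
hypothetical stand-ins, and I am assuming that cancellation only touches
fragments that are still in a running state.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CancelRaceSketch {
    // Hypothetical fragment states, not Drill's actual state enum.
    enum State { RUNNING, CANCELLATION_REQUESTED, FAILED }

    // Hypothetical stand-in for a fragment as tracked by the Foreman.
    static final class Fragment {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        volatile boolean interrupted = false;

        // Failure is fully processed: the fragment is no longer RUNNING.
        void fail() {
            state.set(State.FAILED);
        }

        // Assumption: cancellation (and the interrupt that comes with it)
        // is applied only to fragments that are still RUNNING.
        boolean cancelIfRunning() {
            if (state.compareAndSet(State.RUNNING, State.CANCELLATION_REQUESTED)) {
                interrupted = true;
                return true;
            }
            return false;
        }
    }

    // Failure wins the race: the later cancel skips the fragment.
    static boolean failThenCancel() {
        Fragment root = new Fragment();
        root.fail();
        root.cancelIfRunning();
        return root.interrupted;
    }

    // Cancel wins the race: the fragment is interrupted as expected.
    static boolean cancelThenFail() {
        Fragment root = new Fragment();
        root.cancelIfRunning();
        root.fail();
        return root.interrupted;
    }

    public static void main(String[] args) {
        System.out.println("fail then cancel, interrupted = " + failThenCancel());
        System.out.println("cancel then fail, interrupted = " + cancelThenFail());
    }
}
```

In the first ordering nothing ever interrupts the root fragment, so whatever
it is blocked on stays blocked.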


> + Why does the Foreman wait forever? AFAIK failures are reported
> immediately to the user. Is the root fragment not reported as FAILED to the
> Foreman?
>
> Thank you,
> Sudheesh
>
> [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java#L179
> [2]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java#L92
>
> On Thu, Apr 7, 2016 at 6:29 AM, John Omernik <john@omernik.com> wrote:
>
>> Abdel -
>>
>> I think I've seen this on a MapR cluster I run, especially on CTAS.  For
>> me, I have not brought it up because the cluster I am running on has some
>> serious personal issues (like being hardware that's nearly 7 years old;
>> it's a test cluster) and given the "hard to reproduce" nature of the
>> problem, I've been reluctant to create noise. Given what you've described,
>> it seems very similar to CTAS hangs I've seen, but couldn't accurately
>> reproduce.
>>
>> This didn't add much to your post, but I wanted to give you a +1 for
>> outlining this potential problem.  Once I move to more robust hardware,
>> and I am in similar situations, I will post more verbose details from my
>> side.
>>
>> John
>>
>>
>>
>> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <
>> adeneche@maprtech.com>
>> wrote:
>>
>> > So, we've been seeing some queries hang. I've come up with a possible
>> > explanation, but so far it's really difficult to reproduce. Let me know
>> > if you think this explanation doesn't hold up or if you have any ideas
>> > how we can reproduce it. Thanks
>> >
>> > - generally it's a CTAS running on a large cluster (lots of writers
>> > running in parallel)
>> > - logs show that the user channel was closed and UserServer caused the
>> > root fragment to move to a FAILED state [1]
>> > - jstack shows that the root fragment is blocked in its receiver,
>> > waiting for data [2]
>> > - jstack also shows that ALL other fragments are no longer running, and
>> > the logs show that all of them succeeded [3]
>> > - the foreman waits *forever* for the root fragment to finish
>> >
>> > [1] the only case I can think of is when the user channel closed while
>> > the fragment was waiting for an ack from the user client
>> > [2] if a writer finishes earlier than the others, it will send a data
>> > batch to the root fragment that will be sent to the user. The root will
>> > then immediately block on its receiver waiting for the remaining writers
>> > to finish
>> > [3] once the root fragment moves to a failed state, the receiver will
>> > immediately release any received batch and return an OK to the sender
>> > without putting the batch in its blocking queue.
>> >
>> > Abdelhakim Deneche
>> >
>> > Software Engineer
>> >
>>
>
>
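
For anyone trying to reproduce this, the receiver behaviour Hakim describes
in [2] and [3] can be sketched roughly as below. This is only an
illustration under assumptions: the Receiver class and its methods are
hypothetical, not Drill's API, and a timed poll stands in for the real
blocking wait so that the sketch terminates.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ReceiverHangSketch {
    // Hypothetical receiver owned by the root fragment.
    static final class Receiver {
        private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private volatile boolean failed = false;

        void markFailed() {
            failed = true;
        }

        // Sender side: always acknowledged with OK, but once the fragment
        // has failed the batch is released instead of being queued.
        boolean accept(String batch) {
            if (!failed) {
                queue.add(batch);
            }
            return true;
        }

        // Root fragment side: a real receiver would block without a timeout;
        // the timeout here only keeps the sketch from hanging.
        String next(long timeoutMillis) {
            try {
                return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
    }

    static String run() {
        Receiver receiver = new Receiver();

        // An early writer finishes; its batch reaches the root fragment.
        receiver.accept("batch-from-early-writer");
        String first = receiver.next(100);

        // The user channel closes and the root fragment moves to FAILED.
        receiver.markFailed();

        // A later writer finishes, gets an OK, and reports success,
        // but its batch never lands in the queue.
        boolean ack = receiver.accept("batch-from-late-writer");

        // The root fragment waits for data that will never arrive.
        String second = receiver.next(100);

        return first + "," + ack + "," + second;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With an untimed wait and no interrupt, the last call is where the root
fragment would sit while every sender has already reported success.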
