drill-dev mailing list archives

From Abdel Hakim Deneche <adene...@maprtech.com>
Subject Re: Can this scenario cause a query to hang ?
Date Fri, 08 Apr 2016 06:20:35 GMT
Opened DRILL-4595 [1]  to track this issue.

Thanks

[1] https://issues.apache.org/jira/browse/DRILL-4595

On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche <adeneche@maprtech.com>
wrote:

> Hey John, thanks for sharing your experience. If you see this again, try
> collecting the jstack output for the foreman node of the query, and also
> check in the query profile which fragments are still marked as RUNNING.
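
A rough sketch of how a collected jstack dump could be filtered for waiting fragment threads. It assumes the standard jstack text layout (a quoted thread name on one line, followed by a `java.lang.Thread.State:` line). The class name, the method name, and the `frag` name filter are hypothetical; in particular, the filter is only an assumption about how fragment threads are named, so adjust it to whatever the dump actually shows.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockedFragmentThreads {

    // Returns the names of threads whose name contains nameFilter and whose
    // reported state starts with WAITING or TIMED_WAITING.
    static List<String> waitingThreads(String jstackOutput, String nameFilter) {
        List<String> result = new ArrayList<>();
        String currentName = null;
        for (String line : jstackOutput.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("\"")) {
                // Thread header line: the name is the first quoted token.
                int end = trimmed.indexOf('"', 1);
                currentName = end > 0 ? trimmed.substring(1, end) : null;
            } else if (currentName != null
                    && trimmed.startsWith("java.lang.Thread.State:")) {
                String state = trimmed
                        .substring("java.lang.Thread.State:".length()).trim();
                if (currentName.contains(nameFilter)
                        && (state.startsWith("WAITING")
                            || state.startsWith("TIMED_WAITING"))) {
                    result.add(currentName);
                }
                currentName = null; // state consumed; wait for next header
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written sample in jstack layout; the thread names are made up.
        String sample =
            "\"q1:frag:0:0\" #31 daemon prio=10\n"
          + "   java.lang.Thread.State: WAITING (parking)\n"
          + "\n"
          + "\"main\" #1 prio=5\n"
          + "   java.lang.Thread.State: RUNNABLE\n";
        System.out.println(waitingThreads(sample, "frag"));
    }
}
```

This only narrows down which threads to look at; the stack frames under each match still have to be read by hand to see what the thread is waiting on.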
>
> Thanks
>
> On Thu, Apr 7, 2016 at 2:29 PM, John Omernik <john@omernik.com> wrote:
>
>> Abdel -
>>
>> I think I've seen this on a MapR cluster I run, especially on CTAS.  For
>> me, I have not brought it up because the cluster I am running on has some
>> serious personal issues (like being hardware that's near 7 years old; it's
>> a test cluster), and given the "hard to reproduce" nature of the problem,
>> I've been reluctant to create noise. Given what you've described, it seems
>> very similar to CTAS hangs I've seen but couldn't accurately reproduce.
>>
>> This didn't add much to your post, but I wanted to give you a +1 for
>> outlining this potential problem.  Once I move to more robust hardware
>> and I am in similar situations, I will post more verbose details from my
>> side.
>>
>> John
>>
>>
>>
>> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <adeneche@maprtech.com>
>> wrote:
>>
>> > So, we've been seeing some queries hang. I've come up with a possible
>> > explanation, but so far it's really difficult to reproduce. Let me know
>> > if you think this explanation doesn't hold up or if you have any ideas
>> > how we can reproduce it. Thanks
>> >
>> > - generally it's a CTAS running on a large cluster (lots of writers
>> > running in parallel)
>> > - logs show that the user channel was closed and UserServer caused the
>> > root fragment to move to a FAILED state [1]
>> > - jstack shows that the root fragment is blocked in its receiver,
>> > waiting for data [2]
>> > - jstack also shows that ALL other fragments are no longer running, and
>> > the logs show that all of them succeeded [3]
>> > - the foreman waits *forever* for the root fragment to finish
>> >
>> > [1] the only case I can think of is when the user channel closed while
>> > the fragment was waiting for an ack from the user client
>> > [2] if a writer finishes earlier than the others, it will send a data
>> > batch to the root fragment that will be sent to the user. The root will
>> > then immediately block on its receiver waiting for the remaining writers
>> > to finish
>> > [3] once the root fragment moves to a failed state, the receiver will
>> > immediately release any received batch and return an OK to the sender
>> > without putting the batch in its blocking queue.
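
The sequence in [1] to [3] can be illustrated with a small stand-alone sketch. This is a simplified model under stated assumptions, not Drill's actual code: `Receiver`, `markFailed`, `onBatch`, and `next` are hypothetical names, the batches are plain strings, and the root side uses a timed poll only so the demo terminates (in the scenario described above the wait has no timeout).

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class RootFragmentHangSketch {

    // Hypothetical stand-in for the root fragment's receiver.
    static final class Receiver {
        private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final AtomicBoolean failed = new AtomicBoolean(false);

        // Models the root fragment being moved to a FAILED state.
        void markFailed() {
            failed.set(true);
        }

        // Sender side: the batch is always acknowledged with "OK", but once
        // the fragment has failed it is dropped instead of being enqueued.
        String onBatch(String batch) {
            if (!failed.get()) {
                queue.add(batch);
            }
            return "OK";
        }

        // Root fragment side: wait for the next batch. Returns null on
        // timeout, which stands in for "would have blocked forever".
        String next(long timeoutMillis) {
            try {
                return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();

        // An early writer finishes; the root receives its batch.
        receiver.onBatch("writer-0 summary");
        System.out.println("root received: " + receiver.next(100));

        // The user channel closes and the root is marked failed while it
        // is about to wait for the remaining writers.
        receiver.markFailed();

        // The remaining writer gets an OK and considers itself done.
        System.out.println("writer-1 ack: " + receiver.onBatch("writer-1 summary"));

        // Nothing was enqueued, so the root's wait never completes.
        System.out.println("root received: " + receiver.next(200));
    }
}
```

In this model every sender sees a successful ack while the consumer never gets woken up, which matches the observation that all other fragments report success while the root stays blocked.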
>> >
>> > Abdelhakim Deneche
>> >
>> > Software Engineer
>> >
>> >   <http://www.mapr.com/>
>> >
>> >
>>



-- 

Abdelhakim Deneche

Software Engineer

