drill-dev mailing list archives

From Abdel Hakim Deneche <adene...@maprtech.com>
Subject Re: Can this scenario cause a query to hang ?
Date Sun, 10 Apr 2016 06:20:24 GMT
There are many ways a query could hang. A jstack dump of the foreman node will
definitely help confirm whether it's the same issue.

Thanks

On Fri, Apr 8, 2016 at 6:06 PM, François Méthot <fmethot78@gmail.com> wrote:

> It might just add to the mystery of this issue, but when we start
> getting those hanging CTAS queries, restarting our Drill cluster makes
> the problem go away.
>
> Next time we start getting this problem I will try to collect the JStack
> output of the foreman too.
>
> Thanks for looking into this.
>
> Francois
>
>
>
> On Fri, Apr 8, 2016 at 2:20 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
>
> > Opened DRILL-4595 [1]  to track this issue.
> >
> > Thanks
> >
> > [1] https://issues.apache.org/jira/browse/DRILL-4595
> >
> > On Fri, Apr 8, 2016 at 6:42 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
> >
> > > Hey John, thanks for sharing your experience. If you see this again try
> > > collecting the jstack output for the foreman node of the query, and also
> > > check in the query profile which fragments are still marked as RUNNING.
> > >
> > > Thanks
> > >
> > > On Thu, Apr 7, 2016 at 2:29 PM, John Omernik <john@omernik.com> wrote:
> > >
> > >> Abdel -
> > >>
> > > >> I think I've seen this on a MapR cluster I run, especially on CTAS. For
> > > >> me, I have not brought it up because the cluster I am running on has some
> > > >> serious personal issues (like being hardware that's near 7 years old; it's
> > > >> a test cluster) and given the "hard to reproduce" nature of the problem,
> > > >> I've been reluctant to create noise. Given what you've described, it seems
> > > >> very similar to CTAS hangs I've seen, but couldn't accurately reproduce.
> > > >>
> > > >> This didn't add much to your post, but I wanted to give you a +1 for
> > > >> outlining this potential problem. Once I move to more robust hardware
> > > >> and I am in similar situations, I will post more verbose details from my
> > > >> side.
> > >>
> > >> John
> > >>
> > >>
> > >>
> > > >> On Thu, Apr 7, 2016 at 2:29 AM, Abdel Hakim Deneche <adeneche@maprtech.com> wrote:
> > >>
> > > >> > So, we've been seeing some queries hang. I've come up with a possible
> > > >> > explanation, but so far it's really difficult to reproduce. Let me know
> > > >> > if you think this explanation doesn't hold up or if you have any ideas
> > > >> > how we can reproduce it. Thanks
> > > >> >
> > > >> > - generally it's a CTAS running on a large cluster (lots of writers
> > > >> > running in parallel)
> > > >> > - logs show that the user channel was closed and UserServer caused the
> > > >> > root fragment to move to a FAILED state [1]
> > > >> > - jstack shows that the root fragment is blocked in its receiver
> > > >> > waiting for data [2]
> > > >> > - jstack also shows that ALL other fragments are no longer running, and
> > > >> > the logs show that all of them succeeded [3]
> > > >> > - the foreman waits *forever* for the root fragment to finish
> > > >> >
> > > >> > [1] the only case I can think of is when the user channel closed while
> > > >> > the fragment was waiting for an ack from the user client
> > > >> > [2] if a writer finishes earlier than the others, it will send a data
> > > >> > batch to the root fragment that will be sent to the user. The root will
> > > >> > then immediately block on its receiver waiting for the remaining writers
> > > >> > to finish
> > > >> > [3] once the root fragment moves to a failed state, the receiver will
> > > >> > immediately release any received batch and return an OK to the sender
> > > >> > without putting the batch in its blocking queue.
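
The suspected sequence in [1]-[3] can be illustrated with a small standalone
Java sketch. This is not Drill code: the class, the Receiver type and its
methods (onBatch, markFailed, next) are hypothetical names invented for the
illustration, and a timed poll stands in for the unbounded wait so that the
example terminates. It only models the idea that a receiver which acks and
drops batches after a failure leaves the consumer with nothing to wake it up.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative model only; none of these names come from the Drill code base.
final class RootFragmentHangSketch {

    /** Hypothetical stand-in for the root fragment's receiver. */
    static final class Receiver {
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final AtomicBoolean failed = new AtomicBoolean(false);

        /** Models the root fragment moving to a FAILED state. */
        void markFailed() {
            failed.set(true);
        }

        /** Called by a sender. Always returns true, modelling the OK ack. */
        boolean onBatch(String batch) {
            if (failed.get()) {
                return true; // batch dropped: acked, but never enqueued
            }
            queue.add(batch);
            return true;
        }

        /** Timed wait for the next batch; null means the wait timed out. */
        String next(long timeoutMs) {
            try {
                return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    /**
     * Plays the scenario once and reports whether the root fragment was still
     * waiting for data when the timeout expired. The steps run sequentially for
     * determinism; the outcome is the same if the root is already blocked when
     * the failure happens, because nothing is enqueued afterwards.
     */
    static boolean rootStarves(int writers, boolean channelCloses, long timeoutMs) {
        Receiver receiver = new Receiver();

        // One writer finishes early; the root reads its batch to forward it.
        receiver.onBatch("writer-0");
        if (receiver.next(timeoutMs) == null) {
            throw new IllegalStateException("first batch should be available");
        }

        // The user channel closes while the root is talking to the client.
        if (channelCloses) {
            receiver.markFailed();
        }

        // Remaining writers finish; every send is acked, so they all succeed.
        for (int i = 1; i < writers; i++) {
            receiver.onBatch("writer-" + i);
        }

        // The root goes back to waiting on its receiver.
        return receiver.next(timeoutMs) == null;
    }

    public static void main(String[] args) {
        System.out.println("starves after channel close: " + rootStarves(4, true, 200L));
        System.out.println("starves without failure:     " + rootStarves(4, false, 200L));
    }
}
```

In this model every sender sees a successful ack, which matches the
observation that all other fragments finished, while the consumer side never
receives another batch or any signal that it should stop waiting.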
> > >> >
> > >> > Abdelhakim Deneche
> > >> >
> > >> > Software Engineer
> > >> >
> > >> >   <http://www.mapr.com/>
> > >> >
> > >> >
> > > >> > Now Available - Free Hadoop On-Demand Training
> > > >> > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > >
> > > Abdelhakim Deneche
> > >
> > > Software Engineer
> > >
> > >   <http://www.mapr.com/>
> > >
> > >
> > > Now Available - Free Hadoop On-Demand Training
> > > <
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > >
> > >
> >
> >
> >
> > --
> >
> > Abdelhakim Deneche
> >
> > Software Engineer
> >
> >   <http://www.mapr.com/>
> >
> >
> > Now Available - Free Hadoop On-Demand Training
> > <
> >
> http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available
> > >
> >
>



-- 

Abdelhakim Deneche

Software Engineer

  <http://www.mapr.com/>


Now Available - Free Hadoop On-Demand Training
<http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
