drill-issues mailing list archives

From "Chris Westin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3030) Foreman seems to be unable to cancel itself
Date Thu, 14 May 2015 23:07:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544569#comment-14544569 ]

Chris Westin commented on DRILL-3030:
-------------------------------------

The Foreman has noticed (probably via the FragmentStateListener) that the fragment running on
the node you killed is gone, so it is now trying to cancel all the other fragments as part
of its cleanup. I wonder whether it's blocked trying to communicate with the dead node. I'll
check whether we exclude that node's fragment from the set to be cancelled. In any case, we
should issue the cancel call with a timeout: even if the dead node is excluded, any of the
other target nodes could go down in the meantime. A timeout is probably a better solution than
trying to weed out the failed fragment from the set of cancellations being sent.
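
To illustrate the idea of bounding the cancel call with a timeout, here is a minimal Java sketch. It is not Drill's code: the CancelRpc interface, the cancelAll helper, and the endpoint names are hypothetical stand-ins, and the real ControlTunnel/ReconnectingConnection path may need a different mechanism. It only shows the pattern of sending every cancel first and then waiting against one shared deadline, so a dead node cannot block the caller indefinitely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CancelWithTimeout {

    /** Hypothetical stand-in for a per-node cancel RPC; not Drill's ControlTunnel API. */
    interface CancelRpc {
        CompletableFuture<Void> cancelFragment(String endpoint);
    }

    /**
     * Sends a cancel to every endpoint, then waits at most timeoutMillis in total
     * for the acknowledgements. Returns the endpoints that did not acknowledge
     * (timed out or failed) instead of blocking on them.
     */
    static List<String> cancelAll(List<String> endpoints, CancelRpc rpc, long timeoutMillis) {
        // Fire all requests before waiting on any of them.
        Map<String, CompletableFuture<Void>> pending = new LinkedHashMap<>();
        for (String endpoint : endpoints) {
            pending.put(endpoint, rpc.cancelFragment(endpoint));
        }

        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        List<String> unacknowledged = new ArrayList<>();
        for (Map.Entry<String, CompletableFuture<Void>> entry : pending.entrySet()) {
            // All waits share one deadline, so the total wait is bounded.
            long remainingNanos = Math.max(0L, deadline - System.nanoTime());
            try {
                entry.getValue().get(remainingNanos, TimeUnit.NANOSECONDS);
            } catch (TimeoutException | ExecutionException e) {
                unacknowledged.add(entry.getKey());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                unacknowledged.add(entry.getKey());
            }
        }
        return unacknowledged;
    }

    public static void main(String[] args) {
        // "node-b" simulates the killed drillbit: its future never completes.
        CancelRpc rpc = endpoint -> endpoint.equals("node-b")
                ? new CompletableFuture<Void>()
                : CompletableFuture.<Void>completedFuture(null);
        List<String> failed = cancelAll(Arrays.asList("node-a", "node-b", "node-c"), rpc, 200);
        System.out.println("unacknowledged: " + failed);
    }
}
```

With this shape the caller does not need to know in advance which node is dead: the unresponsive endpoint simply ends up in the returned list once the deadline passes, and cleanup can proceed.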

> Foreman seems to be unable to cancel itself
> -------------------------------------------
>
>                 Key: DRILL-3030
>                 URL: https://issues.apache.org/jira/browse/DRILL-3030
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.0.0
>            Reporter: Ramana Inukonda Nagaraj
>            Assignee: Chris Westin
>         Attachments: threadstack
>
>
> Steps to repro:
> 1. Ran a long-running query after a clean Drill restart.
> 2. Killed a non-foreman node.
> 3. Restarted the drillbits using clush.
> One of the drillbits (coincidentally, always a foreman node) refused to shut down.
> Jstack shows that the foreman is waiting:
> {code}
>         at org.apache.drill.exec.rpc.ReconnectingConnection$ConnectionListeningFuture.waitAndRun(ReconnectingConnection.java:105)
>         at org.apache.drill.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:81)
>         - locked <0x000000073878aaa8> (a org.apache.drill.exec.rpc.control.ControlConnectionManager)
>         at org.apache.drill.exec.rpc.control.ControlTunnel.cancelFragment(ControlTunnel.java:57)
>         at org.apache.drill.exec.work.foreman.QueryManager.cancelExecutingFragments(QueryManager.java:192)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:824)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:768)
>         at org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:73)
>         at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState(Foreman.java:770)
>         at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:871)
>         at org.apache.drill.exec.work.foreman.Foreman.access$2700(Foreman.java:107)
>         at org.apache.drill.exec.work.foreman.Foreman$StateListener.moveToState(Foreman.java:1132)
>         at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:460)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
