drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Hou (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5840) A query that includes sort completes, and then loses Drill connection. Drill becomes unresponsive, and cannot restart because it cannot communicate with Zookeeper
Date Mon, 09 Oct 2017 18:39:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197476#comment-16197476
] 

Robert Hou commented on DRILL-5840:
-----------------------------------

The cluster ran out of disk space and caused problems for Zookeeper.

>From /opt/mapr/zookeeper/zookeeper-3.4.5/logs

2017-10-04 16:22:54,475 [myid:] - ERROR [SyncThread:0:SyncRequestProcessor@151] - Severe unrecoverable
error, exiting
java.io.IOException: No space left on device
        at javStopping zookeeper ... STOPPED
Starting zookeeper ... 2017-10-05 15:07:54,495 [myid:] - INFO  [main:QuorumPeerConfig@101]
- Reading configuration from: /opt/mapr/zookeeper/zookeeper-3.4.5/conf/zoo.cfg


> A query that includes sort completes, and then loses Drill connection. Drill becomes
unresponsive, and cannot restart because it cannot communicate with Zookeeper
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5840
>                 URL: https://issues.apache.org/jira/browse/DRILL-5840
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>    Affects Versions: 1.11.0
>            Reporter: Robert Hou
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> Query is:
> {noformat}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> select count(*) from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl`
order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
> {noformat}
> Query tries to complete, but cannot.  It takes 20 hours from the time the query tries
to complete, to the time Drill finally loses its connection.
> From the drillbit.log:
> {noformat}
> 2017-10-03 16:28:14,892 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman
- 262bec7f-3539-0dd7-6fea-f2959f9df3b6: State change requested RUNNING --> COMPLETED
> 2017-10-04 01:47:27,698 [UserServer-1] DEBUG o.a.d.e.r.u.UserServerRequestHandler - Received
query to run.  Returning query handle.
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  o.a.d.exec.work.foreman.QueryManager
- Failure while trying to delete the estore profile for this query.
> org.apache.drill.common.exceptions.DrillRuntimeException: unable to delete node at /running/262bec7f-3539-0dd7-6fea-f2959f9df3b6
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:343) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.coord.zk.ZkEphemeralStore.remove(ZkEphemeralStore.java:108)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:293)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.moveToState(Foreman.java:964) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.access$2600(Foreman.java:113) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1025)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent(Foreman.java:1018)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.common.EventProcessor.processEvents(EventProcessor.java:107) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.common.EventProcessor.sendEvent(EventProcessor.java:65) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman$StateSwitch.addEvent(Foreman.java:1020)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.addToEventQueue(Foreman.java:1038) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.nodeComplete(QueryManager.java:498)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.access$100(QueryManager.java:66)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager$NodeTracker.fragmentComplete(QueryManager.java:462)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.fragmentDone(QueryManager.java:147)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.access$400(QueryManager.java:66)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager$1.statusUpdate(QueryManager.java:525)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.control.WorkEventBus.statusUpdate(WorkEventBus.java:71)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.fragment.FragmentStatusReporter.sendStatus(FragmentStatusReporter.java:124)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.fragment.FragmentStatusReporter.stateChanged(FragmentStatusReporter.java:94)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:304)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:267)
[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.delete(ZookeeperClient.java:340) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	... 27 common frames omitted
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman
- 262bec7f-3539-0dd7-6fea-f2959f9df3b6: cleaning up.
> 2017-10-04 03:30:02,916 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.d.exec.rpc.control.WorkEventBus
- Removing fragment status listener for queryId 262bec7f-3539-0dd7-6fea-f2959f9df3b6.
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.drill.exec.work.foreman.Foreman
- 262bec7f-3539-0dd7-6fea-f2959f9df3b6: State change requested COMPLETED --> CANCELLATION_REQUESTED
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] WARN  o.a.drill.exec.work.foreman.Foreman
- Dropping request to move to CANCELLATION_REQUESTED state as query is already at COMPLETED
state (which is terminal).
> 2017-10-04 03:30:02,970 [262bec7f-3539-0dd7-6fea-f2959f9df3b6:frag:0:0] DEBUG o.a.d.e.w.f.FragmentStatusReporter
- Closing org.apache.drill.exec.work.fragment.FragmentStatusReporter@75e9a886
> 2017-10-04 03:30:02,970 [drill-executor-42] DEBUG o.a.d.exec.rpc.control.WorkEventBus
- Removing fragment manager: 262bec7f-3539-0dd7-6fea-f2959f9df3b6:0:0
> 2017-10-04 12:49:47,660 [UserServer-1] WARN  o.a.drill.exec.rpc.user.UserServer - Message
of mode REQUEST of rpc type 3 took longer than 500ms.  Actual duration was 39739962ms.
> 2017-10-04 12:49:47,662 [UserServer-1] ERROR o.a.d.exec.rpc.RpcExceptionHandler - Exception
in RPC communication.  Connection: /10.10.100.190:31010 <--> /10.10.100.190:37568 (user
server).  Closing connection.
> io.netty.handler.codec.DecoderException: org.apache.drill.common.exceptions.DrillRuntimeException:
unable to put 
> 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:99)
[netty-codec-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.handler.timeout.ReadTimeoutHandler.channelRead(ReadTimeoutHandler.java:150)
[netty-handler-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
[netty-codec-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:242)
[netty-codec-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:339)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:324)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:847)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: unable to put 
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:300) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:246) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.coord.zk.ZkEphemeralStore.put(ZkEphemeralStore.java:79) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.QueryManager.updateEphemeralState(QueryManager.java:285)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.recordNewState(Foreman.java:1043) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.foreman.Foreman.<init>(Foreman.java:180) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.work.user.UserWorker.submitWork(UserWorker.java:78) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:66)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.user.UserServerRequestHandler.handle(UserServerRequestHandler.java:47)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:157) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.BasicServer.handle(BasicServer.java:53) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:274) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.RpcBus$InboundHandler.decode(RpcBus.java:244) ~[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:89)
[netty-codec-4.0.27.Final.jar:4.0.27.Final]
> 	... 22 common frames omitted
> Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: error while checking
path on zookeeper
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:139)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:107)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.put(ZookeeperClient.java:273) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	... 35 common frames omitted
> Caused by: org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:172)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:161)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107) ~[curator-client-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:158)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:148)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:36)
~[curator-framework-2.7.1.jar:na]
> 	at org.apache.drill.exec.coord.zk.ZookeeperClient.hasPath(ZookeeperClient.java:130)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	... 37 common frames omitted
> 2017-10-04 12:49:47,667 [UserServer-1] INFO  o.a.drill.exec.rpc.user.UserServer - RPC
connection /10.10.100.190:31010 <--> /10.10.100.190:37579 (user server) timed out. 
Timeout was set to 30 seconds. Closing connection.
> 2017-10-04 12:49:47,668 [UserServer-1] INFO  o.a.drill.exec.work.foreman.Foreman - Failure
while trying communicate query result to initiating client. This would happen if a client
is disconnected before response notice can be sent.
> org.apache.drill.exec.rpc.ChannelClosedException: Channel closed /10.10.100.190:31010
<--> /10.10.100.190:37579.
> 	at org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:164)
[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at org.apache.drill.exec.rpc.RpcBus$ChannelClosedHandler.operationComplete(RpcBus.java:144)
[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:406) [netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:943)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:592)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:584) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1099)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.ChannelOutboundHandlerAdapter.close(ChannelOutboundHandlerAdapter.java:71)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:615)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:600)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:466)
[netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.handler.timeout.ReadTimeoutHandler.readTimedOut(ReadTimeoutHandler.java:187)
[netty-handler-4.0.27.Final.jar:4.0.27.Final]
> 	at org.apache.drill.exec.rpc.BasicServer$LoggingReadTimeoutHandler.readTimedOut(BasicServer.java:122)
[drill-rpc-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 	at io.netty.handler.timeout.ReadTimeoutHandler$ReadTimeoutTask.run(ReadTimeoutHandler.java:212)
[netty-handler-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) [netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) [netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
[netty-common-4.0.27.Final.jar:4.0.27.Final]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]
> {noformat}
> When I re-connect to Drill and try to show schemas, Drill hangs, and I see this in drillbit.log
> {noformat}
> 2017-10-04 13:01:01,414 [UserServer-1] DEBUG o.a.d.e.r.u.UserServerRequestHandler - Received
query to run.  Returning query handle.
> {noformat}
> I try to stop Drill, but it doesn't stop, so it gets killed.
> {noformat}
> [root@qa-node190 logs.67]# /opt/drill/bin/drillbit.sh stop
> Stopping drillbit
> .........................................................................................................................
> drillbit did not complete after 120 seconds, killing with kill -9 14550
> {noformat}
> I try to start Drill
> {noformat}
> /opt/drill/bin/drillbit.sh start
> {noformat}
> But it fails because it cannot communicate with Zookeeper:
> {noformat}
> 2017-10-04 14:40:32,812 [main] DEBUG o.a.d.e.c.zk.ZKClusterCoordinator - Starting ZKClusterCoordination.
> 2017-10-04 14:40:42,820 [main] ERROR o.apache.drill.exec.server.Drillbit - Failure during
initial startup of Drillbit.
> java.io.IOException: Failure to connect to the zookeeper cluster service within the allotted
time of 10000 milliseconds.
>         at org.apache.drill.exec.coord.zk.ZKClusterCoordinator.start(ZKClusterCoordinator.java:123)
~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.run(Drillbit.java:153) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:343) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 2017-10-04 14:40:42,820 [main] DEBUG o.apache.drill.exec.server.Drillbit - Shutdown begun.
> 2017-10-04 14:40:44,714 [main] WARN  o.apache.drill.exec.server.Drillbit - Failure on
close()
> java.lang.NullPointerException: null
>         at org.apache.drill.exec.work.WorkManager.close(WorkManager.java:152) ~[drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:199) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:346) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.start(Drillbit.java:316) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.main(Drillbit.java:312) [drill-java-exec-1.12.0-SNAPSHOT.jar:1.12.0-SNAPSHOT]
> 2017-10-04 14:40:44,715 [main] INFO  o.apache.drill.exec.server.Drillbit - Shutdown completed
(1894 ms).
> {noformat}
> I normally restart warden and zookeeper to resolve this.  When I stopped warden, it took
20 minutes (normally it should take a few minutes or less):
> {noformat}
> [root@qa-node190 logs.67]# time service mapr-warden stop
> stopping WARDEN
> looking to stop mapr-core processes not started by warden
> real	20m46.637s
> user	0m1.622s
> sys	0m2.432s
> {noformat}
> Stopping zookeeper was quick, and restarting all processes including Drill completed
as expected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message