kylin-user mailing list archives

From Luke Han <luke...@gmail.com>
Subject Re: Urgent help needed: Kylin query node hangs after running queries for 20+ minutes
Date Tue, 01 May 2018 12:41:06 GMT
Hi Shen,
Please do not use "emergency support" or similar wording in the
community; it does not encourage people to help with your question.
People need time to understand your problem, and if they have time they will
answer with their best effort, but there is no guarantee. Please be aware of this.

"Everyone active in ASF projects is here as a volunteer, nobody is paid to
provide support here."
see here:
https://community.apache.org/newbiefaq.html#how-do-i-get-user-support-for-an-asf-project

For your problem, could you please send one question per thread?

Thanks.
Luke



Best Regards!
---------------------

Luke Han

On Tue, Apr 24, 2018 at 3:04 PM, 沈鲁威 <jingtian@dianjia.io> wrote:

> There are no OOM or overload errors in the region server log.
>
> Our HBase version is 1.2.0-cdh
>
>
> On Apr 24, 2018, at 1:59 PM, Ma Gang <mg4work@163.com> wrote:
>
> You may want to check the region server log: is the related region server
> running out of memory (OOM) or overloaded?
>
>
> At 2018-04-24 13:47:08, "沈鲁威" <jingtian@dianjia.io> wrote:
> >
> >Additional exception details:
> >kylin.log:Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed deadline! Maybe server is overloaded
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951)
> >kylin.log-      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> >--
> >kylin.log-      at org.apache.hadoop.hbase.ipc.RegionCoprocessorRpcChannel.callExecService(RegionCoprocessorRpcChannel.java:107)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel.callMethod(CoprocessorRpcChannel.java:56)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService$Stub.visitCube(CubeVisitProtos.java:5616)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$2.call(CubeHBaseEndpointRPC.java:237)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$2.call(CubeHBaseEndpointRPC.java:206)
> >kylin.log-      at org.apache.hadoop.hbase.client.HTable$15.call(HTable.java:1800)
> >kylin.log-      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >kylin.log-      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >kylin.log-      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >kylin.log-      ... 1 more
> >kylin.log:Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed deadline! Maybe server is overloaded
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259)
> >kylin.log-      at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969)
> >kylin.log-      at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951)
> >kylin.log-      at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
> >kylin.log-      at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183)
> >> On Apr 23, 2018, at 10:51 PM, 沈鲁威 <jingtian@dianjia.io> wrote:
> >>
> >> Hi all,
> >> We have set up CDH 5.13.1 + Kylin 2.3.0:
> >> one job node and three query nodes behind an SLB load balancer (4 cores, 8 GB RAM).
> >>
> >>
> >>
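> >> Problem: every so often during normal operation, queries on one of the query nodes start timing out, and right afterwards all queries become unavailable.
> >> The other nodes only work normally again after we stop that query node with kylin.sh stop.
> >>
> >> The machine load is not high.
> >> The following errors have appeared in the logs: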
> >> 1. ncategorized SQLException for SQL []; SQL state [null]; error code [0]; exception while executing query: java.io.IOException: POST failed, error code 500 and response: {"code":"999","data":null,"msg":"Timeout visiting cube! Check why coprocessor exception is not sent back? In coprocessor Self-termination is checked every 100 scanned rows, the configured timeout(54000) cannot support this many scans?\nwhile executing SQL: \"select COALESCE(SUM(a.total_sale_money_kpi),0) as total_sale_money_kpi , COALESCE(SUM(a.total_sale_count_kpi),0) as
> >> 2. by total_sale_money_kpi desc ### Cause: java.sql.SQLException: exception while executing query: java.io.IOException: POST failed, error code 500 and response: {"code":"999","data":null,"msg":"org.apache.hadoop.hbase.DoNotRetryIOException: org.apache.hadoop.hbase.DoNotRetryIOException: Coprocessor passed deadline! Maybe server is overloaded at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.checkDeadline(CubeVisitService.java:225) at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.CubeVisitService.visitCube(CubeVisitService.java:259) at org.apache.kylin.storage.hbase.cube.v2.coprocessor.endpoint.generated.CubeVisitProtos$CubeVisitService.callMethod(CubeVisitProtos.java:5555) at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7931) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1969) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1951) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2191) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:183) at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163)\nwhile executing SQL:
> >>
> >>
> >> <CE1ED564E277BCD093CB59000F043C9F.png>
> >>
> >>
> >>
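> >> jstack output
> >>
> >> Case 1:
> >> Many threads are waiting on the same lock, sometimes more than 100.
> >> We suspect that some lock is being held, possibly a global one, because when one machine has the problem the other machines cannot run queries either.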
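One way to check a suspicion like this against a full dump is to count how many threads reference each synchronizer address. The sketch below is only an illustration, not part of Kylin: the class name `ParkedThreadCounter` and its method are hypothetical, and it assumes the dump uses the usual jstack wording ("parking to wait for", "waiting on", "waiting to lock") followed by an address in angle brackets.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: groups the threads of a jstack dump by the object they wait on.
class ParkedThreadCounter {
    // Matches lines such as "- parking to wait for  <0x00000007008eeff8> (a ...)".
    private static final Pattern WAIT_LINE = Pattern.compile(
            "- (?:parking to wait for|waiting on|waiting to lock)\\s+<(0x[0-9a-fA-F]+)>");

    /** Returns, for each address found, the number of wait lines that mention it. */
    static Map<String, Integer> countByMonitor(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WAIT_LINE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Read the whole dump from standard input.
        String dump = new BufferedReader(new InputStreamReader(System.in))
                .lines().collect(Collectors.joining("\n"));
        countByMonitor(dump).forEach((address, n) ->
                System.out.println(address + " -> " + n + " thread(s)"));
    }
}
```

A possible invocation, assuming the dump was saved to a file first, is `java ParkedThreadCounter < dump.txt`. A high count alone does not prove contention: the stack frames of the waiting threads still need to be read to see what kind of wait it is.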
> >>
> >>
> >> "kylin-coproc--pool2-t82051" #93742 daemon prio=5 os_prio=0 tid=0x00007f314d435800 nid=0x1fb waiting on condition [0x00007f315abad000]
> >>   java.lang.Thread.State: TIMED_WAITING (parking)
> >> 	at sun.misc.Unsafe.park(Native Method)
> >> 	- parking to wait for  <0x00000007008eeff8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> 	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> >> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> >> 	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> >> 	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
> >> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> 	at java.lang.Thread.run(Thread.java:748)
> >>
> >>   Locked ownable synchronizers:
> >> 	- None
> >>
> >> "kylin-coproc--pool2-t82050" #93741 daemon prio=5 os_prio=0 tid=0x00007f314dc24800 nid=0x1fa waiting on condition [0x00007f315c1bb000]
> >>   java.lang.Thread.State: TIMED_WAITING (parking)
> >> 	at sun.misc.Unsafe.park(Native Method)
> >> 	- parking to wait for  <0x00000007008eeff8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> >> 	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> >> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> >> 	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> >> 	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
> >> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> >> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> 	at java.lang.Thread.run(Thread.java:748)
> >>
> >>   Locked ownable synchronizers:
> >> 	- None
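For readers interpreting the two dumps above: the frames `ThreadPoolExecutor.getTask` and `LinkedBlockingQueue.poll` with state TIMED_WAITING are what a pool worker looks like while it waits, with a keep-alive timeout, for its next task. All idle workers of one pool wait on the same queue condition, so they all show the same address. The sketch below reproduces that picture with a plain JDK pool; the class name `IdleWorkerDemo`, the thread names, and the pool sizes are made up for illustration, and nothing here is taken from Kylin's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: shows where idle ThreadPoolExecutor workers are parked.
class IdleWorkerDemo {
    /** Returns true if every worker of an idle pool sits in LinkedBlockingQueue.poll. */
    static boolean idleWorkersParkInPoll() throws Exception {
        List<Thread> workers = Collections.synchronizedList(new ArrayList<>());
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(),
                runnable -> {
                    Thread t = new Thread(runnable, "demo-pool-t" + workers.size());
                    t.setDaemon(true);
                    workers.add(t); // remember each worker so it can be inspected later
                    return t;
                });
        // With a keep-alive timeout on core threads, idle workers use the timed poll().
        pool.allowCoreThreadTimeOut(true);
        try {
            // Two trivial tasks start two workers; afterwards both workers are idle.
            pool.submit(() -> {}).get();
            pool.submit(() -> {}).get();

            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            for (Thread worker : new ArrayList<>(workers)) {
                // Give the worker a moment to go back to waiting for work.
                while (worker.getState() != Thread.State.TIMED_WAITING
                        && System.nanoTime() < deadline) {
                    Thread.sleep(10);
                }
                boolean inPoll = false;
                for (StackTraceElement frame : worker.getStackTrace()) {
                    if (frame.getClassName().equals("java.util.concurrent.LinkedBlockingQueue")
                            && frame.getMethodName().equals("poll")) {
                        inPoll = true;
                    }
                }
                if (worker.getState() != Thread.State.TIMED_WAITING || !inPoll) {
                    return false;
                }
            }
            return workers.size() == 2;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("idle workers parked in poll: " + idleWorkersParkInPoll());
    }
}
```

If this reading applies to the dumps above, those threads are idle rather than blocked on a held lock. The two entries shown are not enough to say whether that holds for every waiting thread in the full dump.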
> >>
> >>
> >>
> >> Case 2:
> >> A thread pool problem, but so far we have not found which setting controls the thread pool size.
> >>
> >> 2018-04-22 10:56:13,407 ERROR [pool-10-thread-806] v2.CubeHBaseEndpointRPC:340 : <sub-thread for Query 492811-3d81d0ee-b6c9-443b-b652-3f94f5072cd1-1524365662180 GTScanRequest 1578e6c>Error when visiting cubes by endpoint
> >> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@6006a8c3 rejected from java.util.concurrent.ThreadPoolExecutor@276cb5e4[Shutting down, pool size = 19, active threads = 19, queued tasks = 0, completed tasks = 90389]
> >> at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> >> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> >> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> >> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
> >> at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1795)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.runEPRange(CubeHBaseEndpointRPC.java:205)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.access$000(CubeHBaseEndpointRPC.java:69)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$1.run(CubeHBaseEndpointRPC.java:186)
> >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> at java.lang.Thread.run(Thread.java:748)
> >> 2018-04-22 10:56:13,407 DEBUG [Query 492811-e0a95289-a23e-4eb2-a1d2-e0000fd66ac4-1524365662196-116] gtrecord.GTCubeStorageQueryBase:311 : Need storage aggregation
> >> 2018-04-22 10:56:13,408 INFO  [Query 123629-8888aa31-e163-41c7-84d2-4b06a6b8da18-1524365659125-143] service.QueryService:1134 : Processed rows for each storageContext: 7
> >> 2018-04-22 10:56:13,408 ERROR [pool-10-thread-800] v2.CubeHBaseEndpointRPC:340 : <sub-thread for Query 492811-3d81d0ee-b6c9-443b-b652-3f94f5072cd1-1524365662180 GTScanRequest 5677c55d>Error when visiting cubes by endpoint
> >> java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@6006a8c3 rejected from java.util.concurrent.ThreadPoolExecutor@276cb5e4[Shutting down, pool size = 19, active threads = 19, queued tasks = 0, completed tasks = 90389]
> >> at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> >> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> >> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> >> at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
> >> at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1795)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.runEPRange(CubeHBaseEndpointRPC.java:205)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC.access$000(CubeHBaseEndpointRPC.java:69)
> >> at org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$1.run(CubeHBaseEndpointRPC.java:186)
> >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >> at java.lang.Thread.run(Thread.java:748)
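A note on reading the exception text above: the bracketed state says "Shutting down", which is how `ThreadPoolExecutor.toString()` describes a pool on which `shutdown()` has been called but which has not yet terminated. A pool in that state rejects new tasks whatever its size. The sketch below reproduces the same kind of message with a plain JDK pool. The class name `RejectedAfterShutdownDemo` and the pool size are invented for illustration, and it does not show why the pool in the log was shutting down.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical demo: submitting to a pool after shutdown() is rejected.
class RejectedAfterShutdownDemo {
    /** Returns the rejection message, or "accepted" if the task was not rejected. */
    static String submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch release = new CountDownLatch(1);
        // Keep one task pending so the pool stays in the "Shutting down" state.
        pool.submit(() -> {
            release.await();
            return null;
        });
        pool.shutdown(); // no new tasks are accepted from here on
        try {
            pool.submit(() -> {});
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage();
        } finally {
            release.countDown(); // let the pending task finish so the pool can terminate
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown());
    }
}
```

Under this reading, the useful question is what shut the pool down while queries were still being submitted, more than which setting controls its size.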
> >>
> >> <091BC280DBCABF12925C7456BF791602.jpg>
> >>
> >>
> >> Case 3: the following error has appeared
> >> hangzhou.dianjia.io trying to unlock /kylin/kylin_metadata/job_engine/global_job_engine_lock
> >> kylin.out:      at org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlock(ZookeeperDistributedLock.java:236)
> >> kylin.out:      at org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlockJobEngine(ZookeeperDistributedLock.java:311)
> >> kylin.out:      at org.apache.kylin.storage.hbase.util.ZookeeperJobLock.unlockJobEngine(ZookeeperJobLock.java:86)
> >> kylin.out-      at org.apache.kylin.job.impl.threadpool.DefaultScheduler.shutdown(DefaultScheduler.java:234)
> >> kylin.out-      at org.apache.kylin.rest.service.JobService$2.run(JobService.java:140)
> >> kylin.out-      at java.lang.Thread.run(Thread.java:748)
> >> kylin.out-Caused by: java.lang.IllegalStateException: Client is not started
> >> kylin.out-      at com.google.common.base.Preconditions.checkState(Preconditions.java:149)
> >> kylin.out:      at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
> >> kylin.out-      at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:477)
> >> kylin.out-      at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:238)
> >> kylin.out-      at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:233)
> >> kylin.out-      at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
> >> kylin.out-      at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:230)
> >> kylin.out-      at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:214)
> >> kylin.out-      at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:41)
> >> kylin.out:      at org.apache.kylin.storage.hbase.util.ZookeeperDistributedLock.unlock(ZookeeperDistributedLock.java:231)
> >>
> >>
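> >> We suspected the following code,
> >> but we tested it with the synchronized lock removed and the situation stayed the same.
> >> In many cases, execution is very slow between the 66666 and 77777 markers in the figure below.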
> >> <B5ED37ABAEE71EB70911E69D10DD3252.png>
> >>
> >>
> >>
> >>
> >>
> >> <kylin配置.txt>
