hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-19935) Hive WM session killed: Failed to update LLAP tasks count
Date Mon, 23 Jul 2018 23:55:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Sergey Shelukhin reassigned HIVE-19935:
---------------------------------------

    Assignee: Sergey Shelukhin  (was: Thai Bui)

> Hive WM session killed: Failed to update LLAP tasks count
> ---------------------------------------------------------
>
>                 Key: HIVE-19935
>                 URL: https://issues.apache.org/jira/browse/HIVE-19935
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 3.1.0
>            Reporter: Thai Bui
>            Assignee: Sergey Shelukhin
>            Priority: Minor
>
> I'm getting this error with WM feature quite frequently. It causes AM containers to shut
down and a new one created to replace it.
> {noformat}
> 018-06-18T19:06:49,969 INFO [Thread-250] monitoring.RenderStrategy$LogToFileFunction:
Map 1: 313(+270)/641
> 2018-06-18T19:06:49,988 INFO [NotificationEventPoll 0] metastore.HiveMetaStore: 4: get_config_value:
name=metastore.batch.retrieve.max defaultValue=50
> 2018-06-18T19:06:49,988 INFO [NotificationEventPoll 0] HiveMetaStore.audit: ugi=hive
ip=unknown-ip-addr cmd=get_config_value: name=metastore.batch.retrieve.max defaultValue=50
> 2018-06-18T19:06:50,204 INFO [pool-29-thread-1] tez.TriggerValidatorRunnable: Query:
hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4. Trigger { name: alluxio_medium,
expression: ALLUXIO_BYTES_READ >
> 6442450944, action: MOVE TO medium } violated. Current value: 7184667126. Applying action.
> 2018-06-18T19:06:50,205 INFO [pool-29-thread-1] tez.WorkloadManager: Queued move session:
49be39e5-875c-4cfe-8601-7fe84dd57e0c moving from default to medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Processing
current events
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Handling
move session event: 49be39e5-875c-4cfe-8601-7fe84dd57e0c moving from default to medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Subscribed
to counters: [S3A_BYTES_READ, BYTES_READ, ALLUXIO_BYTES_READ]
> 2018-06-18T19:06:50,205 INFO [pool-29-thread-1] tez.KillMoveTriggerActionHandler: Moved
session 49be39e5-875c-4cfe-8601-7fe84dd57e0c to pool medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.GuaranteedTasksAllocator:
Updating 49be39e5-875c-4cfe-8601-7fe84dd57e0c with 144 guaranteed tasks
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WmEvent: Added WMEvent:
EventType: MOVE EventStartTimestamp: 1529348810205 elapsedTime: 0 wmTezSessionInfo:SessionId:
49be39e5-875c-4cfe-8601-7fe
> 84dd57e0c Pool: medium Cluster %: 30.0
> 2018-06-18T19:06:50,234 INFO [StateChangeNotificationHandler] impl.ZkRegistryBase$InstanceStateChangeListener:
CHILD_UPDATED for zknode /user-hive/llap/workers/worker-0000001571
> 2018-06-18T19:06:50,235 INFO [StateChangeNotificationHandler] tez.TezSessionPool: AM
for 49be39e5-875c-4cfe-8601-7fe84dd57e0c, v.1571 has updated; updating [sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c,
qu
> eueName=llap, user=hive, doAs=false, isOpen=true, isDefault=true, expires in 586277120ms,
WM state poolName=medium, clusterFraction=0.3, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4,
killR
> eason=null] with an endpoint at 32769
> 2018-06-18T19:06:50,235 INFO [StateChangeNotificationHandler] tez.TezSessionState: Ignoring
an outdated info update 1571: TezAmInstance [49be39e5-875c-4cfe-8601-7fe84dd57e0c, host=ip-10-8-121-231.data.bazaar
> voice.com, rpcPort=33365, pluginPort=32769, token=null]
> 2018-06-18T19:06:50,323 ERROR [TaskCommunicator # 4] tez.GuaranteedTasksAllocator: Failed
to update guaranteed tasks count for the session sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c,
queueName=llap, user=
> hive, doAs=false, isOpen=true, isDefault=true, expires in 586277032ms, WM state poolName=medium,
clusterFraction=0.3, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=null
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.checkAndSendGuaranteedStateUpdate(LlapTaskSchedulerService.java:596)
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateGuaranteedCount(LlapTaskSchedulerService.java:581)
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateQuery(LlapTaskSchedulerService.java:3041)
> at org.apache.hadoop.hive.llap.tezplugins.endpoint.LlapPluginServerImpl.updateQuery(LlapPluginServerImpl.java:57)
> at org.apache.hadoop.hive.llap.plugin.rpc.LlapPluginProtocolProtos$LlapPluginProtocol$2.callBlockingMethod(LlapPluginProtocolProtos.java:835)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:242)
~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> at com.sun.proxy.$Proxy111.updateQuery(Unknown Source) ~[?:?]
> at org.apache.hadoop.hive.llap.impl.LlapPluginProtocolClientImpl.updateQuery(LlapPluginProtocolClientImpl.java:42)
~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.exec.tez.LlapPluginEndpointClientImpl$SendUpdateQueryCallable.call(LlapPluginEndpointClientImpl.java:128)
~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> at org.apache.hadoop.hive.ql.exec.tez.LlapPluginEndpointClientImpl$SendUpdateQueryCallable.call(LlapPluginEndpointClientImpl.java:105)
~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
[guava-19.0.jar:?]
> at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
[guava-19.0.jar:?]
> at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
[guava-19.0.jar:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: org.apache.hadoop.ipc.RemoteException: java.lang.NullPointerException
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.checkAndSendGuaranteedStateUpdate(LlapTaskSchedulerService.java:596)
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateGuaranteedCount(LlapTaskSchedulerService.java:581)
> at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateQuery(LlapTaskSchedulerService.java:3041)
> at org.apache.hadoop.hive.llap.tezplugins.endpoint.LlapPluginServerImpl.updateQuery(LlapPluginServerImpl.java:57)
> at org.apache.hadoop.hive.llap.plugin.rpc.LlapPluginProtocolProtos$LlapPluginProtocol$2.callBlockingMethod(LlapPluginProtocolProtos.java:835)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> at org.apache.hadoop.ipc.Client.call(Client.java:1437) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> at org.apache.hadoop.ipc.Client.call(Client.java:1347) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
> ... 11 more
> 2018-06-18T19:06:50,323 INFO [Workload management master] tez.WorkloadManager: Processing
current events
> 2018-06-18T19:06:50,323 INFO [Workload management master] tez.WorkloadManager: Update
failed for sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false,
isOpen=true, isDefault=true, expires in 586277032ms, WM state poolName=medium, clusterFraction=0.3,
queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=null
> 2018-06-18T19:06:50,324 INFO [Workload management master] tez.WorkloadManager: Replacing
sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=true,
isDefault=true, expires in 586277031ms, WM state poolName=null, clusterFraction=null, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4,
killReason=Failed to update resource allocation with a new session
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] tez.WorkloadManager: Created
new interactive session object b89aaebf-aeaa-4b76-974a-7047592a186b
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] tez.TezSessionState: Closing
Tez Session
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] client.TezClient: Shutting
down Tez Session, sessionName=HIVE-49be39e5-875c-4cfe-8601-7fe84dd57e0c, applicationId=application_1528322657674_0427
> 2018-06-18T19:06:50,353 INFO [Workload management worker 0] tez.TezSessionState: Attemting
to clean up resources for 49be39e5-875c-4cfe-8601-7fe84dd57e0c: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/49be39e5-875c-4cfe-8601-7fe84dd57e0c-resources;
0 additional files, 2 localized resources
> 2018-06-18T19:06:50,354 INFO [Workload management worker 0] tez.TezSessionState: User
of session id b89aaebf-aeaa-4b76-974a-7047592a186b is hive
> 2018-06-18T19:06:50,356 INFO [Workload management worker 0] tez.DagUtils: Localizing
resource because it does not exist: file:/usr/hdp/current/hive-server2-hive2/lib/hive-hcatalog-core.jar
to dest: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/hive-hcatalog-core.jar
> 2018-06-18T19:06:50,467 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1529348810466 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/hive-hcatalog-core.jar
> 2018-06-18T19:06:50,467 INFO [Workload management worker 0] tez.DagUtils: Localizing
resource because it does not exist: file:/usr/hdp/3.0.0.0-1064/hive2/auxlib/alluxio-core-client-runtime.jar
to dest: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/alluxio-core-client-runtime.jar
> 2018-06-18T19:06:50,504 INFO [Thread-250] monitoring.RenderStrategy$LogToFileFunction:
Map 1: 361(+0)/641
> 2018-06-18T19:06:50,785 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1529348810784 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/alluxio-core-client-runtime.jar
> 2018-06-18T19:06:50,785 INFO [Workload management worker 0] tez.TezSessionState: Created
new resources: null
> 2018-06-18T19:06:50,786 INFO [Workload management worker 0] tez.DagUtils: Jar dir is
null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,788 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,791 INFO [Workload management worker 0] tez.DagUtils: Jar dir is
null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,792 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1525405767036 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-llap-tez-3.1.0-SNAPSHOT-1530a700e8bc86603cf11cd8d0a2518f21a7a7c2ac09eb3af7782017ec05c4d3.jar
> 2018-06-18T19:06:50,792 INFO [Workload management worker 0] tez.DagUtils: Jar dir is
null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,793 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,794 INFO [Workload management worker 0] tez.DagUtils: Jar dir is
null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,796 INFO [Thread-250] SessionState: Status: Killed
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Dag received [DAG_TERMINATE,
DAG_KILL] in RUNNING state.
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Received message to shutdown
AM from hive (auth:SIMPLE) at 10.8.101.64
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Vertex killed, vertexName=Map
1, vertexId=vertex_1528322657674_0427_1_00, diagnostics=[Vertex received Kill while in RUNNING
state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:280, Vertex
vertex_1528322657674_0427_1_00 [Map 1] killed/failed due to:DAG_TERMINATED]
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: DAG did not succeed due to DAG_KILL.
failedVertices:0 killedVertices:1
> 2018-06-18T19:06:50,797 INFO [Workload management master] tez.WorkloadManager: Processing
current events
> 2018-06-18T19:06:50,797 INFO [Workload management worker 0] tez.DagUtils: Resource modification
time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,797 INFO [Workload management master] tez.WorkloadManager: Returning
sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=false,
isDefault=true, expires in 586276558ms, WM state poolName=null, clusterFraction=null, queryId=null,
killReason=Failed to update resource allocation{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Mime
View raw message