hive-issues mailing list archives

From "Siddharth Seth (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP
Date Wed, 04 Jan 2017 00:03:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796623#comment-15796623 ]

Siddharth Seth commented on HIVE-15296:
---------------------------------------

[~sershe] - thanks for catching this and fixing it. https://issues.apache.org/jira/browse/HIVE-15163
is related to a large extent. Unfortunately, I didn't see the potential problem with the wait-queue
:(
The eventual plan was to allow a custom NodeId in Tez, which could then be defined by
LLAP (linked to the ZK-registered nodeId). That's essentially why LlapNodeId differs from the
YARN NodeId, which is used all over the place in Tez. I think this fixes it nicely for now, though.
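
For illustration only, here is a minimal sketch (not the actual Tez/LLAP code; the class and field
names are hypothetical) of why a plain host:port identity, as carried by the YARN NodeId, is not
enough once a daemon can be restarted in place, and what keying the AM's bookkeeping on the
ZK-registered id as well might look like:
{noformat}
// Hypothetical sketch: identity of an LLAP daemon as seen by the AM.
// A plain host:port key (what YARN's NodeId carries) stays the same across
// a daemon restart, so state keyed on it can silently survive the restart.
import java.util.Objects;

final class LlapInstanceIdentity {
    private final String host;       // e.g. "3n01"
    private final int rpcPort;       // e.g. 15001
    private final String uniqueId;   // id registered in ZK by the daemon; changes on restart

    LlapInstanceIdentity(String host, int rpcPort, String uniqueId) {
        this.host = host;
        this.rpcPort = rpcPort;
        this.uniqueId = uniqueId;
    }

    // Two identities with the same host:port but different uniqueId are
    // different daemon incarnations; tasks assigned to the old one are gone.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LlapInstanceIdentity)) return false;
        LlapInstanceIdentity other = (LlapInstanceIdentity) o;
        return rpcPort == other.rpcPort
            && host.equals(other.host)
            && uniqueId.equals(other.uniqueId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, rpcPort, uniqueId);
    }
}
{noformat}
With a host:port-only key, a restarted daemon maps onto the old incarnation's bookkeeping, which
could lead to the kind of lost-failure behavior described below.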

> AM may lose task failures and not reschedule when scheduling to LLAP
> --------------------------------------------------------------------
>
>                 Key: HIVE-15296
>                 URL: https://issues.apache.org/jira/browse/HIVE-15296
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15296.01.patch, HIVE-15296.patch, HIVE-15296.patch
>
>
> First attempt and failure detection:
> {noformat}
> 2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_0, priority=65, capability=<memory:4096,
vCores:1>, hosts=[3n01]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
Assigned task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, startTime=0,
containerId=null, assignedInstance=null, uniqueId=55, localityDelayTimeout=9223372036854775807}
to container container_222212222_2622_01_000056 on node=DynamicServiceInstance [alive=true,
host=3n01:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://3n01:15002, mgmtPort=15004]
> 2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65,
startTime=10550817928, containerId=container_222212222_2622_01_000056, assignedInstance=DynamicServiceInstance
[alive=true, host=3n01:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, localityDelayTimeout=9223372036854775807}
= SCHEDULED
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|:
TaskAttempt: [attempt_1478967587833_2622_1_06_000031_0] started. Is using containerId: [container_222212222_2622_01_000056]
on NM: [3n01:15001]
> 2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0,
startTime=1479500403427, containerId=container_222212222_2622_01_000056, nodeId=3n01:15001
> 2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] |tezplugins.LlapTaskCommunicator|:
Successfully launched task: attempt_1478967587833_2622_1_06_000031_0
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_0
sent events: (0-1).
> 2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.VertexImpl|: Sending
attempt_1478967587833_2622_1_06_000031_0 24 events [0,24) total 24 vertex_1478967587833_2622_1_06
[Map 1]
> 2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0,
creationTime=1479500401929, allocationTime=1479500403426, startTime=1479500403427, finishTime=1479500743249,
timeTaken=339822, status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR,
diagnostics=AttemptID:attempt_1478967587833_2622_1_06_000031_0 Timed out after 300 secs, nodeHttpAddress=http://3n01:15002,
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
> 2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Processing de-allocate request for task=attempt_1478967587833_2622_1_06_000031_0, state=ASSIGNED,
endReason=OTHER
> 2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: Attempt
failed on node: 3n01:15001 TA: attempt_1478967587833_2622_1_06_000031_0 failed: true container:
container_222212222_2622_01_000056 numFailedTAs: 7
> 2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source
task attempt completed for vertex: vertex_1478967587833_2622_1_07 [Reducer 2] attempt: attempt_1478967587833_2622_1_06_000031_0
with state: FAILED vertexState: RUNNING
> {noformat}
> Second attempt:
> {noformat}
> 2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_1, priority=64, capability=<memory:4096,
vCores:1>, hosts=null
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64,
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807}
= DELAYED_RESOURCES
> 2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
Preempting for task=attempt_1478967587833_2622_1_06_000031_1 on any available host
> 2016-11-18 20:25:43,298 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64,
startTime=0, containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807}
= DELAYED_RESOURCES
> 2016-11-18 20:25:43,299 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
Assigned task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0,
containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807}
to container container_222212222_2622_01_000164 on node=DynamicServiceInstance [alive=true,
host=1n02:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://1n02:15002, mgmtPort=15004]
> 2016-11-18 20:25:43,300 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|:
ScheduleResult for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64,
startTime=10551159245, containerId=container_222212222_2622_01_000164, assignedInstance=DynamicServiceInstance
[alive=true, host=1n02:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://1n02:15002, mgmtPort=15004], uniqueId=318, localityDelayTimeout=9223372036854775807}
= SCHEDULED
> 2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|:
TaskAttempt: [attempt_1478967587833_2622_1_06_000031_1] started. Is using containerId: [container_222212222_2622_01_000164]
on NM: [1n02:15001]
> 2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1,
startTime=1479500743301, containerId=container_222212222_2622_01_000164, nodeId=1n02:15001
> 2016-11-18 20:25:43,485 [INFO] [TaskCommunicator # 7] |tezplugins.LlapTaskCommunicator|:
Successfully launched task: attempt_1478967587833_2622_1_06_000031_1
> 2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_1
sent events: (0-1).
> 2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.VertexImpl|: Sending
attempt_1478967587833_2622_1_06_000031_1 24 events [0,24) total 24 vertex_1478967587833_2622_1_06
[Map 1]
> {noformat}
> The IPC thread logs some other activity 250ms later, so it's not stuck, nor did it log
an error.
> After this, there are no mentions of any tasks for 2622_1_06_000031 in the log.
> The attempt was not running in LLAP.
> However, the attempt does come up again when the DAG is killed, a long time after it got
stuck; presumably, the AM failed to detect the failure (or a heartbeat failure, for that matter?)
and never tried to reschedule.
> {noformat}
> 2016-11-22 08:45:14,228 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1,
creationTime=1479500743255, allocationTime=1479500743300, startTime=1479500743301, finishTime=1479804314228,
timeTaken=303570927, status=KILLED, errorEnum=TERMINATED_BY_CLIENT, diagnostics=Task KILL
is received. Killing attempt. Diagnostics: Killing tasks in vertex: vertex_1478967587833_2622_1_06
[Map 1] due to trigger: DAG_TERMINATED, nodeHttpAddress=http://1n02:15002, counters=Counters:
1, org.apache.tez.common.counters.DAGCounter, OTHER_LOCAL_TASKS=1
> {noformat}
> I do see some errors to this node elsewhere (Connection refused), but they happen later.
Let me know if you need anything else from the AM log.
> {noformat}
> 2016-11-18 20:27:15,902 [WARN] [TaskCommunicator # 8] |ipc.Client|: Failed to connect
to server: 1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in
MILLISECONDS): 16000
> 2016-11-18 20:27:34,517 [WARN] [TaskCommunicator # 3] |ipc.Client|: Failed to connect
to server: 1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in
MILLISECONDS): 16000
> {noformat}
> It seems like a parallel query caused memory exhaustion on LLAP; most mappers for
this job eventually completed successfully, a few were lost as described above, and the DAG got
stuck because of that.
> The relevant Slider containers and restarts are below (the last one seems to be the one where the attempt
was running).
> Unfortunately, the logs from the relevant LLAP containers are missing from aggregation.
> {noformat}
> 2016-11-18 20:20:24,822 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster
- Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000005',
container=ContainerID=container_e55_1478967587833_2465_01_000005 nodeID=3n01:45454  ... diagnostics='Container
[pid=1636544,containerID=container_e55_1478967587833_2465_01_000005] is running beyond physical
memory limits. Current usage: 140.3 GB of 140.3 GB physical memory used; 141.8 GB of 294.5
GB virtual memory used. Killing container.
> 2016-11-18 20:20:31,912 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster
- Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000003',
container=ContainerID=container_e55_1478967587833_2465_01_000003 nodeID=1n02:45454  ... diagnostics='Container
[pid=607173,containerID=container_e55_1478967587833_2465_01_000003] is running beyond physical
memory limits. Current usage: 141.0 GB of 140.3 GB physical memory used; 142.5 GB of 294.5
GB virtual memory used. Killing container.
> 2016-11-18 20:25:39,248 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster
- Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000007',
container=ContainerID=container_e55_1478967587833_2465_01_000007 nodeID=3n01:45454  ... diagnostics='Container
[pid=1227721,containerID=container_e55_1478967587833_2465_01_000007] is running beyond physical
memory limits. Current usage: 140.3 GB of 140.3 GB physical memory used; 143.2 GB of 294.5
GB virtual memory used. Killing container.
> 2016-11-18 20:26:30,250 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster
- Role instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000009',
container=ContainerID=container_e55_1478967587833_2465_01_000009 nodeID=1n02:45454  ... diagnostics='Container
[pid=1893586,containerID=container_e55_1478967587833_2465_01_000009] is running beyond physical
memory limits. Current usage: 141.2 GB of 140.3 GB physical memory used; 144.4 GB of 294.5
GB virtual memory used. Killing container.
> 2016-11-18 20:20:41,060 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService -
Component operation. Status: COMPLETED; new container state: HEALTHY; new component state:
STARTED
> 2016-11-18 20:20:49,171 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService -
Component operation. Status: COMPLETED; new container state: HEALTHY; new component state:
STARTED
> 2016-11-18 20:25:55,942 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService -
Component operation. Status: COMPLETED; new container state: HEALTHY; new component state:
STARTED
> 2016-11-18 20:26:47,019 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService -
Component operation. Status: COMPLETED; new container state: HEALTHY; new component state:
STARTED
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
