hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-15296) AM may lose task failures and not reschedule when scheduling to LLAP
Date Mon, 28 Nov 2016 21:22:58 GMT

     [ https://issues.apache.org/jira/browse/HIVE-15296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-15296:
------------------------------------
    Description: 
First attempt and failure detection:
{noformat}
2016-11-18 20:20:01,980 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_0, priority=65, capability=<memory:4096,
vCores:1>, hosts=[3n01]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned
task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, startTime=0, containerId=null,
assignedInstance=null, uniqueId=55, localityDelayTimeout=9223372036854775807} to container
container_222212222_2622_01_000056 on node=DynamicServiceInstance [alive=true, host=3n01:15001
with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://3n01:15002,
mgmtPort=15004]
2016-11-18 20:20:01,982 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult
for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_0, priority=65, startTime=10550817928,
containerId=container_222212222_2622_01_000056, assignedInstance=DynamicServiceInstance [alive=true,
host=3n01:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://3n01:15002, mgmtPort=15004], uniqueId=55, localityDelayTimeout=9223372036854775807}
= SCHEDULED
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt:
[attempt_1478967587833_2622_1_06_000031_0] started. Is using containerId: [container_222212222_2622_01_000056]
on NM: [3n01:15001]
2016-11-18 20:20:03,427 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0,
startTime=1479500403427, containerId=container_222212222_2622_01_000056, nodeId=3n01:15001
2016-11-18 20:20:03,430 [INFO] [TaskCommunicator # 1] |tezplugins.LlapTaskCommunicator|: Successfully
launched task: attempt_1478967587833_2622_1_06_000031_0
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_0
sent events: (0-1).
2016-11-18 20:20:03,434 [INFO] [IPC Server handler 11 on 43092] |impl.VertexImpl|: Sending
attempt_1478967587833_2622_1_06_000031_0 24 events [0,24) total 24 vertex_1478967587833_2622_1_06
[Map 1]
2016-11-18 20:25:43,249 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_0,
creationTime=1479500401929, allocationTime=1479500403426, startTime=1479500403427, finishTime=1479500743249,
timeTaken=339822, status=FAILED, taskFailureType=NON_FATAL, errorEnum=TASK_HEARTBEAT_ERROR,
diagnostics=AttemptID:attempt_1478967587833_2622_1_06_000031_0 Timed out after 300 secs, nodeHttpAddress=http://3n01:15002,
counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, DATA_LOCAL_TASKS=1
2016-11-18 20:25:43,255 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Processing de-allocate request for task=attempt_1478967587833_2622_1_06_000031_0, state=ASSIGNED,
endReason=OTHER
2016-11-18 20:25:43,259 [INFO] [Dispatcher thread {Central}] |node.AMNodeImpl|: Attempt failed
on node: 3n01:15001 TA: attempt_1478967587833_2622_1_06_000031_0 failed: true container: container_222212222_2622_01_000056
numFailedTAs: 7
2016-11-18 20:25:43,262 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source task
attempt completed for vertex: vertex_1478967587833_2622_1_07 [Reducer 2] attempt: attempt_1478967587833_2622_1_06_000031_0
with state: FAILED vertexState: RUNNING
{noformat}
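For reference, this first failure was caught by the AM's task heartbeat timeout (errorEnum=TASK_HEARTBEAT_ERROR, "Timed out after 300 secs"). A minimal sketch of that kind of per-attempt liveness watchdog, using hypothetical names rather than the actual Tez AM classes:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; the real logic lives inside the Tez AM's
// heartbeat handling, not in a standalone class like this.
public class HeartbeatWatchdog {
  private static final long TIMEOUT_MS = 300_000L; // "Timed out after 300 secs"

  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
  private final ScheduledExecutorService checker =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    // Periodically fail any attempt that has not pinged within the timeout.
    checker.scheduleAtFixedRate(this::expireStaleAttempts, 30, 30, TimeUnit.SECONDS);
  }

  public void register(String attemptId) {
    lastHeartbeat.put(attemptId, System.currentTimeMillis());
  }

  public void onHeartbeat(String attemptId) {
    lastHeartbeat.replace(attemptId, System.currentTimeMillis());
  }

  public void deregister(String attemptId) {
    lastHeartbeat.remove(attemptId);
  }

  private void expireStaleAttempts() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
      if (now - e.getValue() > TIMEOUT_MS) {
        lastHeartbeat.remove(e.getKey());
        failAttempt(e.getKey()); // would surface as TASK_HEARTBEAT_ERROR
      }
    }
  }

  private void failAttempt(String attemptId) {
    System.out.println("Failing unresponsive attempt: " + attemptId);
  }
}
{code}
A safety net like this only covers attempts that are actually being tracked; the second attempt below appears to fall out of it entirely.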
Second attempt:
{noformat}
2016-11-18 20:25:43,267 [INFO] [TaskSchedulerEventHandlerThread] |tezplugins.LlapTaskSchedulerService|:
Received allocateRequest. task=attempt_1478967587833_2622_1_06_000031_1, priority=64, capability=<memory:4096,
vCores:1>, hosts=null
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult
for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0,
containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807}
= DELAYED_RESOURCES
2016-11-18 20:25:43,297 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Preempting
for task=attempt_1478967587833_2622_1_06_000031_1 on any available host
2016-11-18 20:25:43,298 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult
for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0,
containerId=null, assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807}
= DELAYED_RESOURCES
2016-11-18 20:25:43,299 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: Assigned
task TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=0, containerId=null,
assignedInstance=null, uniqueId=318, localityDelayTimeout=9223372036854775807} to container
container_222212222_2622_01_000164 on node=DynamicServiceInstance [alive=true, host=1n02:15001
with resources=<memory:59392, vCores:16>, shufflePort=15551, servicesAddress=http://1n02:15002,
mgmtPort=15004]
2016-11-18 20:25:43,300 [INFO] [LlapScheduler] |tezplugins.LlapTaskSchedulerService|: ScheduleResult
for Task: TaskInfo{task=attempt_1478967587833_2622_1_06_000031_1, priority=64, startTime=10551159245,
containerId=container_222212222_2622_01_000164, assignedInstance=DynamicServiceInstance [alive=true,
host=1n02:15001 with resources=<memory:59392, vCores:16>, shufflePort=15551,
servicesAddress=http://1n02:15002, mgmtPort=15004], uniqueId=318, localityDelayTimeout=9223372036854775807}
= SCHEDULED
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: TaskAttempt:
[attempt_1478967587833_2622_1_06_000031_1] started. Is using containerId: [container_222212222_2622_01_000164]
on NM: [1n02:15001]
2016-11-18 20:25:43,301 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_STARTED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1,
startTime=1479500743301, containerId=container_222212222_2622_01_000164, nodeId=1n02:15001
2016-11-18 20:25:43,485 [INFO] [TaskCommunicator # 7] |tezplugins.LlapTaskCommunicator|: Successfully
launched task: attempt_1478967587833_2622_1_06_000031_1
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.TaskImpl|: TaskAttempt:attempt_1478967587833_2622_1_06_000031_1
sent events: (0-1).
2016-11-18 20:25:43,500 [INFO] [IPC Server handler 5 on 43092] |impl.VertexImpl|: Sending
attempt_1478967587833_2622_1_06_000031_1 24 events [0,24) total 24 vertex_1478967587833_2622_1_06
[Map 1]
{noformat}
The IPC thread logs other activity 250ms later, so it is not stuck, nor did it log an error.
After this, there are no further mentions of any attempts of task 2622_1_06_000031 in the log.
The attempt was not running in LLAP.

However, the attempt does come up again when the DAG is killed, a long time after it got stuck;
presumably, the AM failed to detect the failure (or the heartbeat failure, for that matter) and never
tried to reschedule.
{noformat}
2016-11-22 08:45:14,228 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1478967587833_2622_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Map 1, taskAttemptId=attempt_1478967587833_2622_1_06_000031_1,
creationTime=1479500743255, allocationTime=1479500743300, startTime=1479500743301, finishTime=1479804314228,
timeTaken=303570927, status=KILLED, errorEnum=TERMINATED_BY_CLIENT, diagnostics=Task KILL
is received. Killing attempt. Diagnostics: Killing tasks in vertex: vertex_1478967587833_2622_1_06
[Map 1] due to trigger: DAG_TERMINATED, nodeHttpAddress=http://1n02:15002, counters=Counters:
1, org.apache.tez.common.counters.DAGCounter, OTHER_LOCAL_TASKS=1
{noformat}
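If that hypothesis is right, one defensive fix is to arm the liveness timer as soon as the launch RPC succeeds, rather than on the first heartbeat, so an attempt launched into a daemon that dies moments later still times out and gets rescheduled. A sketch reusing the HeartbeatWatchdog above; the names are illustrative and this is not the actual LlapTaskCommunicator API:
{code:java}
// Hypothetical guard: start the clock at "Successfully launched task: ...",
// closing the window where an attempt never heartbeats and never times out.
public class LaunchGuard {
  private final HeartbeatWatchdog watchdog = new HeartbeatWatchdog();

  public void onLaunchSucceeded(String attemptId) {
    watchdog.register(attemptId); // arm the timer immediately at launch
  }

  public void onHeartbeat(String attemptId) {
    watchdog.onHeartbeat(attemptId);
  }

  public void onAttemptFinished(String attemptId) {
    watchdog.deregister(attemptId); // normal completion stops tracking
  }
}
{code}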

I do see some connection errors to this node elsewhere in the log (Connection refused), but they happen later.
Let me know if you need anything else from the AM log.
{noformat}
2016-11-18 20:27:15,902 [WARN] [TaskCommunicator # 8] |ipc.Client|: Failed to connect to server:
1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in MILLISECONDS):
16000
2016-11-18 20:27:34,517 [WARN] [TaskCommunicator # 3] |ipc.Client|: Failed to connect to server:
1n02/(snip).2:15001: retries get failed due to exceeded maximum allowed time (in MILLISECONDS):
16000
{noformat}

It seems like a parallel query caused the memory exhaustion on LLAP; most mappers for this
job eventually completed successfully, a few were lost as described above, and the DAG got stuck
because of that.
Below are the relevant Slider containers and restarts (the last one appears to be the one where
the attempt was running).
Unfortunately, the logs from the relevant LLAP containers are missing from aggregation.
{noformat}
2016-11-18 20:20:24,822 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role
instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000005', container=ContainerID=container_e55_1478967587833_2465_01_000005
nodeID=3n01:45454  ... diagnostics='Container [pid=1636544,containerID=container_e55_1478967587833_2465_01_000005]
is running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical memory
used; 141.8 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:20:31,912 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role
instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000003', container=ContainerID=container_e55_1478967587833_2465_01_000003
nodeID=1n02:45454  ... diagnostics='Container [pid=607173,containerID=container_e55_1478967587833_2465_01_000003]
is running beyond physical memory limits. Current usage: 141.0 GB of 140.3 GB physical memory
used; 142.5 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:25:39,248 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role
instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000007', container=ContainerID=container_e55_1478967587833_2465_01_000007
nodeID=3n01:45454  ... diagnostics='Container [pid=1227721,containerID=container_e55_1478967587833_2465_01_000007]
is running beyond physical memory limits. Current usage: 140.3 GB of 140.3 GB physical memory
used; 143.2 GB of 294.5 GB virtual memory used. Killing container.
2016-11-18 20:26:30,250 [AMRM Callback Handler Thread] ERROR appmaster.SliderAppMaster - Role
instance RoleInstance{role='LLAP', id='container_e55_1478967587833_2465_01_000009', container=ContainerID=container_e55_1478967587833_2465_01_000009
nodeID=1n02:45454  ... diagnostics='Container [pid=1893586,containerID=container_e55_1478967587833_2465_01_000009]
is running beyond physical memory limits. Current usage: 141.2 GB of 140.3 GB physical memory
used; 144.4 GB of 294.5 GB virtual memory used. Killing container.

2016-11-18 20:20:41,060 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService - Component
operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:20:49,171 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService - Component
operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:25:55,942 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService - Component
operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
2016-11-18 20:26:47,019 [145589505@qtp-14633842-4] INFO  agent.AgentProviderService - Component
operation. Status: COMPLETED; new container state: HEALTHY; new component state: STARTED
{noformat}
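For the OOM kills above, a rough headroom check shows how a daemon blows through its YARN physical-memory limit: heap plus off-heap cache plus JVM/native overhead has to stay under the container size. All sizes below are assumptions for illustration, not the actual cluster configuration:
{code:java}
public class LlapMemoryCheck {
  public static void main(String[] args) {
    double yarnLimitGb = 140.3;   // physical limit from the Slider diagnostics
    double heapGb = 110.0;        // assumed daemon Xmx
    double offHeapCacheGb = 24.0; // assumed off-heap LLAP IO cache
    double overheadGb = 8.0;      // assumed metaspace, threads, native buffers

    double projected = heapGb + offHeapCacheGb + overheadGb;
    System.out.printf("Projected %.1f GB vs limit %.1f GB -> %s%n",
        projected, yarnLimitGb,
        projected > yarnLimitGb ? "container gets killed" : "ok");
  }
}
{code}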



> AM may lose task failures and not reschedule when scheduling to LLAP
> --------------------------------------------------------------------
>
>                 Key: HIVE-15296
>                 URL: https://issues.apache.org/jira/browse/HIVE-15296
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Siddharth Seth



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
