tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TEZ-2863) Container, node, and logs not available in UI for tasks that fail to launch
Date Mon, 29 Feb 2016 04:29:18 GMT

    [ https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171388#comment-15171388
] 

Jeff Zhang commented on TEZ-2863:
---------------------------------

 [~jeagles] I attached 2 addendum patch based on your patch. [~hitesh] Please help review
the addendum patch. 

Recovery event handling is in a different thread as ATS event handling. It is possible that
TaskAttemptFinshedEvent in logged as recovery event but has not been logged as ATS event.
 So we need to recover TaskAttemptFinished properly to ensure we send the correct ATS event
in recovering. 

* add containerId in TaskAttemptFinishedEvent as recovery part
* nodeHttpAddress is not not needed for completedLogsUrl

> Container, node, and logs not available in UI for tasks that fail to launch
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2863
>                 URL: https://issues.apache.org/jira/browse/TEZ-2863
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, TEZ-2863.2.patch,
TEZ-2863.3-branch-0.7.patch, TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, TEZ-2863.3.patch.addendum
>
>
> While running a sample tez job
> {noformat}
> tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 -Dtez.task.launch.cmd-opts="-Xmx1m"
input output
> {noformat}
> It was noticed that the Tez UI task attempt http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id
was missing the TEZ_ATTEMPT_STARTED event
> {noformat}
> 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_000000,
scheduledTime=1443711835342, launchTime=1443711835342
> 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] |util.RackResolver|:
Resolved localhost to /default-rack
> 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|:
Allocation request for task: attempt_1443711816411_0001_1_00_000000_0 with request: Capability[<memory:1,
vCores:1>]Priority[2] host: localhost rack: null
> 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] |impl.AMRMClientImpl|: Received
new token for : localhost:57381
> 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|:
Assigning container to task: containerId=container_1443711816411_0001_01_000002, task=attempt_1443711816411_0001_1_00_000000_0,
containerHost=localhost:57381, containerPriority= 2, containerResources=<memory:1024, vCores:1>,
localityMatchType=NodeLocal, matchedLocation=localhost, honorLocalityFlags=true, reusedContainer=false,
delayedContainers=0
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] |node.AMNodeTracker|: Adding
new node: localhost:57381
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |launcher.ContainerLauncherImpl|:
Launching container_1443711816411_0001_01_000002
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |impl.ContainerManagementProtocolProxy|:
Opening proxy : localhost:57381
> 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] |history.HistoryEventHandler|:
[HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: containerId=container_1443711816411_0001_01_000002,
launchTime=1443711836741
> 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|:
Allocated container completed:container_1443711816411_0001_01_000002 last allocated to task:
attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] |container.AMContainerImpl|:
Container container_1443711816411_0001_01_000002 exited with diagnostics set to Container
failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:455)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> 	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> 2015-10-01 10:03:57,649 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:CONTAINER_STOPPED]: containerId=container_1443711816411_0001_01_000002,
stoppedTime=1443711837649, exitStatus=1
> 2015-10-01 10:03:57,652 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Tokenizer,
taskAttemptId=attempt_1443711816411_0001_1_00_000000_0, creationTime=1443711835341, allocationTime=0,
startTime=0, finishTime=1443711837650, timeTaken=1443711837650, status=FAILED, errorEnum=CONTAINER_EXITED,
diagnostics=Container container_1443711816411_0001_01_000002 finished with diagnostics set
to [Container failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:455)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> 	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> ], counters=Counters: 0
> 2015-10-01 10:03:57,653 [INFO] [Dispatcher thread {Central}] |counters.Limits|: Counter
limits initialized with parameters:  GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64,
MAX_COUNTERS=1200
> 2015-10-01 10:03:57,657 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Scheduling
new attempt for task: task_1443711816411_0001_1_00_000000, currentFailedAttempts: 1, maxFailedAttempts:
4
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|:
Ignoring removal of unknown task: attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.TaskSchedulerEventHandler|:
Task: attempt_1443711816411_0001_1_00_000000_0 has no container assignment in the scheduler
> 2015-10-01 10:03:57,658 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source
task attempt completed for vertex: vertex_1443711816411_0001_1_01 [Summation] attempt: attempt_1443711816411_0001_1_00_000000_0
with state: FAILED vertexState: RUNNING
> {noformat}
> The TASK_ATTEMPTED_STARTED event contains the inProgressURL, however. It makes it very
difficult for the user to debug their bad jvm args.
> The _stdout_ for the failed container has the launch failure reason in this sceario
> {noformat} 
> Error occurred during initialization of VM
> Too small initial heap for new size specified
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message