tez-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TEZ-2863) Container, node, and logs not available in UI for tasks that fail to launch
Date Mon, 29 Feb 2016 04:09:18 GMT

     [ https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jeff Zhang updated TEZ-2863:
----------------------------
    Attachment: TEZ-2863.3.patch.addendum
                TEZ-2863.3-branch-0.7.patch.addendum

> Container, node, and logs not available in UI for tasks that fail to launch
> ---------------------------------------------------------------------------
>
>                 Key: TEZ-2863
>                 URL: https://issues.apache.org/jira/browse/TEZ-2863
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Jonathan Eagles
>         Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, TEZ-2863.2.patch,
TEZ-2863.3-branch-0.7.patch, TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, TEZ-2863.3.patch.addendum
>
>
> While running a sample tez job
> {noformat}
> tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 -Dtez.task.launch.cmd-opts="-Xmx1m"
input output
> {noformat}
> It was noticed that the Tez UI task attempt http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id
was missing the TEZ_ATTEMPT_STARTED event
> {noformat}
> 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_000000,
scheduledTime=1443711835342, launchTime=1443711835342
> 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] |util.RackResolver|:
Resolved localhost to /default-rack
> 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|:
Allocation request for task: attempt_1443711816411_0001_1_00_000000_0 with request: Capability[<memory:1,
vCores:1>]Priority[2] host: localhost rack: null
> 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] |impl.AMRMClientImpl|: Received
new token for : localhost:57381
> 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|:
Assigning container to task: containerId=container_1443711816411_0001_01_000002, task=attempt_1443711816411_0001_1_00_000000_0,
containerHost=localhost:57381, containerPriority= 2, containerResources=<memory:1024, vCores:1>,
localityMatchType=NodeLocal, matchedLocation=localhost, honorLocalityFlags=true, reusedContainer=false,
delayedContainers=0
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved
localhost to /default-rack
> 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] |node.AMNodeTracker|: Adding
new node: localhost:57381
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |launcher.ContainerLauncherImpl|:
Launching container_1443711816411_0001_01_000002
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |impl.ContainerManagementProtocolProxy|:
Opening proxy : localhost:57381
> 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] |history.HistoryEventHandler|:
[HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: containerId=container_1443711816411_0001_01_000002,
launchTime=1443711836741
> 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|:
Allocated container completed:container_1443711816411_0001_01_000002 last allocated to task:
attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] |container.AMContainerImpl|:
Container container_1443711816411_0001_01_000002 exited with diagnostics set to Container
failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:455)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> 	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> 2015-10-01 10:03:57,649 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:CONTAINER_STOPPED]: containerId=container_1443711816411_0001_01_000002,
stoppedTime=1443711837649, exitStatus=1
> 2015-10-01 10:03:57,652 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|:
[HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=Tokenizer,
taskAttemptId=attempt_1443711816411_0001_1_00_000000_0, creationTime=1443711835341, allocationTime=0,
startTime=0, finishTime=1443711837650, timeTaken=1443711837650, status=FAILED, errorEnum=CONTAINER_EXITED,
diagnostics=Container container_1443711816411_0001_01_000002 finished with diagnostics set
to [Container failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_000002
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:455)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> 	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> ], counters=Counters: 0
> 2015-10-01 10:03:57,653 [INFO] [Dispatcher thread {Central}] |counters.Limits|: Counter
limits initialized with parameters:  GROUP_NAME_MAX=256, MAX_GROUPS=500, COUNTER_NAME_MAX=64,
MAX_COUNTERS=1200
> 2015-10-01 10:03:57,657 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: Scheduling
new attempt for task: task_1443711816411_0001_1_00_000000, currentFailedAttempts: 1, maxFailedAttempts:
4
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|:
Ignoring removal of unknown task: attempt_1443711816411_0001_1_00_000000_0
> 2015-10-01 10:03:57,658 [INFO] [TaskSchedulerEventHandlerThread] |rm.TaskSchedulerEventHandler|:
Task: attempt_1443711816411_0001_1_00_000000_0 has no container assignment in the scheduler
> 2015-10-01 10:03:57,658 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Source
task attempt completed for vertex: vertex_1443711816411_0001_1_01 [Summation] attempt: attempt_1443711816411_0001_1_00_000000_0
with state: FAILED vertexState: RUNNING
> {noformat}
> The TASK_ATTEMPTED_STARTED event contains the inProgressURL, however. It makes it very
difficult for the user to debug their bad jvm args.
> The _stdout_ for the failed container has the launch failure reason in this sceario
> {noformat} 
> Error occurred during initialization of VM
> Too small initial heap for new size specified
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message