hadoop-common-dev mailing list archives

From "Devaraj Das (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-4595) JVM Reuse triggers RuntimeException("Invalid state")
Date Mon, 10 Nov 2008 05:14:44 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-4595:
--------------------------------

    Attachment: 4595.patch

This patch fixes a race condition in updating the free slot count under high load (which
leads to lost TaskTrackers, or TTs). When a TT reinits, the TaskLauncher object is created
again. A task that is still running and takes time to exit might end up incrementing the
free slots of the new TaskLauncher object. This would lead to the behavior described by
Aaron in the bug report. The patch fixes this by moving all code that increments free slots
into one method, and the increment is now done inline in TaskInProgress.kill.
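
To illustrate the idea of funnelling every slot release through one method, here is a
minimal sketch. It is not the code from 4595.patch: the names SlotAccountingSketch, Launcher,
Task and releaseSlot are hypothetical, and binding each task to the launcher that granted
its slot is an assumption made for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; this is not the TaskTracker code from the patch.
final class SlotAccountingSketch {

    /** Stand-in for a TaskLauncher: owns a free-slot counter. */
    static final class Launcher {
        private final AtomicInteger freeSlots;

        Launcher(int slots) { freeSlots = new AtomicInteger(slots); }

        void take() { freeSlots.decrementAndGet(); }
        void give() { freeSlots.incrementAndGet(); }
        int free() { return freeSlots.get(); }
    }

    /** Stand-in for a task in progress that holds one slot. */
    static final class Task {
        private final Launcher owner;  // assumption: remember who granted the slot
        private final AtomicBoolean released = new AtomicBoolean(false);

        Task(Launcher owner) {
            this.owner = owner;
            owner.take();
        }

        /** The single place where a slot is returned; repeated calls are no-ops. */
        void releaseSlot() {
            if (released.compareAndSet(false, true)) {
                owner.give();
            }
        }

        void kill() { releaseSlot(); }    // release inline when the task is killed
        void onExit() { releaseSlot(); }  // a late exit cannot release a second time
    }

    public static void main(String[] args) {
        Launcher before = new Launcher(2);
        Task slow = new Task(before);

        // "Reinit": running tasks are killed, then a fresh launcher is created.
        slow.kill();
        Launcher after = new Launcher(2);

        // The slow task finally exits after the reinit.
        slow.onExit();

        System.out.println("old launcher free slots: " + before.free());
        System.out.println("new launcher free slots: " + after.free());
    }
}
```

In this sketch the late exit leaves the new launcher's count at 2, because the slot was
already returned once in kill() and was never owned by the new launcher.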

In addition, the patch fixes a race condition in starting the MapEventsFetcher thread.
The thread enters its loop only after checking the TaskTracker.running flag. However, when a
TT reinits, the running field is set to true only after the thread is spawned. If the thread
is scheduled immediately, it finds running set to false and exits. This would lead to hung
reduces.
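
The ordering problem can be sketched as follows. This is one common way to avoid such a
race (set the flag before starting the thread), not necessarily what 4595.patch does. The
class FetcherStartSketch and its methods are hypothetical stand-ins, not Hadoop APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the TaskTracker / MapEventsFetcher pair.
final class FetcherStartSketch {
    // Plays the role of TaskTracker.running; volatile so stop() is seen by the loop.
    private volatile boolean running = false;
    private final CountDownLatch firstIteration = new CountDownLatch(1);
    private Thread fetcher;

    void start() {
        // Set the flag BEFORE spawning the thread. Thread.start() happens-before
        // everything the new thread does, so its first check sees running == true.
        running = true;
        fetcher = new Thread(() -> {
            while (running) {
                firstIteration.countDown();  // record that the loop body ran
                try {
                    Thread.sleep(5);         // placeholder for fetching map events
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "map-events-fetcher-sketch");
        fetcher.setDaemon(true);
        fetcher.start();
    }

    /** Returns true if the loop body ran at least once within the timeout. */
    boolean awaitFirstIteration(long timeoutMillis) {
        try {
            return firstIteration.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void stop() {
        running = false;
        fetcher.interrupt();
        try {
            fetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        FetcherStartSketch sketch = new FetcherStartSketch();
        sketch.start();
        System.out.println("loop entered: " + sketch.awaitFirstIteration(5000));
        sketch.stop();
    }
}
```

If the assignment to running were moved after fetcher.start(), the thread could observe
false on its first check and exit at once, which is the hang described above.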

I also cleaned up some code to do with TIP.cleanup during a task launch.


> JVM Reuse triggers RuntimeException("Invalid state")
> ----------------------------------------------------
>
>                 Key: HADOOP-4595
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4595
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Aaron Kimball
>            Assignee: Devaraj Das
>             Fix For: 0.19.0
>
>         Attachments: 4595.patch
>
>
> A Reducer triggers the following exception:
> 08/11/05 08:58:50 INFO mapred.JobClient: Task Id : attempt_200811040110_0230_r_000008_1, Status : FAILED
> java.lang.RuntimeException: Inconsistent state!!! JVM Manager reached an unstable state while reaping a JVM for task: attempt_200811040110_0230_r_000008_1 Number of active JVMs:2
>  JVMId jvm_200811040110_0230_r_-735233075 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000012_0
>  JVMId jvm_200811040110_0230_r_-1716942642 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000040_0
>    at java.lang.Throwable.<init>(Throwable.java:67)
>    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:245)
>    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)
>    at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:78)
>    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:410) 
> Other clues:
> In the three reduce task attempts where this was observed, the failing attempt was
> attempt _1. Attempt _0 had started and eventually switched to "SUCCEEDED", so I think
> this is happening only on speculatively executed reduce task attempts. The reduce output
> (part-XXXXX) gets lost when this attempt fails, even though the other (earlier) attempt
> succeeded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

