hadoop-mapreduce-issues mailing list archives

From "Harsh J (Reopened) (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (MAPREDUCE-3969) ConcurrentModificationException in JobHistory.java
Date Fri, 23 Mar 2012 18:21:29 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J reopened MAPREDUCE-3969:
--------------------------------

      Assignee: Harsh J

Arun,

Did you get a chance to check the issue before closing this? It is very much present in
branch-1 too. The reported version was set to 0.20.2, but per the environment field it was
CDH3, which shares much of the same security code, so it is worthwhile to at least take a
look.

It's very rare to hit this, but it is present, and it is pretty critical once it hits (it requires
a specific TT to be restarted to resume the tasks, else the job just hangs).

The fix would be to synchronize the writers list object before entering the loop that can
potentially modify the list, inside of JobHistory.log(…):

{code}
+    synchronized (writers) {
       for (Iterator<PrintWriter> iter = writers.iterator(); iter.hasNext();) {
         PrintWriter out = iter.next();
         out.println(builder.toString());
         if (out.checkError() && id != null) {
           LOG.info("Logging failed for job " + id + " removing PrintWriter from FileManager");
           iter.remove();
         }
       }
+    }
{code}

It's the conditional {{iter.remove();}} call that causes the issue, as it is done without a
lock (while others may access the same list of writers in parallel, which is what makes this
doubly rare to hit).
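
To make the idea concrete outside of Hadoop, here is a minimal, self-contained sketch of the same pattern. It is not the actual JobHistory code: the class {{WriterListSketch}}, the methods {{logToAll}} and {{failingWriter}}, and the simulated failing writer are all made up for illustration. It also assumes that every other piece of code touching the list takes the same lock, which would need to be verified in the real class.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the logging loop; not the real JobHistory class.
class WriterListSketch {

    // Writes the line to every writer and drops writers that report an error.
    // The whole iteration, including iter.remove(), runs while holding the
    // list's monitor, so two threads can never iterate and remove at once.
    static int logToAll(List<PrintWriter> writers, String line) {
        int removed = 0;
        synchronized (writers) {
            for (Iterator<PrintWriter> iter = writers.iterator(); iter.hasNext();) {
                PrintWriter out = iter.next();
                out.println(line);
                if (out.checkError()) {
                    iter.remove();
                    removed++;
                }
            }
        }
        return removed;
    }

    // Simulates a broken log file: every write or flush throws, which
    // PrintWriter records internally and later reports via checkError().
    static PrintWriter failingWriter() {
        return new PrintWriter(new Writer() {
            @Override
            public void write(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated write failure");
            }

            @Override
            public void flush() throws IOException {
                throw new IOException("simulated flush failure");
            }

            @Override
            public void close() {
                // nothing to release in this sketch
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        List<PrintWriter> writers = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            writers.add(i % 2 == 0 ? new PrintWriter(new StringWriter()) : failingWriter());
        }
        // Two threads log through the same list, as two RPC handlers might.
        Runnable task = () -> logToAll(writers, "event");
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println("writers left: " + writers.size());
    }
}
```

The lock only helps if adds and removals elsewhere synchronize on the same object. A plain {{ArrayList}} iterator fails fast as soon as any unsynchronized structural modification slips in between two {{next()}} calls.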

Do you agree with the above analysis, Arun? If so, I'll post a patch. Else let me know what
I've missed here.
                
> ConcurrentModificationException in JobHistory.java
> --------------------------------------------------
>
>                 Key: MAPREDUCE-3969
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3969
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>         Environment: cdh3u1 hadoop distribution, centos 5.5.
>            Reporter: Alexey Zotov
>            Assignee: Harsh J
>
> {code}
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201202150320_3709_m_000148_0' has completed task_201202150320_3709_m_000148 successfully.
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.mapred.JobHistory: Logging failed for job job_201202150320_3709removing PrintWriter from FileManager
> 2012-03-01 04:24:47,479 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8021, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@61069281, false, false, true, -21317) from <TASKTRACKER-IP>:45041: error: java.io.IOException: java.util.ConcurrentModificationException
> java.io.IOException: java.util.ConcurrentModificationException
>         at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
>         at java.util.AbstractList$Itr.next(AbstractList.java:343)
>         at org.apache.hadoop.mapred.JobHistory.log(JobHistory.java:591)
>         at org.apache.hadoop.mapred.JobHistory$MapAttempt.logFinished(JobHistory.java:1735)
>         at org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2515)
>         at org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:1200)
>         at org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:4539)
>         at org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:3503)
>         at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:3202)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1428)
> {code}
> Task task_201202150320_3709_m_000148 was marked as failed at that moment (but was not restarted), and job execution froze.


       
