hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
Date Mon, 06 Feb 2012 15:03:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13201329#comment-13201329
] 

Robert Joseph Evans commented on MAPREDUCE-3802:
------------------------------------------------

That appears to be the case. We are getting an NPE which is caused by calling RecoverService.getTaskAttemptInfo()
and getting a null back.  RecoveryService.getTaskAttemptInfo() first gets a task info, and
then gets a task attempt info from inside that task.  It looks like the task info is parsed
and populated just fine, but the task attempt info is not.  That seems to be caused by no
TaskAttemptStarted events being put in the history log at all during the recovery process.
 This also seems like no MapAttemptFinishedEvents, ReduceAttemp0tFinishedEvents, TaskAttemptFailedEvents
nor TaskAttemptFinishedEvents are in the log either, or we would get null pointer exceptions
while parsing them too.
                
> If an MR AM dies twice  it looks like the process freezes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3802
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Robert Joseph Evans
>            Priority: Critical
>         Attachments: syslog
>
>
> It looks like recovering from an RM AM dieing works very well on a single failure.  But
if it fails multiple times we appear to get into a live lock situation.
> {noformat}
> yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount
-Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
> 12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use
fs.defaultFS
> 12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated.
Instead, use mapreduce.client.genericoptionsparser.used
> 12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
> 12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
> 12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
> 12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003
to ResourceManager at HOST/IP:8040
> 12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
> 12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
> 12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode
: false
> 12/02/03 21:07:03 INFO mapreduce.Job:  map 0% reduce 0%
> 12/02/03 21:07:09 INFO mapreduce.Job:  map 5% reduce 0%
> 12/02/03 21:07:10 INFO mapreduce.Job:  map 17% reduce 0%
> #KILLED AM with kill -9 here
> 12/02/03 21:07:16 INFO mapreduce.Job:  map 29% reduce 0%
> 12/02/03 21:07:17 INFO mapreduce.Job:  map 35% reduce 0%
> 12/02/03 21:07:30 INFO mapreduce.Job:  map 52% reduce 0%
> 12/02/03 21:07:35 INFO mapreduce.Job:  map 58% reduce 0%
> 12/02/03 21:07:37 INFO mapreduce.Job:  map 70% reduce 0%
> 12/02/03 21:07:41 INFO mapreduce.Job:  map 76% reduce 0%
> 12/02/03 21:07:43 INFO mapreduce.Job:  map 82% reduce 0%
> 12/02/03 21:07:44 INFO mapreduce.Job:  map 88% reduce 0%
> 12/02/03 21:07:47 INFO mapreduce.Job:  map 94% reduce 0%
> 12/02/03 21:07:49 INFO mapreduce.Job:  map 100% reduce 0%
> 12/02/03 21:07:53 INFO mapreduce.Job:  map 100% reduce 3%
> 12/02/03 21:08:00 INFO mapreduce.Job:  map 100% reduce 6%
> 12/02/03 21:08:06 INFO mapreduce.Job:  map 100% reduce 10%
> 12/02/03 21:08:12 INFO mapreduce.Job:  map 100% reduce 13%
> 12/02/03 21:08:18 INFO mapreduce.Job:  map 100% reduce 16%
> #killed AM with kill -9 here
> 12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already
tried 0 time(s).
> 12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already
tried 1 time(s).
> 12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already
tried 2 time(s).
> 12/02/03 21:08:26 INFO mapreduce.Job:  map 64% reduce 16%
> #It never makes any more progress...
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message