hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3802) If an MR AM dies twice it looks like the process freezes
Date Tue, 07 Feb 2012 18:10:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202585#comment-13202585 ]

Robert Joseph Evans commented on MAPREDUCE-3802:
------------------------------------------------

OK, I found the issue, sort of, and it has nothing to do with order.  The issue is with the
name of the task attempt.  If the task was completed by the first AM and recovered by the
second AM, the name of the task attempt in the jhist file looks like attempt_1328637230353_0001_m_000000_0,
but the Recovery Service is trying to recover a task with attempt ID attempt_1328637230353_0001_m_000000_1000,
which appears to be the format for attempts that completed successfully with the second AM.
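
To make the mismatch concrete, here is a rough sketch in Python (not Hadoop code) that splits the two attempt names quoted above into their fields and shows that they refer to the same task but differ in the trailing attempt number, so an exact lookup by attempt ID misses. The field names, the helper functions, and the "SUCCEEDED" value are hypothetical and only for illustration; the dictionary is a stand-in for whatever was read back from the jhist file, not the Recovery Service's actual data structure.

```python
import re
from typing import NamedTuple

# Layout assumed from the two names in this comment:
# attempt_<timestamp>_<job>_<m|r>_<task>_<attempt>
ATTEMPT_RE = re.compile(r"^attempt_(\d+)_(\d+)_([mr])_(\d+)_(\d+)$")


class AttemptId(NamedTuple):
    cluster_ts: int   # hypothetical field names, chosen for this sketch
    job_seq: int
    task_type: str    # "m" for map, "r" for reduce
    task_seq: int
    attempt_no: int


def parse_attempt_id(text: str) -> AttemptId:
    """Split an attempt name into its numeric fields."""
    match = ATTEMPT_RE.match(text)
    if match is None:
        raise ValueError(f"not a task attempt name: {text!r}")
    ts, job, kind, task, attempt = match.groups()
    return AttemptId(int(ts), int(job), kind, int(task), int(attempt))


def same_task(a: AttemptId, b: AttemptId) -> bool:
    """True when both attempts belong to the same task (all fields but the last)."""
    return a[:4] == b[:4]


# The name found in the jhist file and the name being recovered.
recorded = parse_attempt_id("attempt_1328637230353_0001_m_000000_0")
requested = parse_attempt_id("attempt_1328637230353_0001_m_000000_1000")

# Hypothetical stand-in for the attempts read from the history file.
history = {recorded: "SUCCEEDED"}

print(same_task(recorded, requested))  # same task
print(requested in history)            # but the exact attempt ID is not found
```

With these two names the first check is true and the second is false: the task matches, but the attempt being asked for is not the one that was recorded.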

I need to understand a little better how these names are determined, and where they are
set, so I can determine how to fix the issue.  I don't see how this could be a problem
only for a single-node cluster.
                
> If an MR AM dies twice  it looks like the process freezes
> ---------------------------------------------------------
>
>                 Key: MAPREDUCE-3802
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3802
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.1, 0.24.0
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: syslog
>
>
> It looks like recovering from an MR AM dying works very well on a single failure, but if it fails multiple times we appear to get into a livelock situation.
> {noformat}
> yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*-SNAPSHOT.jar wordcount -Dyarn.app.mapreduce.am.log.level=DEBUG -Dmapreduce.job.reduces=30 input output
> 12/02/03 21:06:57 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
> 12/02/03 21:06:57 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
> 12/02/03 21:06:57 INFO input.FileInputFormat: Total input paths to process : 17
> 12/02/03 21:06:57 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/02/03 21:06:57 WARN snappy.LoadSnappy: Snappy native library not loaded
> 12/02/03 21:06:57 INFO mapreduce.JobSubmitter: number of splits:17
> 12/02/03 21:06:57 INFO mapred.ResourceMgrDelegate: Submitted application application_1328302034486_0003 to ResourceManager at HOST/IP:8040
> 12/02/03 21:06:57 INFO mapreduce.Job: The url to track the job: http://HOST:8088/proxy/application_1328302034486_0003/
> 12/02/03 21:06:57 INFO mapreduce.Job: Running job: job_1328302034486_0003
> 12/02/03 21:07:03 INFO mapreduce.Job: Job job_1328302034486_0003 running in uber mode : false
> 12/02/03 21:07:03 INFO mapreduce.Job:  map 0% reduce 0%
> 12/02/03 21:07:09 INFO mapreduce.Job:  map 5% reduce 0%
> 12/02/03 21:07:10 INFO mapreduce.Job:  map 17% reduce 0%
> #KILLED AM with kill -9 here
> 12/02/03 21:07:16 INFO mapreduce.Job:  map 29% reduce 0%
> 12/02/03 21:07:17 INFO mapreduce.Job:  map 35% reduce 0%
> 12/02/03 21:07:30 INFO mapreduce.Job:  map 52% reduce 0%
> 12/02/03 21:07:35 INFO mapreduce.Job:  map 58% reduce 0%
> 12/02/03 21:07:37 INFO mapreduce.Job:  map 70% reduce 0%
> 12/02/03 21:07:41 INFO mapreduce.Job:  map 76% reduce 0%
> 12/02/03 21:07:43 INFO mapreduce.Job:  map 82% reduce 0%
> 12/02/03 21:07:44 INFO mapreduce.Job:  map 88% reduce 0%
> 12/02/03 21:07:47 INFO mapreduce.Job:  map 94% reduce 0%
> 12/02/03 21:07:49 INFO mapreduce.Job:  map 100% reduce 0%
> 12/02/03 21:07:53 INFO mapreduce.Job:  map 100% reduce 3%
> 12/02/03 21:08:00 INFO mapreduce.Job:  map 100% reduce 6%
> 12/02/03 21:08:06 INFO mapreduce.Job:  map 100% reduce 10%
> 12/02/03 21:08:12 INFO mapreduce.Job:  map 100% reduce 13%
> 12/02/03 21:08:18 INFO mapreduce.Job:  map 100% reduce 16%
> #killed AM with kill -9 here
> 12/02/03 21:08:20 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 0 time(s).
> 12/02/03 21:08:21 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 1 time(s).
> 12/02/03 21:08:22 INFO ipc.Client: Retrying connect to server: HOST/IP:44223. Already tried 2 time(s).
> 12/02/03 21:08:26 INFO mapreduce.Job:  map 64% reduce 16%
> #It never makes any more progress...
> {noformat}

