hadoop-mapreduce-issues mailing list archives

From "Vinod Kumar Vavilapalli (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time
Date Thu, 02 Feb 2012 05:37:53 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3711:
-----------------------------------------------

       Fix Version/s: 0.23.1
    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)

I did a full review; this is quite an involved change. The main FileOutputCommitter (FOC)
changes are fine. Some comments:
 - The condition in RecoveryService is wrong. It should be (!(iAmAMap && numReduces == 0))
or, more simply, (iAMReduce || numReduces > 0). Please see if you can add a test case
validating the different cases possible here (maps only, and maps + reduces, crossed with
recovery for maps and recovery for reduces).
 - This bug also happens with mapred.FOC, which we need to fix. While you are at it, please
see if we can reuse code; large chunks of code are the same in mapred.FOC and
mapreduce.lib.output.FOC.
 - A test that recovers multiple tasks would have caught this issue. Can you add one in this
patch?
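
The equivalence between the negated form of the condition and the simpler form in the first
comment follows from De Morgan's law. As a minimal sketch (not the actual RecoveryService
code), the check below enumerates the cases, assuming every task is either a map or a reduce
and that numReduces is never negative. The class and method names are hypothetical and exist
only for illustration.

```java
public class RecoveryConditionSketch {

    // Negated form from the review: !(iAmAMap && numReduces == 0).
    public static boolean negatedForm(boolean iAmAMap, int numReduces) {
        return !(iAmAMap && numReduces == 0);
    }

    // Simpler form from the review: iAmReduce || numReduces > 0.
    // Assumption: a task that is not a map is a reduce.
    public static boolean simplifiedForm(boolean iAmAMap, int numReduces) {
        boolean iAmReduce = !iAmAMap;
        return iAmReduce || numReduces > 0;
    }

    public static void main(String[] args) {
        boolean[] taskKinds = {true, false};   // true = map task, false = reduce task
        int[] reduceCounts = {0, 1, 680};      // map-only job, small job, job from the report

        for (boolean iAmAMap : taskKinds) {
            for (int numReduces : reduceCounts) {
                boolean a = negatedForm(iAmAMap, numReduces);
                boolean b = simplifiedForm(iAmAMap, numReduces);
                if (a != b) {
                    throw new AssertionError("forms disagree for iAmAMap=" + iAmAMap
                            + ", numReduces=" + numReduces);
                }
                System.out.println("iAmAMap=" + iAmAMap
                        + " numReduces=" + numReduces + " -> " + a);
            }
        }
    }
}
```

The only combination for which both forms are false is a map task in a job with zero reduces,
which is the map-only case the requested test should cover alongside the maps + reduces cases.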
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3711.txt, MR-3711.txt
>
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with a sort job on 350 scale having 16800 maps and 680 reduces:
> 1. 70 secs after job submission, the AM was killed using kill -9; around 3900 maps were
> completed and 680 reduces were scheduled. A second AM was started. The job completed in
> 980 secs. The AM took very little time to recover.
> 2. 150 secs after job submission, the AM was killed using kill -9; around 90% of maps were
> completed and 680 reduces were scheduled. A second AM was started. The job completed in
> 1000 secs. The AM recovered.
> 3. 150 secs after job submission, the AM was killed using kill -9; almost all maps were
> completed and only the 680 reduces were running. Recovery was too slow: the AM was still
> recovering after 1 hr 40 min when I killed the run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
