hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts
Date Fri, 11 Apr 2014 22:18:18 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jason Lowe updated MAPREDUCE-5652:

    Attachment: MAPREDUCE-5652-v5.patch

Thanks for the review, Karthik!

bq. How do we handle applications that finish while the NM is down?

The RM is already telling the NM about finished applications, and since it's not creating
a new RMNodeImpl when it rejoins it should continue to tell it about subsequent finished applications
since the last time the node heartbeated.

There may be some races in there where the RM thinks the NM processed the heartbeat response
but the NM crashed before it completed processing it.  If that's an issue we could fix it
by having the NM send the list of apps it thinks are still active when it re-registers, giving
the RM an opportunity to correct it on the next heartbeat.  However that's outside of the
scope (and project) of this JIRA.  I'll try to address that in YARN-1354 or possibly a separate

bq. Code related to initStateStore should ideally go into serviceInit(), primarily to future-proof
against us supporting (re)starting stopped services.

serviceStart() indirectly calls initStateStore to open the database, and serviceStop() closes
the database.  If we want later to support restarting stopped services then we need to continue
to open the database in serviceStart().  Moving the initStateStore call to init rather than
start means we will try to use a closed database after the service is restarted, or am I missing

bq. Use the constant JOB here?

Fixed.  I had to change its visibility to public in order to access it.

bq. ShuffleHandler#recordJobShuffleInfo: addJobToken() should come after attempt to include
in the store?


bq. ShuffleHandler#close() should probably take care of clearing the static maps. Alternately,
we could just make those maps non-static.

I made userRsrc and secretManager regular members rather than static.  Originally I wasn't
sure why they were static and didn't want to mess with that too much as part of this JIRA,
but it's problematic for testing this.  Apparently they always were static, but I couldn't
find any reason for them to be so.  Even in light of multiple ShuffleHandler instances, I
don't see why it's something we need (or necessarily even want) to share.  Manually ran the
unit tests under hadoop-mapreduce-client-jobclient to verify there wasn't something fishy
going on with the multi-instance ShuffleHandler in mini clusters or something like that.

bq. ShuffleHander#forgetJob() - should we make those two maps non-static?

Removed forgetJob() now that the members are not static.

bq. Do we need to change hadoop-mapreduce-project/pom.xml, given we already add the dependencies
in the shuffle module?

Yes, if I remove the dependency from that pom.xml then the leveldb jar doesn't show up in
the resulting dist tree under mapreduce/lib/.  Maybe may need the equivalent of YARN-888 for
hadoop-mapreduce-project poms to only declare them in the leaf modules and still have them
be picked up properly.

> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>                 Key: MAPREDUCE-5652
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Jason Lowe
>              Labels: shuffle
>         Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, MAPREDUCE-5652-v4.patch,
MAPREDUCE-5652-v5.patch, MAPREDUCE-5652.patch
> ShuffleHandler should work across NM restarts and not require re-running map-tasks. On
NM restart, the map outputs are cleaned up requiring re-execution of map tasks and should
be avoided.

This message was sent by Atlassian JIRA

View raw message