hadoop-mapreduce-issues mailing list archives

From "Allen Wittenauer (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly
Date Tue, 07 Feb 2012 18:44:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202612#comment-13202612 ]

Allen Wittenauer commented on MAPREDUCE-3824:

There is no doubt the patch is a hack, but it solved my immediate problems because as it stands,
distributed caches are really broken at scale.

Some background: I have a team of users who have several 36GB distributed caches. When these
caches are in play, most of the system is basically locked while the caches get built.
This patch was really geared towards making sure that these massive caches at least get deleted.
Without these patches in place, the mapred tmp spaces fill and tasks fail, eventually leading
to mapred framework collapse.

There are a lot of other problems that show up with caches this large:
* Hadoop doesn't have a size limit check on caches as part of the job submission process. [So
any hand waving about "don't use caches that big!" is null and void, since there is no way
to actually stop a user from doing that!]
* the setup and cleanup tasks also trigger cache downloads.
* tasktrackers appear to be frozen for *all* tasks during cache downloads, with the task stuck
in the extremely unhelpful "unassigned" state.
* the methodology of updating the private cache as a different step seems unnecessary given
the permissions at the file system level.
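
As an illustration of the first point above, here is a minimal sketch of what a size limit
check at submission time could look like. It is not Hadoop code and not part of the attached
patch: the class name CacheSizeCheck, both method names, the example paths, and the 10GB limit
are all hypothetical, and the cache sizes are assumed to be known up front rather than looked
up from the file system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical submission-time guard; not part of Hadoop's API. */
public class CacheSizeCheck {

    /** Sums the declared sizes (in bytes) of all cache entries. */
    public static long totalBytes(Map<String, Long> cacheSizes) {
        long total = 0L;
        for (long size : cacheSizes.values()) {
            total += size;
        }
        return total;
    }

    /** Throws if the combined cache size exceeds the given limit. */
    public static void checkLimit(Map<String, Long> cacheSizes, long maxBytes) {
        long total = totalBytes(cacheSizes);
        if (total > maxBytes) {
            throw new IllegalArgumentException(
                "distributed cache total " + total
                    + " bytes exceeds limit " + maxBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024L * 1024L;
        Map<String, Long> caches = new LinkedHashMap<>();
        caches.put("hdfs:///example/cache-a", 36L * gb); // hypothetical path
        caches.put("hdfs:///example/cache-b", 36L * gb); // hypothetical path
        try {
            checkLimit(caches, 10L * gb); // hypothetical 10GB limit
            System.out.println("accepted");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of rejecting at submission is that the user gets an immediate error instead of a
tasktracker discovering the problem halfway through a download.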

What really needs to happen is a massive overhaul of the entire distributed cache system.
But that's a bigger project, preferably for someone who gets paid to do Hadoop development
full time.  So, like all of the patches I've been submitting lately, I'm not expecting this one
to get committed. But it is enough of a patch for someone who needs a usable system until
a working release ships.

> Distributed caches are not removed properly
> -------------------------------------------
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
> Distributed caches are not being properly removed by the TaskTracker when they are expected
> to be expired.
