hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3824) Distributed caches are not removed properly
Date Tue, 07 Feb 2012 16:06:59 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13202496#comment-13202496 ]

Robert Joseph Evans commented on MAPREDUCE-3824:

I like the concept of the patch.  Volatile is definitely needed here; my bad on that one.
I also like that you are doing a DU to update the size of the cached objects when it is
0.  I do have some issues with the patch, though.

The first is that even though the DU size update is being done on a separate thread, it is
being done with the cachedArchives lock held.  The amount of time it takes to do a DU could
be significant, and nothing new can be added to the cache while the cachedArchives lock is held,
so it could block new tasks from making progress.  I would really prefer to see
this done in two passes, similar to how we delete old entries.  The first pass would go through
all entries and identify those that need to be updated; the second pass would update
those entries without the lock held.  Then, once we have all of the entries updated, we can
look at cleaning up the distributed cache.
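The two-pass pattern described above can be sketched roughly as follows.  This is only an
illustration under assumed names: CacheEntry, collectStale, and diskUsage are hypothetical
stand-ins for CacheStatus and Hadoop's DU, not the actual TaskTracker code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoPassSizeUpdate {
    // Hypothetical stand-in for CacheStatus.
    static class CacheEntry {
        final String path;
        volatile long size;   // volatile, as noted above
        CacheEntry(String path, long size) { this.path = path; this.size = size; }
    }

    private final Map<String, CacheEntry> cachedArchives = new HashMap<>();
    private final Object lock = new Object();   // stands in for the cachedArchives lock

    void addEntry(String path, long size) {
        synchronized (lock) { cachedArchives.put(path, new CacheEntry(path, size)); }
    }

    // Pass 1: under the lock, only *collect* the entries whose size is unknown.
    List<CacheEntry> collectStale() {
        List<CacheEntry> stale = new ArrayList<>();
        synchronized (lock) {
            for (CacheEntry e : cachedArchives.values()) {
                if (e.size == 0) stale.add(e);
            }
        }
        return stale;
    }

    // Pass 2: outside the lock, run the (potentially slow) DU per entry.
    // New entries can still be added to the cache while this runs.
    void updateSizes(List<CacheEntry> stale) {
        for (CacheEntry e : stale) {
            e.size = diskUsage(e.path);
        }
    }

    // Placeholder for a real disk-usage walk; fixed value for the sketch.
    long diskUsage(String path) { return 4096L; }

    public static void main(String[] args) {
        TwoPassSizeUpdate c = new TwoPassSizeUpdate();
        c.addEntry("/cache/a.jar", 0);
        c.addEntry("/cache/b.jar", 1024);
        List<CacheEntry> stale = c.collectStale();  // only a.jar needs an update
        c.updateSizes(stale);
        System.out.println(stale.size());
        System.out.println(c.cachedArchives.get("/cache/a.jar").size);
    }
}
```

The key point is that the lock is held only long enough to snapshot which entries
need a DU, never for the DU itself.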

The second is that we are updating the size too late.  We decide how much space needs to be
deleted to get us back under the desired limit based entirely on the size reported by BaseDirManager,
which in turn gets its data from the CacheStatus object.  The issue is that the current
patch first calculates how much needs to be removed, then updates the size of the archives,
and only then deletes them.  This is not that critical, because it just means the entries
would be deleted on the next pass, so this is really very minor, but it would also be covered
by doing the update in two passes.
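The ordering point reduces to a minimal sketch like the following, where all method
names are illustrative only: the sizes must be refreshed before the over-limit amount
is measured, or the eviction decision is based on stale numbers.

```java
import java.util.ArrayList;
import java.util.List;

public class EvictionOrder {
    static List<String> log = new ArrayList<>();

    static void updateSizes()    { log.add("update"); }               // refresh via DU
    static long bytesOverLimit() { log.add("measure"); return 512; }  // reads reported sizes
    static void evict(long b)    { log.add("evict " + b); }

    public static void main(String[] args) {
        // The patch as written does measure -> update -> evict, so the
        // measurement misses the freshly computed sizes.  The corrected
        // order updates first:
        updateSizes();
        long toFree = bytesOverLimit();
        evict(toFree);
        System.out.println(String.join(",", log));
    }
}
```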

I am not sure exactly which situations leave the size unset.  I would like
to know exactly which cases the current code is missing, because, as I said previously,
the code that computes the used size goes entirely off of what is reported to BaseDirManager.
Unfortunately, there are some issues with BaseDirManager where, if we are too aggressive in
setting the size, we might double count some archives.  That would eventually make
BaseDirManager think it is full all the time, which would be very bad.
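One way to guard against that double counting would be to mark each cache entry the
first time its size is reported, so a second report is a no-op.  This is a rough sketch
under assumed names; the counted flag and addCacheSize method are hypothetical, not the
real BaseDirManager API.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class CountOnce {
    // Hypothetical stand-in for CacheStatus.
    static class CacheStatus {
        final AtomicBoolean counted = new AtomicBoolean(false);
        final long size;
        CacheStatus(long size) { this.size = size; }
    }

    // Stands in for BaseDirManager's running total of used space.
    private final AtomicLong totalUsed = new AtomicLong();

    // Add an entry's size to the total at most once, even if two code
    // paths race to report the same archive.
    void addCacheSize(CacheStatus s) {
        if (s.counted.compareAndSet(false, true)) {
            totalUsed.addAndGet(s.size);
        }
    }

    public static void main(String[] args) {
        CountOnce mgr = new CountOnce();
        CacheStatus s = new CacheStatus(2048);
        mgr.addCacheSize(s);
        mgr.addCacheSize(s);   // duplicate report is ignored
        System.out.println(mgr.totalUsed.get());
    }
}
```

With a guard like this, being aggressive about setting sizes cannot inflate the total,
so BaseDirManager would not spuriously report the directory as full.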

> Distributed caches are not removed properly
> -------------------------------------------
>                 Key: MAPREDUCE-3824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: distributed-cache
>    Affects Versions: 1.0.0
>            Reporter: Allen Wittenauer
>            Priority: Critical
>         Attachments: MAPREDUCE-3824-branch-1.0.txt
> Distributed caches are not being properly removed by the TaskTracker when they are expected
> to be expired.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

