hadoop-mapreduce-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6252) JobHistoryServer should not fail when encountering a missing directory
Date Mon, 27 Apr 2015 15:31:42 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514296#comment-14514296 ]

Hudson commented on MAPREDUCE-6252:

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #177 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/177/])
MAPREDUCE-6252. JobHistoryServer should not fail when encountering a (devaraj: rev 5e67c4d384193b38a85655c8f93193596821faa5)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/HistoryFileManager.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/test/java/org/apache/hadoop/mapreduce/v2/hs/TestHistoryFileManager.java

> JobHistoryServer should not fail when encountering a missing directory
> ----------------------------------------------------------------------
>                 Key: MAPREDUCE-6252
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6252
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.6.0
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>             Fix For: 2.8.0
>         Attachments: MAPREDUCE-6252.0.patch, MAPREDUCE-6252.1.patch
> The JobHistoryServer maintains a cache that maps job serial number parts to dfs paths,
which it uses when seeking a job it no longer holds in its memory cache.  A given serial
number may map to multiple directories, differentiated by time stamp.  At present the
JobHistoryServer fails any time it attempts to find a job in a cached directory which no
longer exists - even though the job may well exist in a different directory for the same
serial number.  Typically this is not an issue, but the history cleanup process removes a
directory from dfs before removing it from the cache, which leaves a window of time in which
a directory present in the cache is missing from dfs, resulting in failure.  On some dfs
implementations the top level directory appears to become unavailable some time before the
full deletion of the tree completes, which extends what might otherwise be a brief period of
failure into a longer one.  Further, this places the service at the mercy of outside
processes which might remove those directories.  The proposal is simply to make the server
resilient to this state, such that encountering a missing directory is not fatal and the
lookup continues on to seek the job elsewhere.
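
The proposed behaviour can be illustrated with a minimal sketch. This is not the code from the attached patches and does not use HistoryFileManager or the Hadoop FileSystem API; it uses java.nio.file against a local file system as a stand-in for dfs, and the class name TolerantHistoryLookup, the method findJobFile, and the "file name starts with the job id" matching rule are all hypothetical. The point is only that a cached directory which has vanished is skipped instead of aborting the search.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.NotDirectoryException;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical illustration only; not the MAPREDUCE-6252 patch. */
final class TolerantHistoryLookup {

    private TolerantHistoryLookup() {
    }

    /**
     * Searches the cached candidate directories, in order, for a file whose
     * name starts with jobId (an assumed naming rule for this sketch).
     * A directory that no longer exists is treated as a stale cache entry
     * and skipped, so the search continues with the remaining directories.
     */
    static Optional<Path> findJobFile(List<Path> cachedDirs, String jobId)
            throws IOException {
        for (Path dir : cachedDirs) {
            if (!Files.isDirectory(dir)) {
                // Already gone (or never a directory): not fatal, try the next one.
                continue;
            }
            try (Stream<Path> entries = Files.list(dir)) {
                Optional<Path> hit = entries
                        .filter(p -> p.getFileName().toString().startsWith(jobId))
                        .findFirst();
                if (hit.isPresent()) {
                    return hit;
                }
            } catch (NoSuchFileException | NotDirectoryException e) {
                // The directory disappeared between the check and the listing,
                // e.g. because a cleanup ran concurrently. Skip it as well.
            }
        }
        return Optional.empty();
    }
}
```

Other I/O errors still propagate to the caller in this sketch; only the "directory is missing" case is swallowed, which matches the narrow scope of the proposal.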

