hadoop-mapreduce-issues mailing list archives

From "Craig Welch (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAPREDUCE-6251) JobClient needs additional retries at a higher level to address not-immediately-consistent dfs corner cases
Date Tue, 10 Feb 2015 21:58:11 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Welch updated MAPREDUCE-6251:
-----------------------------------
    Attachment: MAPREDUCE-6251.0.patch

Attached is a patch which locates the retry where it is effective in capturing the state and
which provides a configurable retry count/interval; this will address the issue for most
reasonable "eventual consistency" timeframes.  Without changing the overall handoff mechanism
to no longer be based on the DFS, this is the best type of fix I believe we can achieve.  Moving
to synchronous calls to report history to the JobHistory server is another option I think we
should consider, but that is a more significant change better left for down the road - in the
meantime this should work around the issue for most cases.
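The retry described above can be sketched as a small helper that polls a lookup until it returns a result or the configured attempts are exhausted. This is an illustrative sketch only, not the attached patch; the class name, method name, and defaults are invented here:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the configurable retry described above: poll a
// lookup until it yields a non-null result or the attempts run out.
// Names and parameters are illustrative, not taken from the actual patch.
public final class EventuallyConsistentLookup {

    public static <T> T retryUntilNonNull(Callable<T> lookup,
                                          int maxAttempts,
                                          long intervalMillis)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = lookup.call();         // e.g. scan the history dir on the DFS
            if (result != null) {
                return result;                // the history file became visible
            }
            if (attempt < maxAttempts) {
                Thread.sleep(intervalMillis); // wait out the consistency lag
            }
        }
        return null;                          // still invisible: "no such job"
    }
}
```

With a count/interval exposed through configuration, the product of the two bounds the "eventual consistency" window the client can tolerate before giving up.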

> JobClient needs additional retries at a higher level to address not-immediately-consistent
> dfs corner cases
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6251
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6251
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver, mrv2
>    Affects Versions: 2.2.0
>            Reporter: Craig Welch
>            Assignee: Craig Welch
>         Attachments: MAPREDUCE-6251.0.patch
>
>
> The JobClient is used to get job status information for running and completed jobs. Final
> state and history for a job are communicated from the application master to the job history
> server via a distributed file system: the history is uploaded by the application master to
> the dfs and then scanned/loaded by the jobhistory server.  While HDFS has strong consistency
> guarantees, not all Hadoop-compatible distributed file systems do.  When used in conjunction
> with a distributed file system which does not make this guarantee, there will be cases where
> the history server does not see an uploaded file, resulting in the dreaded "no such job" and
> a null value for the RunningJob in the client.
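The race in the description can be modeled with a toy store whose reads only see a write after a visibility delay, which is how a file system without read-after-write consistency behaves from the history server's point of view. This is purely an illustrative model, not Hadoop code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the race described above: writes become visible to readers
// only after a lag, like a DFS without read-after-write consistency.
// Purely illustrative; no Hadoop classes are involved.
public final class LaggedStore {
    private final Map<String, Long> visibleAt = new ConcurrentHashMap<>();
    private final long lagMillis;

    public LaggedStore(long lagMillis) {
        this.lagMillis = lagMillis;
    }

    // "Upload" a history file: readers see it only after lagMillis.
    public void put(String path) {
        visibleAt.put(path, System.currentTimeMillis() + lagMillis);
    }

    // "Scan" for the file: returns false during the consistency window,
    // which is exactly when a single, unretried lookup reports "no such job".
    public boolean exists(String path) {
        Long visibleTime = visibleAt.get(path);
        return visibleTime != null && System.currentTimeMillis() >= visibleTime;
    }
}
```

A single `exists` call issued inside the lag window misses the file even though the upload has completed, which is why a retry at a higher level, rather than a one-shot scan, is needed.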



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
