hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6011) Improve history server behavior during a recovery error
Date Mon, 28 Jul 2014 18:06:38 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076493#comment-14076493 ]

Jason Lowe commented on MAPREDUCE-6011:
---------------------------------------

Sample error where bad token state caused history server startup to fail without saying which
file contained the bad token state:

{noformat}
2014-07-11 22:51:14,977 [main] INFO impl.MetricsSystemImpl: JobHistoryServer metrics system started
2014-07-11 22:51:16,079 [main] INFO hs.HistoryServerFileSystemStateStoreService: Loading history server state from hdfs:/xx
2014-07-11 22:51:46,747 [main] INFO service.AbstractService: Service org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer$HistoryServerSecretManagerService failed in state STARTED; cause: java.io.EOFException
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:267)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.readFields(AbstractDelegationTokenIdentifier.java:179)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadToken(HistoryServerFileSystemStateStoreService.java:295)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokensFromBucket(HistoryServerFileSystemStateStoreService.java:314)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokens(HistoryServerFileSystemStateStoreService.java:353)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokenState(HistoryServerFileSystemStateStoreService.java:367)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadState(HistoryServerFileSystemStateStoreService.java:114)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer$HistoryServerSecretManagerService.serviceStart(JobHistoryServer.java:89)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.serviceStart(JobHistoryServer.java:194)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:220)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:229)
2014-07-11 22:51:46,749 [main] INFO service.AbstractService: Service org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer failed in state STARTED; cause: org.apache.hadoop.service.ServiceStateException: java.io.EOFException
org.apache.hadoop.service.ServiceStateException: java.io.EOFException
        at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
        at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.serviceStart(JobHistoryServer.java:194)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.launchJobHistoryServer(JobHistoryServer.java:220)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer.main(JobHistoryServer.java:229)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:267)
        at org.apache.hadoop.security.token.delegation.AbstractDelegationTokenIdentifier.readFields(AbstractDelegationTokenIdentifier.java:179)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadToken(HistoryServerFileSystemStateStoreService.java:295)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokensFromBucket(HistoryServerFileSystemStateStoreService.java:314)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokens(HistoryServerFileSystemStateStoreService.java:353)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadTokenState(HistoryServerFileSystemStateStoreService.java:367)
        at org.apache.hadoop.mapreduce.v2.hs.HistoryServerFileSystemStateStoreService.loadState(HistoryServerFileSystemStateStoreService.java:114)
        at org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer$HistoryServerSecretManagerService.serviceStart(JobHistoryServer.java:89)
        at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
        ... 5 more
2014-07-11 22:51:46,750 [main] INFO impl.MetricsSystemImpl: Stopping JobHistoryServer metrics system...
{noformat}

Note the lack of detail on which token was being loaded. Also, the failure is logged at INFO;
it should be logged at least at the WARN level if we let the JHS continue past this error, or
at the ERROR level if it remains fatal to startup.
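A minimal sketch of the "more specifics" option, assuming a simplified on-disk layout in which each token file holds an `int` sequence number followed by a `long` expiry time. That layout and the names `TokenFileLoader`, `TokenRecord`, and `loadToken` are hypothetical; this is not the real `HistoryServerFileSystemStateStoreService` code or the `AbstractDelegationTokenIdentifier` wire format. It uses `java.util.logging` to stay self-contained, where `SEVERE` is the closest level to ERROR. The only point is that the low-level `EOFException` gets wrapped in an exception whose message names the file being read.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical loader: one token record per file, failures name the file. */
public class TokenFileLoader {
  private static final Logger LOG = Logger.getLogger(TokenFileLoader.class.getName());

  /** Hypothetical stand-in for a delegation token identifier. */
  public static final class TokenRecord {
    public final int sequenceNumber;
    public final long expiryTime;

    TokenRecord(int sequenceNumber, long expiryTime) {
      this.sequenceNumber = sequenceNumber;
      this.expiryTime = expiryTime;
    }
  }

  public static TokenRecord loadToken(Path tokenFile) throws IOException {
    try (InputStream in = Files.newInputStream(tokenFile);
         DataInputStream data = new DataInputStream(in)) {
      int seq = data.readInt();       // EOFException if the file is truncated
      long expiry = data.readLong();  // likewise
      return new TokenRecord(seq, expiry);
    } catch (IOException e) {
      // Fatal path: log at SEVERE and rethrow with the offending path in the
      // message, keeping the original exception as the cause.
      String msg = "Failed to load token state from " + tokenFile;
      LOG.log(Level.SEVERE, msg, e);
      throw new IOException(msg, e);
    }
  }
}
```

Because the original exception is kept as the cause, the stack trace still shows where parsing failed, while the top-level message tells the operator which file to inspect or remove.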

> Improve history server behavior during a recovery error
> -------------------------------------------------------
>
>                 Key: MAPREDUCE-6011
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6011
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobhistoryserver
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>
> Currently when the history server encounters an error during recovery it is fatal without
> specific details on the error (e.g. which token was involved during the recovery error).
> We should either allow the history server to proceed past recovery errors or provide more
> specifics on the offending token involved in the fatal error to aid in manual recovery.
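The other option in the description, letting the history server proceed past recovery errors, could look roughly like the following sketch. It assumes each token file in a bucket directory holds a single `long`; the class `TolerantTokenRecovery`, its `recover` method, and the file format are hypothetical, and `java.util.logging`'s `WARNING` level stands in for WARN. Unreadable token files are skipped and reported instead of aborting startup, while a failure to list the directory itself is still propagated.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical tolerant recovery pass over one bucket of token files. */
public class TolerantTokenRecovery {
  private static final Logger LOG = Logger.getLogger(TolerantTokenRecovery.class.getName());

  /** Tokens that loaded, plus the files that had to be skipped. */
  public static final class RecoveryResult {
    public final Map<Path, Long> loaded = new TreeMap<>();
    public final List<Path> skipped = new ArrayList<>();
  }

  /** Hypothetical format: each file holds a single long value. */
  static long readTokenFile(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file);
         DataInputStream data = new DataInputStream(in)) {
      return data.readLong();
    }
  }

  public static RecoveryResult recover(Path bucketDir) throws IOException {
    RecoveryResult result = new RecoveryResult();
    // Errors opening or listing the directory itself still propagate.
    try (DirectoryStream<Path> files = Files.newDirectoryStream(bucketDir)) {
      for (Path file : files) {
        try {
          result.loaded.put(file, readTokenFile(file));
        } catch (IOException e) {
          // Keep going, but make the skip visible at WARNING with the file name.
          LOG.log(Level.WARNING, "Skipping unreadable token file " + file, e);
          result.skipped.add(file);
        }
      }
    }
    return result;
  }
}
```

Returning the skipped paths, rather than only logging them, lets the caller decide whether too many skips should still be treated as fatal.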



--
This message was sent by Atlassian JIRA
(v6.2#6252)
