hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2873) improve LevelDB error handling for missing files DBException to avoid NM start failure.
Date Tue, 18 Nov 2014 15:03:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216259#comment-14216259
] 

Jason Lowe commented on YARN-2873:
----------------------------------

I have some serious concerns about this approach.  As I mentioned during the discussions on
YARN-2816, this is trying to recover from a completely invalid setup.  If something is coming
along and deleting (i.e., corrupting) parts of the database, then _that_ is the problem that
needs to be corrected rather than worked around in the NM.  Reaching into the internals of
the leveldb files and assuming we can just delete some files and the database will open isn't
a general solution.  At that point arbitrary state has been lost, potentially entire
container/application lifecycles, and who knows what will happen.

Instead of assuming we know how leveldb internals work (an assumption that could be invalidated
if we upgrade the leveldb dependency), we should use JniDBFactory.factory.repair to try to
repair the database rather than deleting files here and there ourselves.  Arguably, if
leveldb's own repair doesn't work and we insist that the NM must come up at all costs,
then we should just nuke the database and start without state.  Of course, the log should then
be filled with errors to indicate this was in no way a normal startup.
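
The fallback order described above (open, then repair, then wipe as a last resort) could look roughly like the sketch below.  This is an illustration under assumptions, not the patch under review: StoreFactory is a hypothetical stand-in that mirrors the open/repair/destroy trio of org.iq80.leveldb.DBFactory (the interface JniDBFactory.factory implements; the real methods also take an Options argument), and the class name and log messages are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

final class RecoveringOpen {
  private static final Logger LOG = Logger.getLogger(RecoveringOpen.class.getName());

  /**
   * Hypothetical stand-in so the sketch compiles without leveldbjni.
   * It mirrors the shape of org.iq80.leveldb.DBFactory, whose real
   * open/repair/destroy methods additionally take an Options argument.
   */
  public interface StoreFactory<T> {
    T open(File path) throws IOException;
    void repair(File path) throws IOException;
    void destroy(File path) throws IOException;
  }

  /** Open the store; on failure try the library's repair; as a last resort wipe it. */
  public static <T> T openWithRecovery(StoreFactory<T> factory, File path)
      throws IOException {
    try {
      return factory.open(path);
    } catch (IOException | RuntimeException openFailure) {
      LOG.log(Level.SEVERE,
          "State store " + path + " failed to open, attempting repair", openFailure);
    }

    try {
      // Let the library repair its own files instead of deleting them by hand.
      factory.repair(path);
      T store = factory.open(path);
      LOG.severe("State store " + path + " opened after repair; state may be incomplete");
      return store;
    } catch (IOException | RuntimeException repairFailure) {
      LOG.log(Level.SEVERE,
          "Repair of " + path + " failed, discarding ALL recovery state", repairFailure);
    }

    // Last resort: start without state. With a real DBFactory the reopen is
    // assumed to use options that create the database when it is missing.
    factory.destroy(path);
    return factory.open(path);
  }
}
```

With leveldbjni on the classpath, a thin adapter that forwards to JniDBFactory.factory.open, repair and destroy with the NM's Options would plug into openWithRecovery.  Whether the wipe step should run at all is a policy decision; the sketch only shows the ordering and that every fallback is logged loudly.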

> improve LevelDB error handling for missing files DBException to avoid NM start failure.
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-2873
>                 URL: https://issues.apache.org/jira/browse/YARN-2873
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2873.000.patch, YARN-2873.001.patch
>
>
> improve LevelDB error handling for missing files DBException to avoid NM start failure.
> We saw the following three LevelDB exceptions, all of which cause NM start failure.
> DBException 1 in ShuffleHandler:
> {code}
> INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state STARTED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceStart(AuxServices.java:159)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:441)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:261)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:446)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/nm-aux-services/mapreduce_shuffle/mapreduce_shuffle_state/000005.sst
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.mapred.ShuffleHandler.startStore(ShuffleHandler.java:475)
> 	at org.apache.hadoop.mapred.ShuffleHandler.recoverState(ShuffleHandler.java:443)
> 	at org.apache.hadoop.mapred.ShuffleHandler.serviceStart(ShuffleHandler.java:379)
> 	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> 	... 10 more
> {code}
> DBException 2 in NMLeveldbStateStoreService:
> {code}
> Error starting NodeManager
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 1 missing files; e.g.: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/000005.sst
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> {code}
> DBException 3 in NMLeveldbStateStoreService:
> {code}
> INFO	org.apache.hadoop.service.AbstractService	
> Service org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService failed in state INITED; cause: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file or directory
> org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state/MANIFEST-000004: No such file or directory
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:842)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:195)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:152)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:190)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> {code}
> DBExceptions 1 and 2 are due to the sorted table file 000005.sst being deleted accidentally.
> DBException 3 is due to the MANIFEST file being deleted accidentally.
> It would be better to handle these errors than to have the NM fail to start with a DBException.
> For these DBExceptions, if we delete the LevelDB text file CURRENT, the NM will recover successfully from the DBException.
> CURRENT is a simple text file that contains the name of the latest MANIFEST file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
