hadoop-yarn-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
Date Wed, 13 May 2015 15:03:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542052#comment-14542052 ]

Hadoop QA commented on YARN-3641:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 39s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit |   0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle |   0m 35s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace |   0m  0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install |   1m 34s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs |   1m  3s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   6m  0s | Tests passed in hadoop-yarn-server-nodemanager. |
| | |  42m  5s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12732578/YARN-3641.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 065d8f2 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-server-nodemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/7921/artifact/patchprocess/testrun_hadoop-yarn-server-nodemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7921/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7921/console |


This message was automatically generated.

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-3641.patch
>
>
> If the NM's services are not stopped properly, the NM cannot be started again with work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code is as below in NodeManager.java:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see that all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of these services throws an exception while stopping, stopRecoveryStore() is skipped, which means the LevelDB store is never closed. The next time the NM starts, it fails with the exception above.
> We should put stopRecoveryStore(); in a finally block.
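
The proposal in the description above can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the attached patch: `FinallyStopSketch`, `SketchNodeManager`, `stopSubServices()`, and the `recoveryStoreClosed` flag are hypothetical stand-ins for the NodeManager, `super.serviceStop()`, and the LevelDB store, and the simulated failure is an assumption made only to exercise the exception path.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names are Hadoop classes.
class FinallyStopSketch {

    static class SketchNodeManager {
        private final AtomicBoolean isStopping = new AtomicBoolean(false);
        // Stands in for "the LevelDB store has released its LOCK file".
        boolean recoveryStoreClosed = false;

        // Stand-in for super.serviceStop(): simulates a sub-service
        // that throws while it is being stopped.
        void stopSubServices() throws Exception {
            throw new Exception("simulated sub-service stop failure");
        }

        // Stand-in for stopRecoveryStore().
        void stopRecoveryStore() {
            recoveryStoreClosed = true;
        }

        void serviceStop() throws Exception {
            if (isStopping.getAndSet(true)) {
                return;
            }
            try {
                stopSubServices();
            } finally {
                // Runs even when stopSubServices() throws, so the store
                // is always closed before the exception propagates.
                stopRecoveryStore();
            }
        }
    }

    public static void main(String[] args) {
        SketchNodeManager nm = new SketchNodeManager();
        try {
            nm.serviceStop();
        } catch (Exception e) {
            System.out.println("stop failed: " + e.getMessage());
        }
        System.out.println("recovery store closed: " + nm.recoveryStoreClosed);
    }
}
```

In this sketch the exception from the sub-services still reaches the caller, but `recoveryStoreClosed` is `true` afterwards, which corresponds to the lock being released before the next start. Where the `DefaultMetricsSystem.shutdown()` call from the original method should go is left out here, since the description does not say.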



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
