Date: Tue, 21 Mar 2017 01:08:41 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-6302) Fail the node, if Linux Container Executor is not configured properly


    [ https://issues.apache.org/jira/browse/YARN-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933925#comment-15933925 ]

ASF GitHub Bot commented on YARN-6302:
--------------------------------------

Github user szegedim commented on a diff in the pull request:

    https://github.com/apache/hadoop/pull/200#discussion_r107053877

    --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeHealthCheckerService.java ---
    @@ -54,22 +58,35 @@ protected void serviceInit(Configuration conf) throws Exception {
        * @return the reporting string of health of the node
        */
       String getHealthReport() {
    +    String healthReport = "";
         String scriptReport = (nodeHealthScriptRunner == null) ? "" : nodeHealthScriptRunner.getHealthReport();
    -    if (scriptReport.equals("")) {
    -      return dirsHandler.getDisksHealthReport(false);
    -    } else {
    -      return scriptReport.concat(SEPARATOR + dirsHandler.getDisksHealthReport(false));
    +    String discReport = dirsHandler.getDisksHealthReport(false);
    +    String exceptionReport = nodeHealthException != null ?
    +        nodeHealthException.getMessage() : "";
    +
    +    if (!scriptReport.equals("")) {
    +      healthReport = scriptReport;
    +    }
    +    if (!discReport.equals("")) {
    +      healthReport = healthReport.equals("") ? discReport :
    +          healthReport.concat(SEPARATOR + discReport);
         }
    +    if (!exceptionReport.equals("")) {
    +      healthReport = healthReport.equals("") ? exceptionReport :
    +          healthReport.concat(SEPARATOR + exceptionReport);
    +    }
    +    return healthReport;
       }

       /**
        * @return true if the node is healthy
        */
       boolean isHealthy() {
    -    boolean scriptHealthStatus = (nodeHealthScriptRunner == null) ? true
    -        : nodeHealthScriptRunner.isHealthy();
    -    return scriptHealthStatus && dirsHandler.areDisksHealthy();
    +    boolean scriptHealthStatus = nodeHealthScriptRunner == null ||
    --- End diff --

    Done.


> Fail the node, if Linux Container Executor is not configured properly
> ---------------------------------------------------------------------
>
>                 Key: YARN-6302
>                 URL: https://issues.apache.org/jira/browse/YARN-6302
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Miklos Szegedi
>            Assignee: Miklos Szegedi
>            Priority: Minor
>
> We have a cluster that has one node with a misconfigured Linux Container Executor (LCE). Every time an AM or regular container is launched on that node, it fails. The node still has resources available, so it keeps failing apps until the administrator notices the issue and decommissions the node. AM blacklisting only helps if the application is already running.
> As a possible improvement, when the LCE is used on the cluster and an NM gets certain errors back from the LCE, such as error 24 (configuration not found), we should either stop allocating anything on the node or shut the node down entirely. That kind of problem normally does not fix itself, and it means that nothing can really run on that node.
> {code}
> Application application_1488920587909_0010 failed 2 times due to AM Container for appattempt_1488920587909_0010_000002 exited with exitCode: -1000
> Failing this attempt.Diagnostics: Application application_1488920587909_0010 initialization failed (exitCode=24) with output:
> For more detailed output, check the application tracking page: http://node-1.domain.com:8088/cluster/app/application_1488920587909_0010 Then click on links to logs of each attempt.
> . Failing the application.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
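
For readers following the diff in this message, the report-joining logic in getHealthReport() can be sketched in isolation roughly as below. This is only an illustration, not the code from the pull request: the class HealthReportSketch, the methods joinReports and reportException, and the ";" separator value are hypothetical names and values chosen for the example. The diff quoted here is also cut off inside isHealthy(), so the exception check in that method is an assumption about the intent described in the issue (a fatal LCE error should mark the node unhealthy).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the logic in the diff; not the real NodeHealthCheckerService. */
class HealthReportSketch {
  // Assumed value; the actual SEPARATOR constant is not shown in the diff.
  static final String SEPARATOR = ";";

  // Null until a fatal error (for example an LCE configuration failure) is reported.
  private Exception nodeHealthException;

  /** Joins the non-empty reports with SEPARATOR and skips null or empty ones. */
  static String joinReports(String... reports) {
    List<String> nonEmpty = new ArrayList<>();
    for (String report : reports) {
      if (report != null && !report.isEmpty()) {
        nonEmpty.add(report);
      }
    }
    return String.join(SEPARATOR, nonEmpty);
  }

  /** Records a fatal error so that later health checks can see it. */
  void reportException(Exception e) {
    nodeHealthException = e;
  }

  /** Combines script, disk, and exception reports, in that order. */
  String getHealthReport(String scriptReport, String diskReport) {
    String exceptionReport =
        nodeHealthException != null ? nodeHealthException.getMessage() : "";
    return joinReports(scriptReport, diskReport, exceptionReport);
  }

  /** Assumption: a recorded exception makes the node unhealthy. */
  boolean isHealthy(boolean scriptHealthy, boolean disksHealthy) {
    return scriptHealthy && disksHealthy && nodeHealthException == null;
  }
}
```

Collecting the non-empty parts first and joining them once avoids repeating the "is the report still empty" check for every new source, which is what the three if-blocks in the diff do by hand. Unlike the equals("") checks in the diff, this sketch also skips a null message.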