Date: Thu, 11 Jun 2015 09:17:01 +0000 (UTC)
From: "Rajesh Balamohan (JIRA)"
To: yarn-dev@hadoop.apache.org
Subject: [jira] [Created] (YARN-3797) NodeManager not blacklisting the disk (shuffle) with errors

Rajesh Balamohan created YARN-3797:
--------------------------------------

             Summary: NodeManager not blacklisting the disk (shuffle) with errors
                 Key: YARN-3797
                 URL: https://issues.apache.org/jira/browse/YARN-3797
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Rajesh Balamohan

In a multi-node environment, one of the disks on a node (the one where map outputs are written) went bad. The kernel errors are given below.

{noformat}
Info fld=0x9ad090a
sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
end_request: critical medium error, dev sdf, sector 162334984
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
sd 6:0:5:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:5:0: [sdf] Sense Key : Medium Error [current]
Info fld=0x9af8892
sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
end_request: critical medium error, dev sdf, sector 162498704
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
sd 6:0:5:0: [sdf] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:5:0: [sdf] Sense Key : Medium Error [current]
Info fld=0x9af8892
sd 6:0:5:0: [sdf] Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
end_request: critical medium error, dev sdf, sector 162498704
{noformat}

DiskChecker still passes, because the system can create and delete directories on that disk without issues. However, the data served from the disk can be corrupt, so fetchers fail CRC verification, which causes unnecessary delays and retries.

Ideally, the NodeManager should detect such errors and blacklist (remove) the affected disks.
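
To illustrate the gap described in the report above, here is a minimal sketch of a health probe that goes one step beyond creating and deleting directories: it writes a block of random data, forces it to the device, reads it back, and compares CRC32 values. The class `ReadBackProbe` and its `probe` method are hypothetical names introduced for this illustration; they are not part of Hadoop and this is not how the NodeManager checks disks. The sketch also has clear limits: the read may be served from the OS page cache rather than the medium, and a fresh temporary file does not touch the sectors that already hold map output, so a probe like this could still miss the errors shown in the log.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Random;
import java.util.zip.CRC32;

/** Hypothetical helper (not part of Hadoop): write/read-back probe for a local directory. */
final class ReadBackProbe {

    /**
     * Writes {@code size} random bytes to a temp file under {@code dir}, syncs them,
     * reads them back, and returns true only if the CRC32 values match.
     * Any I/O error (including a missing directory) is reported as unhealthy.
     */
    static boolean probe(Path dir, int size) {
        Path file = null;
        try {
            file = Files.createTempFile(dir, "probe-", ".tmp");

            byte[] data = new byte[size];
            new Random().nextBytes(data);
            CRC32 expected = new CRC32();
            expected.update(data);

            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ByteBuffer buf = ByteBuffer.wrap(data);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                ch.force(true); // ask the OS to flush data and metadata to the device
            }

            // Assumption: this read may still be satisfied from the page cache.
            CRC32 actual = new CRC32();
            actual.update(Files.readAllBytes(file));
            return actual.getValue() == expected.getValue();
        } catch (IOException e) {
            return false;
        } finally {
            if (file != null) {
                try {
                    Files.deleteIfExists(file);
                } catch (IOException ignored) {
                    // best-effort cleanup; the probe result is already decided
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempDirectory("probe-demo");
        System.out.println(dir + " healthy: " + probe(dir, 1 << 20));
    }
}
```

Running it with a local directory as the only argument prints whether the round trip succeeded for that directory. A production check would more likely need to combine something like this with signals from the kernel log or from repeated fetch-time CRC failures on the same disk, since those are the symptoms the report actually observed.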