hadoop-yarn-dev mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-3797) NodeManager not blacklisting the disk (shuffle) with errors
Date Thu, 11 Jun 2015 09:17:01 GMT
Rajesh Balamohan created YARN-3797:
--------------------------------------

             Summary: NodeManager not blacklisting the disk (shuffle) with errors
                 Key: YARN-3797
                 URL: https://issues.apache.org/jira/browse/YARN-3797
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Rajesh Balamohan


In a multi-node environment, one of the disks on a node (the one where map outputs are written)
went bad. The kernel errors are given below.

{noformat}
Info fld=0x9ad090a
sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
end_request: critical medium error, dev sdf, sector 162334984
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
Info fld=0x9af8892
sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
end_request: critical medium error, dev sdf, sector 162498704
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
Info fld=0x9af8892
sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
end_request: critical medium error, dev sdf, sector 162498704
{noformat}

DiskChecker still passes, because the system allows directories to be created and deleted on
that disk without issues. But the data being served out can be corrupt, so fetchers fail during
CRC verification, which causes unnecessary delays and retries.

Ideally, the NodeManager should detect such errors and blacklist/remove the affected disks.
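
As a rough illustration of the kind of detection meant here, the sketch below counts "critical medium error" lines per device in kernel log output like the excerpt above, and flags a device once a threshold is reached. It is only a sketch under assumptions: MediumErrorScanner, countByDevice, shouldBlacklist and the threshold parameter are hypothetical names, not part of YARN or its DiskChecker, and it assumes the NodeManager could obtain these kernel log lines in the first place.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a YARN API.
class MediumErrorScanner {
    // Matches kernel lines such as:
    //   end_request: critical medium error, dev sdf, sector 162334984
    private static final Pattern MEDIUM_ERROR =
        Pattern.compile("critical medium error, dev (\\w+), sector (\\d+)");

    /** Counts medium-error lines per block device, in first-seen order. */
    static Map<String, Integer> countByDevice(List<String> kernelLogLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : kernelLogLines) {
            Matcher m = MEDIUM_ERROR.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** True once the device has reported at least `threshold` medium errors (assumed policy). */
    static boolean shouldBlacklist(Map<String, Integer> counts, String device, int threshold) {
        return counts.getOrDefault(device, 0) >= threshold;
    }

    public static void main(String[] args) {
        // Lines copied from the log excerpt in this report.
        List<String> sample = Arrays.asList(
            "sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error",
            "end_request: critical medium error, dev sdf, sector 162334984",
            "end_request: critical medium error, dev sdf, sector 162498704");
        Map<String, Integer> counts = countByDevice(sample);
        System.out.println(counts + " blacklist sdf: " + shouldBlacklist(counts, "sdf", 2));
    }
}
```

Mapping a device name such as sdf back to the corresponding NodeManager local directory is left out; that step depends on the mount layout of the node.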




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
