hadoop-yarn-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3797) NodeManager not blacklisting the disk (shuffle) with errors
Date Wed, 17 Jun 2015 00:14:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-3797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589070#comment-14589070 ]

Allen Wittenauer commented on YARN-3797:
----------------------------------------

This is the type of problem where one would use the node health check script. 
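
As a rough illustration, a health check script for this case might look like the sketch below. It is not taken from the issue: the regular expression, the function names, and the use of dmesg as the log source are all assumptions made for the example. The only NodeManager behaviour it relies on is that a script configured through yarn.nodemanager.health-checker.script.path marks the node unhealthy when it prints a line beginning with ERROR. Note that this takes the whole node out of scheduling; it does not blacklist the single disk.

```python
#!/usr/bin/env python3
"""Hypothetical NodeManager health check: report disks that logged medium errors."""
import re
import subprocess

# Matches kernel lines like the ones quoted in this issue, e.g.
#   end_request: critical medium error, dev sdf, sector 162334984
MEDIUM_ERROR = re.compile(r"critical medium error, dev (\w+), sector \d+")


def failing_devices(kernel_log):
    """Return the sorted, de-duplicated device names that logged medium errors."""
    return sorted(set(MEDIUM_ERROR.findall(kernel_log)))


def health_line(kernel_log):
    """Build the single status line that the health script prints."""
    devices = failing_devices(kernel_log)
    if devices:
        # A line starting with "ERROR" is what makes the node unhealthy.
        return "ERROR medium errors on: " + ", ".join(devices)
    return "OK"


def main():
    # Assumption: dmesg exists and is readable by the NodeManager user.
    # If it cannot be run at all, fall back to an empty log and report OK.
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
        log = result.stdout
    except OSError:
        log = ""
    print(health_line(log))


if __name__ == "__main__":
    main()
```

With the log excerpt from the description, health_line would report sdf. In practice the kernel ring buffer keeps old entries, so a real script would probably want to limit the check to recent errors before declaring the node unhealthy.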

> NodeManager not blacklisting the disk (shuffle) with errors
> -----------------------------------------------------------
>
>                 Key: YARN-3797
>                 URL: https://issues.apache.org/jira/browse/YARN-3797
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Rajesh Balamohan
>
> In a multi-node environment, one of the disks on a node (the one where map outputs are
> written) went bad. The errors are given below.
> {noformat}
> Info fld=0x9ad090a
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 ad 09 08 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162334984
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> mpt2sas0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
> sd 6:0:5:0: [sdf]  Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> sd 6:0:5:0: [sdf]  Sense Key : Medium Error [current]
> Info fld=0x9af8892
> sd 6:0:5:0: [sdf]  Add. Sense: Unrecovered read error
> sd 6:0:5:0: [sdf] CDB: Read(10): 28 00 09 af 88 90 00 00 08 00
> end_request: critical medium error, dev sdf, sector 162498704
> {noformat}
> Diskchecker would pass, because the system still allows directories to be created and
> deleted without issues. But the data being served can be corrupt, so fetchers fail during
> CRC verification, which causes unwanted delays and retries.
> Ideally, the NodeManager should detect such errors and blacklist/remove those disks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
