hadoop-common-issues mailing list archives

From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15493) DiskChecker should handle disk full situation
Date Mon, 02 Jul 2018 20:36:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530413#comment-16530413 ]

Arpit Agarwal commented on HADOOP-15493:

{quote}I think we have to rely on the system to detect a failed controller/drive. Maybe we
should just attempt to provoke the disk to go read-only. Have the DN periodically write a
file to its storages every n-many mins – but take no action upon failure. Instead rely on
the normal disk check to subsequently discover the disk is read-only.{quote}
When you say 'we have to rely on the system', do you mean the OS?

We saw disk failures (and, rarely, controller failures) go undetected indefinitely. Application
requests would fail and trigger the disk checker, which always succeeded. We had customers hit
data loss after multiple undetected disk failures over a few days.

{quote}I don't think this disk-is-writable check should be in common.{quote}
We can make the write check HDFS-internal. We still need a disk-full check. Perhaps the safest
option is a threshold that avoids false positives and allows false negatives.
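
One possible reading of the threshold idea, as a rough sketch only and not the code in either attached patch: when a write check fails, treat the disk as full (and so not failed) whenever the usable space is below a threshold. The class name, method names, and the 100 MB default are all hypothetical.

```java
import java.io.File;

// Hypothetical helper; none of these names come from Hadoop or the patches.
final class DiskFullSketch {
  // Assumed threshold, chosen only for illustration.
  static final long MIN_FREE_BYTES = 100L * 1024 * 1024;

  // A disk is "likely full" when its usable space is below the threshold.
  static boolean isLikelyFull(long usableBytes, long thresholdBytes) {
    return usableBytes < thresholdBytes;
  }

  // Convenience overload that asks the file system for the usable space.
  static boolean isLikelyFull(File dir) {
    return isLikelyFull(dir.getUsableSpace(), MIN_FREE_BYTES);
  }

  // Mark a disk failed only if the write failed AND the disk does not look full.
  // A nearly full disk that has really failed is missed (false negative),
  // but a merely full disk is never marked failed (no false positive).
  static boolean shouldMarkFailed(boolean writeFailed, long usableBytes,
                                  long thresholdBytes) {
    return writeFailed && !isLikelyFull(usableBytes, thresholdBytes);
  }
}
```

Note that File#getUsableSpace returns 0 when the path does not name a partition, so under this sketch such a path would also be classified as full rather than failed.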

> DiskChecker should handle disk full situation
> ---------------------------------------------
>                 Key: HADOOP-15493
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15493
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HADOOP-15493.01.patch, HADOOP-15493.02.patch
> DiskChecker#checkDirWithDiskIo creates a file to verify that the disk is writable.
> However, the check should not fail when file creation fails because the disk is full. This avoids
> marking full disks as _failed_.
> Reported by [~kihwal] and [~daryn] in HADOOP-15450. 
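
The behaviour described in the issue summary above could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual DiskChecker#checkDirWithDiskIo implementation or the attached patches; WriteCheckSketch, probeWrite, Result, and the minFreeBytes parameter are hypothetical names.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch; not Hadoop's DiskChecker.
final class WriteCheckSketch {
  enum Result { HEALTHY, FULL, FAILED }

  // Writes and syncs one byte to a probe file in dir. If the write fails
  // while usable space is below minFreeBytes, report FULL instead of FAILED.
  static Result probeWrite(File dir, long minFreeBytes) {
    if (!dir.isDirectory()) {
      return Result.FAILED; // a missing directory is never "just full"
    }
    File probe = new File(dir, "disk-check-" + System.nanoTime() + ".tmp");
    try (FileOutputStream out = new FileOutputStream(probe)) {
      out.write(1);
      out.getFD().sync(); // force the byte to the device
      return Result.HEALTHY;
    } catch (IOException e) {
      // Assumption: low usable space is taken as the cause of the failure.
      return dir.getUsableSpace() < minFreeBytes ? Result.FULL : Result.FAILED;
    } finally {
      probe.delete(); // best-effort cleanup of the probe file
    }
  }
}
```

A caller would take a volume out of service only on FAILED and keep a FULL volume available for reads.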

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
