hadoop-common-issues mailing list archives

From "Anu Engineer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13738) DiskChecker should perform some disk IO
Date Tue, 25 Oct 2016 18:08:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606010#comment-15606010 ]

Anu Engineer commented on HADOOP-13738:
---------------------------------------

[~arpitagarwal] Thank you for these improvements. I am sure they are going to make detecting
errors on datanodes much easier.

I have some minor comments and questions.
# Not sure what diskchecker.random buys us versus diskchecker.1, diskchecker.2, or diskchecker.3.
These files are always deleted after use, and in a failure case seeing the sequence number
of the diskchecker files might be helpful, so I am not sure why we need random names at all here.
# As [~kihwal] said, I just wanted to think through the failure modes. I can think of 3 distinct
failure cases.
## Not able to create a file at all -- You can try 3 times and give up; maybe, as [~kihwal]
said, it will take us 6 mins to get out.
## Creation works, but I/O and delete fail -- In this case the disk I/O failure is propagated
but the junk files remain. Since the disk checker will flag the disk as having an issue, this case
is not problematic.
## File creation and I/O work, but delete fails -- We seem to be using {{FileUtils.deleteQuietly}};
shouldn't diskchecker be able to detect that the delete operation failed? Also, in this scenario,
if we have both random file names and delete failures, we might create way too many junk files.
If you use dc.1, dc.2, dc.3, we might be able to restrict the junk files to 3.
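
To make the suggestion concrete, here is a rough sketch of what I have in mind. It is not taken from the attached patches: the class name {{SequentialDiskCheck}}, the methods {{checkDir}} and {{writeAndSync}}, and the retry limit of 3 are all hypothetical, and it uses {{java.nio.file.Files.delete}} (which throws on failure) in place of {{FileUtils.deleteQuietly}} so that a failed delete is reported rather than swallowed.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical sketch, not the code from the HADOOP-13738 patches.
final class SequentialDiskCheck {
    // Assumed retry limit; it also bounds the number of leftover files.
    static final int MAX_ATTEMPTS = 3;

    // Tries diskchecker.1 .. diskchecker.3 in order. Returns on the first
    // attempt where create, write, sync, and delete all succeed; otherwise
    // throws with the last failure as the cause.
    static void checkDir(File dir) throws IOException {
        IOException last = null;
        for (int i = 1; i <= MAX_ATTEMPTS; i++) {
            File f = new File(dir, "diskchecker." + i);
            try {
                writeAndSync(f);
                // Files.delete throws if the file cannot be removed,
                // so a delete failure counts as a failed attempt.
                Files.delete(f.toPath());
                return;
            } catch (IOException e) {
                last = e;
            }
        }
        throw new IOException("Disk check failed in " + dir
            + " after " + MAX_ATTEMPTS + " attempts", last);
    }

    // Writes one byte and forces it to the device. Opening an existing
    // file truncates it, so a leftover file from an earlier run is reused.
    static void writeAndSync(File f) throws IOException {
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(new byte[] {1});
            out.getFD().sync();
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("diskcheck").toFile();
        checkDir(dir);
        System.out.println("Disk check passed for " + dir);
    }
}
```

Because the file names are fixed, repeated delete failures can leave at most 3 files behind, and the index of the file that remains shows which attempt failed.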

> DiskChecker should perform some disk IO
> ---------------------------------------
>
>                 Key: HADOOP-13738
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13738
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>         Attachments: HADOOP-13738.01.patch, HADOOP-13738.02.patch, HADOOP-13738.03.patch
>
>
> DiskChecker can fail to detect total disk/controller failures indefinitely. We have seen
> this in real clusters. DiskChecker performs simple permissions-based checks on directories
> which do not guarantee that any disk IO will be attempted.
> A simple improvement is to write some data and flush it to the disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

