hadoop-hdfs-issues mailing list archives

From "Lei (Eddy) Xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8694) Expose the stats of IOErrors on each FsVolume through JMX
Date Wed, 15 Jul 2015 17:11:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628371#comment-14628371 ]

Lei (Eddy) Xu commented on HDFS-8694:

Thanks for the reviews, [~andrew.wang]

bq. I have a hard time understanding when we should call handle the disk error vs. just bubbling
up, since it bubbles there seems like a danger of handling the same root IOE more than once.
What's the methodology here? Is it possible to move handling to the top-level somewhere? I
can manually examine all the current callsites and callers, but that's not very future-proof.

The reason for calling {{volume#handleIOErrors()}} is that by the time the {{IOE}} bubbles up to the place
where we used to call {{DataNode#checkDiskErrorAsync()}}, the context (which volume the IO was on) is usually
missing. My intention was to call {{volume#handleIOErrors()}} at the highest level that manages
the {{volume}} object's lifetime. I will try to get rid of the {{DataNode#checkDiskErrorAsync()}} call
in a follow-up JIRA.
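
To illustrate the idea of handling the error where the volume is still known, here is a minimal sketch. It is not the code in the patch: {{Volume}}, {{BlockIO}} and {{withVolume()}} are hypothetical stand-ins, and only the method names {{handleIOErrors()}} and {{numOfErrors()}} are borrowed from this discussion. The error is counted exactly once, at the single layer that owns the volume reference, and then rethrown so callers higher up still see it.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

public class VolumeErrorContextSketch {

    /** Hypothetical stand-in for a volume; NOT the real FsVolumeSpi. */
    public static class Volume {
        private final String name;
        private final AtomicLong ioErrors = new AtomicLong();

        public Volume(String name) { this.name = name; }

        public String getName() { return name; }

        /** Records one IO error against this volume. */
        public void handleIOErrors(IOException e) { ioErrors.incrementAndGet(); }

        public long numOfErrors() { return ioErrors.get(); }
    }

    /** Hypothetical functional interface for an IO operation on a block file. */
    public interface BlockIO {
        void run() throws IOException;
    }

    /**
     * Runs the IO while the volume is still in scope. Upper layers must not
     * record the same exception again; they only see the rethrown IOException.
     */
    public static void withVolume(Volume volume, BlockIO io) throws IOException {
        try {
            io.run();
        } catch (IOException e) {
            volume.handleIOErrors(e); // counted once, with volume context
            throw e;                  // still bubbles up to the caller
        }
    }
}
```

Keeping the catch in one wrapper like this is one way to reduce the risk of handling the same root {{IOE}} more than once; whether the real call sites can be funneled through a single place is an assumption of the sketch.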

bq. Since we now have the volume as context, we should really move the disk checker to be
per-volume rather than DN wide. One volume throwing an error is no reason to check all of
them. This can be deferred to a follow-up; I think it's a slam dunk.

Yes. That is the reason for putting {{handleIOErrors()}} into {{FsVolumeSpi}}. I was thinking of
using a per-volume thread to run {{checkDirs()}}, with {{numOfErrors()}} as the trigger. I
will do that in a follow-up JIRA as well.
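
A rough sketch of what such a per-volume checker could look like, under stated assumptions: {{VolumeChecker}}, its threshold parameter, and the {{onIOError()}} method are all hypothetical names, and the {{Runnable}} passed in merely stands in for a per-volume {{checkDirs()}}. Each volume gets its own single worker thread, and a check is scheduled every time the error count reaches a multiple of the threshold, so an error on one volume never triggers checks on the others.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class PerVolumeCheckerSketch {

    /** Hypothetical per-volume checker; one instance per volume. */
    public static class VolumeChecker implements AutoCloseable {
        private final AtomicLong errors = new AtomicLong();
        private final long threshold;
        private final Runnable checkDirs; // stand-in for a per-volume checkDirs()
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        public VolumeChecker(long threshold, Runnable checkDirs) {
            if (threshold <= 0) {
                throw new IllegalArgumentException("threshold must be positive");
            }
            this.threshold = threshold;
            this.checkDirs = checkDirs;
        }

        /**
         * Records one IO error. Returns the Future of the scheduled check when
         * the count hits a multiple of the threshold, otherwise null.
         */
        public Future<?> onIOError() {
            long n = errors.incrementAndGet();
            if (n % threshold == 0) {
                return worker.submit(checkDirs);
            }
            return null;
        }

        public long numOfErrors() { return errors.get(); }

        /** Stops the worker thread; already submitted checks still run. */
        @Override
        public void close() { worker.shutdown(); }
    }
}
```

The threshold policy (every N errors) is only one possible trigger; a time window or rate-based trigger would fit the same shape.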

I am working on the rest of the comments.

Thanks a lot for these great comments.

> Expose the stats of IOErrors on each FsVolume through JMX
> ---------------------------------------------------------
>                 Key: HDFS-8694
>                 URL: https://issues.apache.org/jira/browse/HDFS-8694
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, HDFS
>    Affects Versions: 2.7.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>         Attachments: HDFS-8694.000.patch, HDFS-8694.001.patch
> Currently, once DataNode hits an {{IOError}} when writing / reading block files, it starts
a background {{DiskChecker.checkDirs()}} thread. But if this thread successfully finishes,
DN does not record this {{IOError}}. 
> We need one measurement to count all {{IOErrors}} for each volume.

This message was sent by Atlassian JIRA