hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3751) DN should log warnings for lengthy disk IOs
Date Thu, 02 Aug 2012 16:30:02 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427421#comment-13427421 ]

Todd Lipcon commented on HDFS-3751:
-----------------------------------

Hey Bobby. We recently added metrics for these timings (HDFS-3170) and now calculate quantiles
for them as well (HDFS-3650). I agree it would be nice to track them dynamically per mount,
but I think that's a bit more complicated than the simple warning proposed here.

We used a hacked-up version of this proposed patch on a customer workload, and even the really
simple logging was super helpful. Most people already have a way of grepping logs for certain
key warning messages to trigger alerts, so even without Hadoop-side support for aggregating
and counting the metrics, I think this should go in. Then let's file a separate JIRA to collect
per-disk metrics using the metrics2 dynamic metrics support.
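
To illustrate the log-grepping approach mentioned above, here is a minimal sketch that counts slow-IO warnings per volume from DataNode log lines. It assumes the warning text follows the example wording in the issue description quoted below; the actual message format is not settled in this thread, and the class name SlowIoLogCounter and its regular expression are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts slow-IO warnings per volume in log lines. */
final class SlowIoLogCounter {
    // Assumed message shape, copied from the example in the issue description:
    // "IO of 64kb to volume /data/1/dfs/dn for block 12345234 client 1.2.3.4 took 61.3 seconds"
    private static final Pattern SLOW_IO = Pattern.compile(
        "IO of \\S+ to volume (\\S+) for block \\d+ client \\S+ took ([0-9.]+) seconds");

    /** Returns a map from volume path to the number of matching warnings. */
    static Map<String, Integer> countByVolume(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = SLOW_IO.matcher(line);
            if (m.find()) {
                // group(1) is the volume path captured by the pattern
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

An alerting script could run something like this over recent log lines and flag any volume whose count exceeds a site-specific limit.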
                
> DN should log warnings for lengthy disk IOs
> -------------------------------------------
>
>                 Key: HDFS-3751
>                 URL: https://issues.apache.org/jira/browse/HDFS-3751
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>    Affects Versions: 1.2.0, 2.1.0-alpha
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>
> Occasionally failing disks or other OS-and-below issues cause a single IO to take tens
> of seconds, or even minutes in the case of failures. This often results in timeout exceptions
> at the client side which are hard to diagnose. It would be easier to root-cause these issues
> if the DN logged a WARN like "IO of 64kb to volume /data/1/dfs/dn for block 12345234 client
> 1.2.3.4 took 61.3 seconds" or somesuch.
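
A rough sketch of the kind of check the description asks for: time an IO and emit a warning when it exceeds a threshold. This is not the HDFS-3751 patch. The class SlowIoWarner, its method names, and the threshold are hypothetical, and java.util.logging stands in for whatever logger the DataNode actually uses. The message text follows the example wording in the description.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

/** Hypothetical helper: warns when a single disk IO takes too long. */
final class SlowIoWarner {
    private static final Logger LOG = Logger.getLogger(SlowIoWarner.class.getName());

    // Assumed threshold; the issue does not specify a value.
    private final long thresholdNanos;

    SlowIoWarner(long thresholdMillis) {
        this.thresholdNanos = TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** Builds the warning text using the example wording from the issue. */
    static String formatWarning(long bytes, String volume, long blockId,
                                String client, long elapsedNanos) {
        double seconds = elapsedNanos / 1e9;
        return String.format(Locale.ROOT,
            "IO of %dkb to volume %s for block %d client %s took %.1f seconds",
            bytes / 1024, volume, blockId, client, seconds);
    }

    /** Logs and returns the warning if the IO was slow; returns null otherwise. */
    String check(long bytes, String volume, long blockId,
                 String client, long elapsedNanos) {
        if (elapsedNanos <= thresholdNanos) {
            return null;
        }
        String msg = formatWarning(bytes, volume, blockId, client, elapsedNanos);
        LOG.warning(msg);
        return msg;
    }

    /** Runs the IO, measures it with a monotonic clock, then applies check(). */
    String timed(Runnable io, long bytes, String volume, long blockId, String client) {
        long start = System.nanoTime();
        io.run();
        return check(bytes, volume, blockId, client, System.nanoTime() - start);
    }
}
```

Using System.nanoTime() rather than wall-clock time keeps the measurement unaffected by clock adjustments, which matters when the goal is to spot multi-second stalls.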

