hadoop-hdfs-issues mailing list archives

From "Andy Isaacson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3590) Print a WARN if the edit log sync period takes more than X time units
Date Mon, 02 Jul 2012 21:28:39 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405333#comment-13405333 ]

Andy Isaacson commented on HDFS-3590:
-------------------------------------

I'm +1 on the concept of logging a message when IO is slow; I've used such log messages successfully
in the past to diagnose system problems.

At 5 seconds we'll see lots of log messages from systems whose IO is just generally slow.
It only takes 500 requests queued in front of you (at roughly 10 ms each) to delay you for
5 seconds, or just one media error with firmware retry.  This is fine as a log message (it
helps diagnose slowness), but a 5 second delay does not justify a warning or error.

At 60 seconds we would probably not see any false positives and a warning or error would be
reasonable.
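
As a rough illustration of the two thresholds discussed above, here is a minimal Java sketch.
The class name, the enum, and the use of INFO for the lower level are my own assumptions for
illustration and are not taken from the HDFS code; the 10 ms per request figure is likewise
just the value implied by 500 requests adding up to 5 seconds.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; names and levels are illustrative, not from HDFS. */
final class SlowSyncLevels {
    enum Level { NONE, INFO, WARN }

    // Thresholds from the discussion: 5 s merits a plain log message,
    // 60 s merits a warning.
    static final long INFO_THRESHOLD_MS = TimeUnit.SECONDS.toMillis(5);
    static final long WARN_THRESHOLD_MS = TimeUnit.SECONDS.toMillis(60);

    /** Maps an elapsed sync time in milliseconds to a log level. */
    static Level classify(long elapsedMs) {
        if (elapsedMs >= WARN_THRESHOLD_MS) {
            return Level.WARN;
        }
        if (elapsedMs >= INFO_THRESHOLD_MS) {
            return Level.INFO;
        }
        return Level.NONE;
    }

    public static void main(String[] args) {
        // 500 queued requests at an assumed 10 ms each add up to 5 s.
        long queuedDelayMs = 500 * 10L;
        System.out.println(queuedDelayMs + " ms -> " + classify(queuedDelayMs));
        System.out.println("61000 ms -> " + classify(61_000));
    }
}
```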

The message should be rate-limited (you don't want your log messages to generate additional
IO load causing the problem to get worse) and should include the actual elapsed time to 1ms
accuracy if possible.
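
To make the rate-limiting suggestion concrete, here is a minimal Java sketch of what I have in
mind. Everything in it is hypothetical: the class name, the message format, the 30 second
minimum interval, and the volume path are assumptions for illustration, not existing HDFS code.
It returns the message text instead of calling a logger so that it stays self-contained.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a rate-limited slow-sync message; not HDFS code. */
final class RateLimitedSlowSyncLog {
    private final long thresholdMs;     // syncs faster than this are ignored
    private final long minIntervalMs;   // at most one message per interval
    private final LongSupplier clockMs; // injectable clock, in milliseconds
    private boolean loggedBefore;
    private long lastLoggedMs;
    private long suppressed;

    RateLimitedSlowSyncLog(long thresholdMs, long minIntervalMs, LongSupplier clockMs) {
        this.thresholdMs = thresholdMs;
        this.minIntervalMs = minIntervalMs;
        this.clockMs = clockMs;
    }

    /** Returns the message to log, or null if nothing should be logged. */
    synchronized String onSyncFinished(String volume, long elapsedMs) {
        if (elapsedMs < thresholdMs) {
            return null;
        }
        long now = clockMs.getAsLong();
        if (loggedBefore && now - lastLoggedMs < minIntervalMs) {
            suppressed++; // count it, but generate no extra log IO
            return null;
        }
        String msg = "Slow edit log sync on " + volume + ": took " + elapsedMs + " ms";
        if (suppressed > 0) {
            msg += "; suppressed since last message: " + suppressed;
        }
        loggedBefore = true;
        lastLoggedMs = now;
        suppressed = 0;
        return msg;
    }

    public static void main(String[] args) {
        long[] fakeNowMs = {0}; // fake clock so the demo is deterministic
        RateLimitedSlowSyncLog log =
            new RateLimitedSlowSyncLog(60_000, 30_000, () -> fakeNowMs[0]);
        String volume = "/data/nn-nfs1"; // hypothetical volume path
        System.out.println(log.onSyncFinished(volume, 61_234)); // first slow sync
        fakeNowMs[0] = 10_000;
        System.out.println(log.onSyncFinished(volume, 70_000)); // within interval
        fakeNowMs[0] = 40_000;
        System.out.println(log.onSyncFinished(volume, 65_000)); // interval elapsed
    }
}
```

In real code the elapsed time would more likely come from the difference of two
System.nanoTime() readings taken around the sync, converted to milliseconds, since that clock
is meant for measuring elapsed time.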
                
> Print a WARN if the edit log sync period takes more than X time units
> ---------------------------------------------------------------------
>
>                 Key: HDFS-3590
>                 URL: https://issues.apache.org/jira/browse/HDFS-3590
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: Harsh J
>            Priority: Minor
>
> If a logSync operation, which happens for calls such as FS#create() after the edit has
been made at the NN metadata, takes longer than X seconds (I'd say if it took more than a
minute, there's something really wrong with the volume it probably got stuck on), we should
log a WARN with the volume that may have particularly caused it. This helps track down, if
an NN runs with multiple NFS volumes, which particular volume may have caused it, as there
are no per-NN-dir metrics of any kind.
> I ran into a situation today where a hard-mounted NFS point hung for over X minutes but
there was no indication in NN's logs after it recovered (recovering so late caused its own
slew of issues for which I'll file other improvement JIRAs) that such an event happened, aside
from the Sync (Journal Sync) metric spiking with the elapsed sync time value rising. A log
would have helped save time investigating this, and possibly would have also pin-pointed the
bad location more accurately.


        
