hadoop-common-dev mailing list archives

From "Brian Bockelman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4584) Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
Date Tue, 24 Feb 2009 18:33:01 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676355#action_12676355 ]

Brian Bockelman commented on HADOOP-4584:

Hey Raghu,

- Regarding your point above about periodic block verification handling the various things
that can go wrong with a block:  Currently, it's woefully insufficient, especially on large
data nodes, to replace the directory scan.  If we wait 3 weeks (or several months for some
of our large nodes) before we find that a block is missing, we're going to see lots and lots
of issues crop up!

- I have seen the 'rm -r' in practice, by the way :).

- With a reasonable block size, our 48TB servers take only a few minutes for a scan, with
no heartbeats lost.  That said, I do like your argument that the DN should handle things
to the best of its abilities and not die.

I like the idea of the patch, but only if it's combined with an occasional offline scan (even
once a day!).  Creeping inconsistency bugs in the NN seem to make very accurate block reports
a precious commodity, one that I'd gladly pay an expensive scan for (though I agree that once
an hour is probably excessive).

> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>         Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch
> Sometimes, due to disk or other problems, a datanode takes minutes or tens of minutes
> to generate a block report. This prevents the datanode from sending a heartbeat to the
> NameNode every 3 seconds. In the worst case, the NameNode detects a lost heartbeat and
> wrongly decides that the datanode is dead.
> It would be nice to have two threads instead. One thread scans the data directories,
> generates the block report, and executes the requests sent by the NameNode. The other
> thread sends heartbeats and block reports and picks up requests from the NameNode. With
> these two threads, the sending of heartbeats will not be delayed by a slow block report
> or by slow execution of NameNode requests.
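
For illustration only, the two-thread split proposed in the issue description could be
sketched roughly as follows. This is a toy Python model, not the Hadoop DataNode code or
the attached patch: the class DataNodeSketch, the helper run_demo, the scan_fn/send_fn
callbacks, and the shortened intervals are all assumptions made up for the example.

```python
import queue
import threading
import time


class DataNodeSketch:
    """Toy model (hypothetical): one thread scans, another talks to the NameNode."""

    def __init__(self, scan_fn, send_fn, heartbeat_interval=3.0):
        self.scan_fn = scan_fn            # assumed callback: returns a block report
        self.send_fn = send_fn            # assumed callback: delivers one message
        self.heartbeat_interval = heartbeat_interval
        self.reports = queue.Queue()      # hand-off from scanner to sender
        self.stop_event = threading.Event()
        self.threads = [
            threading.Thread(target=self._scanner_loop, daemon=True),
            threading.Thread(target=self._sender_loop, daemon=True),
        ]

    def _scanner_loop(self):
        # The slow directory scan lives here, so it cannot delay heartbeats.
        while not self.stop_event.is_set():
            self.reports.put(self.scan_fn())

    def _sender_loop(self):
        # Send a heartbeat every interval; forward a block report if one is ready.
        while not self.stop_event.is_set():
            self.send_fn(("heartbeat", None))
            try:
                self.send_fn(("block_report", self.reports.get_nowait()))
            except queue.Empty:
                pass
            self.stop_event.wait(self.heartbeat_interval)

    def start(self):
        for t in self.threads:
            t.start()

    def stop(self):
        self.stop_event.set()
        for t in self.threads:
            t.join()


def run_demo(scan_seconds=0.3, heartbeat_interval=0.05, run_seconds=0.6):
    """Run the sketch briefly and return the list of messages 'sent'."""
    log = []

    def slow_scan():
        time.sleep(scan_seconds)          # stands in for a slow disk scan
        return ["blk_1", "blk_2"]         # made-up block names

    dn = DataNodeSketch(slow_scan, log.append, heartbeat_interval)
    dn.start()
    time.sleep(run_seconds)
    dn.stop()
    return log


if __name__ == "__main__":
    for kind, payload in run_demo():
        print(kind, payload)
```

In this sketch heartbeats keep flowing while the scan is still running, and the block
report is sent on the first heartbeat cycle after the scan completes. The queue hand-off
is just one possible design choice; the actual patch may coordinate the threads differently.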

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
