hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4584) Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
Date Fri, 27 Mar 2009 21:38:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12690118#action_12690118 ]

Konstantin Shvachko commented on HADOOP-4584:

I am commenting on the design document.
It seems that you can simplify the description of the algorithm. As I understand it, you generate
two reports, memory_report and disk_report, then compare them and produce a (diff) list of suspicious
blocks. They are only suspicious because they differed at the time the reports were generated,
which may no longer be true at the current time. Then, for each suspicious block, you reconcile
it under a lock so that the block state cannot be modified while it is being reconciled.
To simplify the algorithm, you can completely drop the conditions that reflect the state the
block was in when it was chosen as suspicious. That past state is irrelevant, because you still
need to verify the block's present state and act on it rather than on the past one.
I see that the code in fact does exactly that.
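
The simplified description above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class {{ReconcileSketch}}, its methods, and the use of two maps (block id to length) standing in for the in-memory block map and the disk contents are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; none of these names come from the HADOOP-4584 patch.
class ReconcileSketch {
    // Block id -> length; stands in for the DataNode's in-memory block map.
    private final Map<Long, Long> memory = new HashMap<>();
    // Block id -> length; stands in for the block files actually on disk.
    private final Map<Long, Long> disk = new HashMap<>();

    synchronized void putMemory(long id, long len) { memory.put(id, len); }
    synchronized void putDisk(long id, long len) { disk.put(id, len); }
    synchronized Map<Long, Long> memorySnapshot() { return new HashMap<>(memory); }
    synchronized Map<Long, Long> diskSnapshot() { return new HashMap<>(disk); }

    // Step 1: diff the two reports. A block whose entries differ is only
    // a suspect, because both reports may already be stale.
    static Set<Long> suspicious(Map<Long, Long> memReport, Map<Long, Long> diskReport) {
        Set<Long> ids = new TreeSet<>(memReport.keySet());
        ids.addAll(diskReport.keySet());
        ids.removeIf(id -> Objects.equals(memReport.get(id), diskReport.get(id)));
        return ids;
    }

    // Step 2: under the lock, act on the present state only. Why the block
    // was flagged earlier is never consulted.
    synchronized void reconcile(long id) {
        Long onDisk = disk.get(id);
        if (onDisk == null) {
            memory.remove(id);   // no file now: forget the block
            return;
        }
        memory.put(id, onDisk);  // file exists now: record its current length
    }
}
```

Note that {{reconcile()}} takes only the block id. Whatever made the block suspicious is not passed in, so it cannot influence the outcome.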

Other comments:
- I don't think the directory scan interval in hdfs-default.xml should be in hours. Hours
is too coarse a unit. At least for testing, you should be able to run the directory scanner more often.
- The {{DirectoryScanner()}} constructor and {{reconcile()}} should not be public. Please check
for other methods that do not need to be public.
- It is better to give a hint in the override annotation indicating which base class is overridden, e.g.
{{@Override // Object}}
- {{FSDataset.checkAndUpdate()}}: you can make it much more readable by adding return statements
inside the if statements. This will let you drop many else clauses and linearize the code,
making the logic clearer.
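
To illustrate the last two comments, here is a small sketch. It is not the real {{FSDataset.checkAndUpdate()}}: the class names, the three boolean inputs, and the action strings are hypothetical and only show the shape of the refactoring, together with the annotation hint style.

```java
// Hypothetical sketch; the conditions and actions are illustrative only.
class EarlyReturnSketch {
    // Nested form: every outcome is buried in an else branch.
    static String nested(boolean onDisk, boolean inMemory, boolean lengthsMatch) {
        String action;
        if (onDisk) {
            if (inMemory) {
                if (lengthsMatch) {
                    action = "nothing";
                } else {
                    action = "update length";
                }
            } else {
                action = "add to memory";
            }
        } else {
            if (inMemory) {
                action = "remove from memory";
            } else {
                action = "nothing";
            }
        }
        return action;
    }

    // Linearized form: each case returns as soon as it is decided,
    // so no else clauses are needed.
    static String linear(boolean onDisk, boolean inMemory, boolean lengthsMatch) {
        if (!onDisk && !inMemory) return "nothing";
        if (!onDisk) return "remove from memory";
        if (!inMemory) return "add to memory";
        if (!lengthsMatch) return "update length";
        return "nothing";
    }
}

// Hypothetical class showing the suggested annotation hint.
class BlockKeySketch {
    private final long id;

    BlockKeySketch(long id) { this.id = id; }

    @Override // Object
    public String toString() { return "blk_" + id; }
}
```

Both methods return the same action for every combination of inputs; only the control flow differs.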

> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>         Attachments: 4584.brthread.2.patch, 4584.brthread.3.patch, 4584.brthread.3.patch,
4584.brthread.3.patch, 4584.brthread.3.patch, 4584.brthread.3.patch, 4584.hbthread.patch,
4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, 4584.patch, Design.pdf
> Sometimes, due to disk or other problems, a datanode takes minutes or tens of minutes
to generate a block report. This prevents the datanode from sending a heartbeat to the NameNode
every 3 seconds. In the worst case, the NameNode detects a lost heartbeat and wrongly
decides that the datanode is dead.
> It would be nice to have two threads instead. One thread would scan the data directories,
generate the block report, and execute the requests sent by the NameNode; the other thread would
send heartbeats and block reports and pick up the requests from the NameNode. With
these two threads, the sending of heartbeats would not be delayed by a slow block report
or by slow execution of NameNode requests.
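
A minimal sketch of the two-thread idea from the issue description, under stated assumptions: {{HeartbeatSketch}} and its methods are hypothetical, a latch stands in for a slow directory scan, and strings stand in for the messages sent to the NameNode. A real sender would call {{beat()}} in a loop once per heartbeat interval.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; not the DataNode implementation.
class HeartbeatSketch {
    // Hand-off slot: the scanner thread fills it, the sender thread drains it.
    private final AtomicReference<String> pendingReport = new AtomicReference<>();
    // Messages "sent" so far; touched only by the thread that calls beat().
    private final List<String> sent = new ArrayList<>();

    // Scanner thread: may take arbitrarily long without affecting beat().
    Thread startScanner(CountDownLatch slowDisk) {
        Thread t = new Thread(() -> {
            try {
                slowDisk.await();                  // stands in for a slow scan
                pendingReport.set("blockReport");  // publish the finished report
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "scanner");
        t.start();
        return t;
    }

    // Sender: one call per heartbeat interval. It never waits for the
    // scanner; it only forwards a report that is already finished.
    void beat() {
        sent.add("heartbeat");
        String report = pendingReport.getAndSet(null);
        if (report != null) {
            sent.add(report);
        }
    }

    List<String> sent() { return sent; }
}
```

The design choice shown here is that the sender polls a hand-off slot instead of calling into the scanner, so a stalled scan can delay the block report but not the heartbeat.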

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
