hadoop-common-dev mailing list archives

From "Suresh Srinivas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4584) Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
Date Fri, 06 Feb 2009 19:07:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671244#action_12671244 ]

Suresh Srinivas commented on HADOOP-4584:
-----------------------------------------

1. {{dataAvailable.wait (time till next report);}}

This is not done because the loop currently exits only when {{shouldRun}} is set to {{false}} or {{shutdown()}}
is called. Assuming that "time till next report" means the time till the next block report, {{offerService()}}
would not return for a long time. This delays shutdown in some cases and causes a few unit test failures
in tests that assume the datanode shuts down quickly. Alternatively, we could notify
{{dataAvailable}} when {{shouldRun}} is set to false, but I think that makes the code quite
ugly. Hence the 1-second {{wait}} time.
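A minimal Java sketch of the bounded wait described in item 1. It assumes a single monitor named {{dataAvailable}} and a volatile {{shouldRun}} flag, as in the comment. The class {{OfferServiceLoop}}, the {{pendingWork}} flag and the {{signalWork()}} method are hypothetical names for illustration, not the actual DataNode code. Capping each wait at 1 second lets {{shutdown()}} take effect without any notify.

```java
// Hypothetical sketch of the bounded-wait loop. OfferServiceLoop,
// pendingWork and WAIT_MS are illustrative names, not DataNode code.
class OfferServiceLoop {
    // Upper bound on a single wait: a cleared shouldRun flag is noticed
    // within roughly this long, even if nothing calls notifyAll().
    static final long WAIT_MS = 1000L;

    private final Object dataAvailable = new Object();
    private volatile boolean shouldRun = true;
    private boolean pendingWork = false; // guarded by dataAvailable

    void offerService() throws InterruptedException {
        while (shouldRun) {
            synchronized (dataAvailable) {
                if (!pendingWork) {
                    // Wait at most WAIT_MS rather than the full time till the
                    // next block report; notifyAll() ends the wait early.
                    dataAvailable.wait(WAIT_MS);
                }
                pendingWork = false;
            }
            // ... send a heartbeat or block report here if one is due ...
        }
    }

    // Called by producers (e.g. when a block is received) to end the wait early.
    void signalWork() {
        synchronized (dataAvailable) {
            pendingWork = true;
            dataAvailable.notifyAll();
        }
    }

    // Only clears the flag; offerService() returns within about WAIT_MS.
    void shutdown() {
        shouldRun = false;
    }
}
```

The alternative mentioned in the comment would call {{dataAvailable.notifyAll()}} from {{shutdown()}}. That would make the exit immediate, but it ties {{shutdown()}} to the monitor used by the service loop.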

2. Using {{dataAvailable}} for synchronizing {{commandQ}} and {{receivedBlockList}}
I think {{dataAvailable}} is currently used to wake up a thread that is waiting for either a command
or a received block. Data can be added to {{commandQ}} and {{receivedBlockList}} independently of each other.
Synchronizing both on {{dataAvailable}} would assume that a command cannot be added while a received block is being added to {{receivedBlockList}}.
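The following sketch shows the arrangement described in item 2. It assumes each list is guarded by its own lock and that {{dataAvailable}} is used only to wake the waiting thread. {{PendingWork}}, {{addCommand()}}, {{addReceivedBlock()}} and {{awaitWork()}} are hypothetical names, and {{String}} stands in for the real element types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: PendingWork and its method names are illustrative;
// String stands in for the real command and block types.
class PendingWork {
    private final List<String> commandQ = new ArrayList<>();          // guarded by commandQ
    private final List<String> receivedBlockList = new ArrayList<>(); // guarded by receivedBlockList
    private final Object dataAvailable = new Object();                // used only for wake-ups

    void addCommand(String cmd) {
        synchronized (commandQ) {
            commandQ.add(cmd);
        }
        wakeUp(); // list lock already released, so adders never hold two locks
    }

    void addReceivedBlock(String block) {
        synchronized (receivedBlockList) {
            receivedBlockList.add(block);
        }
        wakeUp();
    }

    private void wakeUp() {
        synchronized (dataAvailable) {
            dataAvailable.notifyAll();
        }
    }

    boolean hasWork() {
        synchronized (commandQ) {
            if (!commandQ.isEmpty()) {
                return true;
            }
        }
        synchronized (receivedBlockList) {
            return !receivedBlockList.isEmpty();
        }
    }

    // Waits up to maxWaitMs for a command or a received block. Checking
    // hasWork() while holding dataAvailable means a notify cannot slip in
    // between the check and the wait.
    void awaitWork(long maxWaitMs) throws InterruptedException {
        synchronized (dataAvailable) {
            if (!hasWork()) {
                dataAvailable.wait(maxWaitMs);
            }
        }
    }

    List<String> drainCommands() {
        return drain(commandQ);
    }

    List<String> drainReceivedBlocks() {
        return drain(receivedBlockList);
    }

    // Copies and clears a list under its own lock.
    private static List<String> drain(List<String> list) {
        synchronized (list) {
            List<String> out = new ArrayList<>(list);
            list.clear();
            return out;
        }
    }
}
```

Adders release their list lock before taking {{dataAvailable}}. The waiter checks both lists while holding {{dataAvailable}}. As a result, a command and a received block can be added concurrently without lock-order deadlock and without a lost wake-up.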



> Slow generation of blockReport at DataNode causes delay of sending heartbeat to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.patch, 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or other problems, the datanode takes minutes or tens of minutes
> to generate a block report. This prevents the datanode from sending a heartbeat to the NameNode
> every 3 seconds. In the worst case, the NameNode detects a lost heartbeat and wrongly
> decides that the datanode is dead.
> It would be nice to have two threads instead. One thread scans the data directories,
> generates block reports, and executes the requests sent by the NameNode; the other thread
> sends heartbeats and block reports and picks up the requests from the NameNode. With
> these two threads, sending heartbeats will not be delayed by a slow block report
> or by slow execution of NameNode requests.
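The two-thread proposal above can be sketched as follows. This sketch assumes the scanner hands each finished report to the sender through a single-slot {{AtomicReference}}. The class, the thread names and the simulated timings are hypothetical, and the first thread's execution of NameNode requests is omitted. A slow scan delays only the report, not the heartbeats.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the proposed two-thread split. Thread names, the
// AtomicReference hand-off and the simulated timings are assumptions for
// illustration, not DataNode code.
class TwoThreadDataNodeSketch {
    final AtomicInteger heartbeatsSent = new AtomicInteger();
    final AtomicInteger reportsSent = new AtomicInteger();
    // Hand-off slot: the scanner thread publishes a report, the sender thread takes it.
    private final AtomicReference<String> pendingReport = new AtomicReference<>();
    private volatile boolean shouldRun = true;
    private final long heartbeatIntervalMs;
    private final long blockReportCostMs;

    TwoThreadDataNodeSketch(long heartbeatIntervalMs, long blockReportCostMs) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.blockReportCostMs = blockReportCostMs;
    }

    // Thread 1: scans data directories and builds the block report
    // (simulated by sleeping for blockReportCostMs).
    private void scanDirectories() {
        try {
            Thread.sleep(blockReportCostMs);
            pendingReport.set("block report");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // scan abandoned on shutdown
        }
    }

    // Thread 2: sends a heartbeat every interval and forwards any finished
    // report; it never waits for the scan itself.
    private void sendHeartbeats() {
        while (shouldRun) {
            heartbeatsSent.incrementAndGet(); // stands in for the heartbeat RPC
            if (pendingReport.getAndSet(null) != null) {
                reportsSent.incrementAndGet(); // stands in for the block report RPC
            }
            try {
                Thread.sleep(heartbeatIntervalMs);
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    // Runs both threads for the given time, then stops them.
    void runFor(long millis) throws InterruptedException {
        Thread scanner = new Thread(this::scanDirectories, "scanner");
        Thread sender = new Thread(this::sendHeartbeats, "heartbeat-sender");
        scanner.start();
        sender.start();
        Thread.sleep(millis);
        shouldRun = false;
        sender.interrupt();
        scanner.interrupt();
        sender.join();
        scanner.join();
    }
}
```

With a 20 ms heartbeat interval and a 10 s simulated scan, heartbeats keep flowing while the report is still pending. In the single-threaded design, the heartbeat would wait for the scan to finish.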

-- 
This message is automatically generated by JIRA.

