hadoop-hdfs-issues mailing list archives

From "Arpit Agarwal (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-5922) DN heartbeat thread can get stuck in tight loop
Date Sun, 23 Feb 2014 01:35:21 GMT

     [ https://issues.apache.org/jira/browse/HDFS-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-5922:

    Attachment: HDFS-5922.01.patch

Hi Aaron, sorry about the delayed response. I was away. Here's a preliminary patch to get
Jenkins results.

The specific bug here could have been avoided by resetting the counter to zero when emptying
the queues. However, it seems unnecessary to maintain an exact count of the pending requests
when all we care about is whether or not there are any requests. The patch replaces the counter
with a boolean.
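
A minimal sketch of the difference between the two bookkeeping styles, for illustration only. The class and method names ({{PendingIbrSketch}}, {{CounterTracker}}, {{FlagTracker}}, {{drainWithoutReset}}) are hypothetical and are not taken from {{BPServiceActor}} or from the attached patch. The failure mode shown, emptying the queue without resetting the counter, is an assumption based on the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration only: these classes and method names are not
// taken from BPServiceActor or from HDFS-5922.01.patch.
class PendingIbrSketch {

    // Counter-based bookkeeping: the count must be kept in step with the queue.
    static class CounterTracker {
        private final Queue<String> pending = new ArrayDeque<>();
        private int pendingCount = 0;

        void add(String blockId) {
            pending.add(blockId);
            pendingCount++;
        }

        // Assumed failure mode: the queue is emptied but the counter is not reset.
        List<String> drainWithoutReset() {
            List<String> drained = new ArrayList<>(pending);
            pending.clear();
            return drained;
        }

        // Stays true after drainWithoutReset(), so a caller polling this
        // would never go back to waiting.
        boolean shouldSendNow() {
            return pendingCount > 0;
        }
    }

    // Flag-based bookkeeping: only "is anything pending?" is tracked.
    static class FlagTracker {
        private final Queue<String> pending = new ArrayDeque<>();
        private boolean sendImmediately = false;

        void add(String blockId) {
            pending.add(blockId);
            sendImmediately = true;
        }

        // Emptying the queue and clearing the flag happen together.
        List<String> drain() {
            List<String> drained = new ArrayList<>(pending);
            pending.clear();
            sendImmediately = false;
            return drained;
        }

        boolean shouldSendNow() {
            return sendImmediately;
        }
    }

    public static void main(String[] args) {
        CounterTracker counter = new CounterTracker();
        counter.add("blk_1");
        counter.drainWithoutReset();
        System.out.println("counter still asks to send: " + counter.shouldSendNow());

        FlagTracker flag = new FlagTracker();
        flag.add("blk_1");
        flag.drain();
        System.out.println("flag still asks to send: " + flag.shouldSendNow());
    }
}
```

In this sketch the counter variant keeps reporting pending work after the queue is empty, while the flag variant has nothing to drift out of sync with the queue.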

Andrew Wang also pointed out offline that it is perhaps incorrect to be subtracting the number
of deleted blocks from pendingReceivedRequests in BPServiceActor#reportReceivedDeletedBlocks,
but the result of that is somewhat less serious, since in that case the worst case is just
that we send a somewhat delayed IBR.
This behavior looks odd, but it was probably by design. {{pendingReceivedRequests}} was not
incremented for deleted requests, to avoid sending an IBR for just deleted blocks before the
timeout interval has elapsed. However, when we failed to send an IBR, we reinserted all pending
entries into the queue and set {{pendingReceivedRequests}} to the count of all pending
requests (deleted + received), presumably to avoid waiting for another timeout interval before

> DN heartbeat thread can get stuck in tight loop
> -----------------------------------------------
>                 Key: HDFS-5922
>                 URL: https://issues.apache.org/jira/browse/HDFS-5922
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.3.0
>            Reporter: Aaron T. Myers
>            Assignee: Arpit Agarwal
>         Attachments: HDFS-5922.01.patch
> We saw an issue recently on a test cluster where one of the DN threads was consuming
> 100% of a single CPU. Running jstack indicated that it was the DN heartbeat thread. I believe
> I've tracked down the cause to a bug in the accounting around the value of {{pendingReceivedRequests}}.
> More details in the first comment.
