hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tomasz Nykiel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-395) DFS Scalability: Incremental block reports
Date Fri, 01 Jul 2011 20:17:28 GMT

    [ https://issues.apache.org/jira/browse/HDFS-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058766#comment-13058766

Tomasz Nykiel commented on HDFS-395:

To revive this discussion, I would have few comments about block reports.

Processing a single block report has mainly two parts. The first part is to compute the difference
between NN state and the report.
The second stage involves applying the diff to NN - which contains blocksToAdd, blocksToRemove,

1. Now we have explicit Received requests sent form datanodes to the namenode,
hence the number of blocksToAdd should be very small.

2. blocksToInvalidate consist of blocks that do not belong to any file, which is a abnormal
Hence, the number thereof should also be very small.

3. We do not have explicit ACKs for blocks that are deleted from datanodes. Hence, the diff
will contain all such deleted blocks.
If we extend the block report interval to let's say 24 hours, we migh have a huge number of
blocks that have been deleted, and need to be processed.

I think it will be very beneficial to introduce explicit deletion acks.
Since we care more about the blockReceived in general, than blockDeleted, we can send the
deletion acks whenever we have any blockReceived acks to sent at the datanode side.
Otherwise, we send the block deletion acks in some interval.

With this change, the block report interval can be extended to let's say two times the basic

-- Also a small change need to be made for sending blockDeleted ack.
When the block invalidate command comes to the datanode, it should synchronously rename the
block file (to some invalid block name).
Then it can immediately notify the namenode that the block has been deleted.

-- Another improvement can be made on NN side: We can process the entire list of blocks within
the FSNamesystem instead of processing them one-by-one.
This saves some repetitive computation.

I am attaching a diff that introduces these changes.

> DFS Scalability: Incremental block reports
> ------------------------------------------
>                 Key: HDFS-395
>                 URL: https://issues.apache.org/jira/browse/HDFS-395
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: blockReportPeriod.patch
> I have a cluster that has 1800 datanodes. Each datanode has around 50000 blocks and sends
a block report to the namenode once every hour. This means that the namenode processes a block
report once every 2 seconds. Each block report contains all blocks that the datanode currently
hosts. This makes the namenode compare a huge number of blocks that practically remains the
same between two consecutive reports. This wastes CPU on the namenode.
> The problem becomes worse when the number of datanodes increases.
> One proposal is to make succeeding block reports (after a successful send of a full block
report) be incremental. This will make the namenode process only those blocks that were added/deleted
in the last period.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message