hadoop-hdfs-issues mailing list archives

From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-395) DFS Scalability: Incremental block reports
Date Fri, 01 Jul 2011 23:35:29 GMT

    [ https://issues.apache.org/jira/browse/HDFS-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058866#comment-13058866 ]

Scott Carey commented on HDFS-395:

> from my earlier experience with developing local file systems (VxFS, ufs, XFS, etc), the cost
> of renaming a file is precisely the same as the cost of deleting a file.

With some file systems, yes.  With others, not at all.

On XFS, deleting is about as fast as renaming, as long as the number of extents is low.

For ext3, the cost to delete is FAR higher than the cost to rename for large files. Make
a 2 GB file on ext3, time how long it takes to rename (almost instant), and then time how long
it takes to remove (long!). This is because ext3 does not have extents and must update the
mapping entry for every block in the file in order to delete it. With 4 KB blocks, that is about
half a million entries (2 GB / 4 KB = 524,288) that must be changed for a 2 GB file.
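
The experiment above can be scripted. This is a minimal sketch, assuming Python 3 and a
directory on the file system under test; the function names and file names are hypothetical
and have nothing to do with HDFS itself. It times the rename and the unlink separately and
repeats the block-count arithmetic.

```python
import os
import time


def block_map_entries(file_bytes, block_bytes=4096):
    """Number of per-block entries a non-extent file system must touch on delete."""
    return -(-file_bytes // block_bytes)  # ceiling division


def time_rename_and_delete(directory, size_bytes, chunk_bytes=1 << 20):
    """Write a file of size_bytes in directory, then time rename and unlink.

    Returns (rename_seconds, unlink_seconds).
    """
    src = os.path.join(directory, "blk_test")          # hypothetical file name
    dst = os.path.join(directory, "blk_test.renamed")  # hypothetical file name
    chunk = b"\0" * chunk_bytes

    # Write real data (not a sparse file) so that blocks are actually allocated.
    with open(src, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(remaining, chunk_bytes)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())

    t0 = time.perf_counter()
    os.rename(src, dst)
    t1 = time.perf_counter()
    os.unlink(dst)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

Calling `time_rename_and_delete("/path/on/ext3", 2 * 1024**3)` on an ext3 mount (the path is a
placeholder) would be expected to show the gap described above, though actual timings depend
on hardware, cache state, and mount options.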

ext4 is faster at this, but still more costly than a move.

I'm not sure either of these operations can be synchronous, and it may be best to batch up
delete acks and handle them asynchronously. Perhaps the block report should ignore 'deletes in progress' when reporting
to the namenode (NN) to avoid that race condition, or list them separately in the block report so the
namenode has an opportunity to act on that information.
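
To make the suggestion concrete, here is a minimal sketch of datanode-side bookkeeping,
assuming Python 3. The class `BlockTracker` and all of its methods are hypothetical
illustrations, not HDFS code or anything from the attached patches. Blocks whose deletion has
started are kept out of the live list and reported separately, and completed deletions are
queued as acks to be sent in a batch.

```python
import threading


class BlockTracker:
    """Hypothetical datanode-side bookkeeping for blocks and in-progress deletes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = set()        # blocks that are present and not being deleted
        self._deleting = set()    # blocks whose deletion has started but not finished
        self._pending_acks = []   # finished deletions not yet acknowledged to the namenode

    def add_block(self, block_id):
        with self._lock:
            self._live.add(block_id)

    def begin_delete(self, block_id):
        """Mark a block as 'delete in progress' before the slow file system delete."""
        with self._lock:
            self._live.remove(block_id)
            self._deleting.add(block_id)

    def finish_delete(self, block_id):
        """Record that the file system delete completed and queue an ack."""
        with self._lock:
            self._deleting.remove(block_id)
            self._pending_acks.append(block_id)

    def block_report(self):
        """Return (live, deleting) so in-progress deletes are listed separately."""
        with self._lock:
            return sorted(self._live), sorted(self._deleting)

    def drain_delete_acks(self):
        """Return all queued delete acks as one batch and clear the queue."""
        with self._lock:
            acks, self._pending_acks = self._pending_acks, []
            return acks
```

Reporting the two lists separately corresponds to the second option above; dropping the
second list from the report would correspond to the first.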

> DFS Scalability: Incremental block reports
> ------------------------------------------
>                 Key: HDFS-395
>                 URL: https://issues.apache.org/jira/browse/HDFS-395
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>         Attachments: blockReportPeriod.patch, explicitDeleteAcks.patch
> I have a cluster that has 1800 datanodes. Each datanode has around 50000 blocks and sends
> a block report to the namenode once every hour. This means that the namenode processes a block
> report once every 2 seconds. Each block report contains all blocks that the datanode currently
> hosts. This makes the namenode compare a huge number of blocks that practically remains the
> same between two consecutive reports. This wastes CPU on the namenode.
> The problem becomes worse when the number of datanodes increases.
> One proposal is to make succeeding block reports (after a successful send of a full block
> report) be incremental. This will make the namenode process only those blocks that were added/deleted
> in the last period.
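
The idea in the issue description can be illustrated with plain set arithmetic. This is a
minimal sketch, assuming Python 3 and block IDs held in sets; the function names are
hypothetical and this is not the approach taken in the attached patches. It also reproduces
the "one report every 2 seconds" figure from the description.

```python
def incremental_report(last_reported, current):
    """Return (added, deleted) block IDs since the last successful report."""
    added = current - last_reported
    deleted = last_reported - current
    return added, deleted


def apply_incremental(namenode_view, added, deleted):
    """Update the namenode's view of one datanode from an incremental report."""
    return (namenode_view | added) - deleted


def seconds_between_reports(datanodes, period_seconds):
    """Average gap between block reports arriving at the namenode."""
    return period_seconds / datanodes


# 1800 datanodes, each reporting once an hour: one report every 2 seconds.
print(seconds_between_reports(1800, 3600))
```

With this scheme the namenode touches only the blocks in `added` and `deleted` rather than
all of the roughly 50000 blocks a datanode hosts.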

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

