hadoop-hdfs-issues mailing list archives

From "Brian Bockelman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1667) Consider redesign of block report processing
Date Mon, 28 Feb 2011 12:37:36 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000262#comment-13000262 ]

Brian Bockelman commented on HDFS-1667:
---------------------------------------

Hi Matt,

Just a small comment about (1).  I believe it's in our interest to report all the blocks
rather than provide only a differential report.  In my past experience, other systems that
used a similar idea ended up in a big mess.  Basically, if you have unknown software bugs
(every project does), you may have small omissions in the block report which accumulate over
time until the NN has a wildly distorted view of the DN.  By doing a full report, you "reset"
any errors that have accumulated.

Of course, a hybrid approach is often palatable: report the 'diff' every hour and the 'full'
every 4th hour.
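The hybrid schedule above is simple to implement on the datanode side. Here is a minimal
sketch (hypothetical class and method names, not actual DataNode code) of the decision
logic: every report is a diff, except that every fourth one is promoted to a full report:

```java
// Hypothetical sketch of the hybrid schedule described above: hourly
// differential reports, with a full report every 4th hour so that any
// accumulated omissions get reset. Not actual HDFS code.
public class HybridReportScheduler {
    private static final int FULL_REPORT_INTERVAL = 4; // every 4th report is full
    private int reportsSinceFull = 0;

    /** Returns true if the next hourly report should be a full report. */
    public boolean nextReportIsFull() {
        if (reportsSinceFull >= FULL_REPORT_INTERVAL - 1) {
            reportsSinceFull = 0; // full report resets the counter
            return true;
        }
        reportsSinceFull++;
        return false;
    }
}
```

With an interval of 4, the scheduler yields three diff reports followed by one full report,
repeating.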

Brian

> Consider redesign of block report processing
> --------------------------------------------
>
>                 Key: HDFS-1667
>                 URL: https://issues.apache.org/jira/browse/HDFS-1667
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>
> The current implementation of block reporting has the datanode send the entire list of
> blocks resident on that node in every report, hourly.  BlockManager.processReport in the
> namenode runs each block report through reportDiff() first, to build four linked lists of
> replicas for different dispositions, then processes each list.  During that process, every
> block belonging to the node (even the unchanged blocks) is removed and re-linked in that
> node's blocklist.  The entire process happens under the global FSNamesystem write lock, so
> there is essentially zero concurrency.  It takes about 90 milliseconds to process a single
> 50,000-replica block report in the namenode, during which no other read or write activities
> can occur.
> There are several opportunities for improvement in this design.  In order of probable
> benefit, they include:
> 1. Change the datanode to send a differential report.  This saves the namenode from having
> to do most of the work in reportDiff(), and avoids the need to re-link all the unchanged
> blocks during the "diff" calculation.
> 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it can be done
> under a read lock instead of the write lock.  Then only the processing of the lists needs
> to be under the write lock.  Since the number of blocks changed is usually much smaller
> than the number unchanged, this should improve concurrency.
> 3. Eliminate the linked lists and just immediately process each changed block as it is
> read from the block report.  The work on HDFS-1295 indicates that this might save a large
> fraction of the total block processing time at scale, due to the much smaller number of
> objects created and garbage collected during processing of hundreds of millions of blocks.
> 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock().  It currently
> uses linked lists as an argument to processReportedBlock() even though addBlock() only
> processes a single replica on each call.  This should be replaced with a call to
> immediately process the block instead of enqueuing it.
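To make item 1 concrete, a differential report reduces to a set difference between two
snapshots of the datanode's block set: blocks present now but not before are additions,
and blocks present before but not now are deletions. The sketch below (illustrative only,
not actual HDFS code; block IDs are simplified to longs) shows that computation:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of what a differential block report could contain:
// the blocks added and removed since the previous report, computed from
// two snapshots of the datanode's block set. Not actual HDFS code.
public class DiffReport {
    final Set<Long> added = new HashSet<>();
    final Set<Long> removed = new HashSet<>();

    static DiffReport diff(Set<Long> previous, Set<Long> current) {
        DiffReport d = new DiffReport();
        for (Long b : current) {
            if (!previous.contains(b)) d.added.add(b);   // newly received replica
        }
        for (Long b : previous) {
            if (!current.contains(b)) d.removed.add(b);  // deleted replica
        }
        return d;
    }
}
```

The namenode would then only touch the blocks in `added` and `removed`, leaving the
(typically much larger) unchanged set alone.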
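The locking split in item 2 can be sketched with a standard
java.util.concurrent.locks.ReentrantReadWriteLock as a stand-in for the FSNamesystem lock.
All class and method names below are hypothetical, and the classification step is a
placeholder for reportDiff()'s four-list logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of item 2 (not the actual FSNamesystem code).
// Classifying the report runs under the shared read lock, so concurrent
// reads can proceed; only applying the (usually small) change list takes
// the exclusive write lock.
public class SplitLockReportProcessor {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final List<Long> applied = new ArrayList<>(); // records applied changes

    public void processReport(List<Long> reportedBlocks) {
        List<Long> changes;
        lock.readLock().lock();
        try {
            // reportDiff-style classification: read-only against namespace state
            changes = classify(reportedBlocks);
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();
        try {
            // Only the change list is processed under the write lock
            for (Long b : changes) {
                applied.add(b);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    private List<Long> classify(List<Long> blocks) {
        // Placeholder: real code would build the four disposition lists here
        return new ArrayList<>(blocks);
    }
}
```

The payoff depends on the changed/unchanged ratio: if only a few hundred of 50,000 replicas
changed, the write lock is held for a small fraction of the 90 ms cited above.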

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
