hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7923) The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
Date Fri, 29 May 2015 22:23:17 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565561#comment-14565561 ]

Colin Patrick McCabe commented on HDFS-7923:

bq. Missing config key documentation in hdfs-defaults.xml


bq. requestBlockReportLeaseId: empty catch for unregistered node, we could add some more informative
logging rather than relying on the warn below


bq. I discussed the NodeData structure with Colin offline, wondering why we didn't use a standard
Collection. Colin brought up the reason of reducing garbage, which seems valid. I think we
should consider implementing IntrusiveCollection though rather than writing another.

Yes, there will be quite a few of these requests coming in at any given point.  IntrusiveCollection
is an interface rather than an implementation, so I don't think it would help here; it's most
useful when an element needs to be in multiple lists at once, or when you need fancy operations
like finding the list from the element.
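For context, a minimal sketch of the intrusive-list pattern under discussion (the names are illustrative, not the actual {{NodeData}} code): the element embeds its own links, so adding and removing it allocates nothing, which is the garbage-reduction argument.

```java
// Illustrative sketch of an intrusive doubly-linked list. The element carries
// its own prev/next pointers, so insertion and removal allocate no wrapper
// nodes -- unlike, say, java.util.LinkedList, which allocates an Entry per add.
final class LeaseNode {
    long leaseId;
    LeaseNode prev, next;   // intrusive links: no per-entry wrapper object
}

final class LeaseList {
    private final LeaseNode head = new LeaseNode(); // sentinel

    LeaseList() { head.prev = head; head.next = head; }

    void add(LeaseNode n) {            // O(1), allocation-free append
        n.prev = head.prev;
        n.next = head;
        head.prev.next = n;
        head.prev = n;
    }

    void remove(LeaseNode n) {         // O(1) given the element itself
        n.prev.next = n.next;
        n.next.prev = n.prev;
        n.prev = n.next = null;
    }
}
```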

bq. I also asked about putting NodeData into DatanodeDescriptor. Not sure what the conclusion
was on this, it might reduce garbage since we don't need a separate NodeData object.

The locking is easier to understand if all the lease data is inside {{BlockReportLeaseManager}}.

bq. I prefer Precondition checks for invalid configuration values at startup, so there aren't
any surprises for the user. Not everyone reads the messages on startup.
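The kind of check being requested here might look like the following sketch, which mirrors Guava's {{Preconditions.checkArgument}} in plain Java (the field and bound are illustrative, not the patch's actual key or limit):

```java
// Illustrative fail-fast config validation: throw at startup on a nonsensical
// value instead of logging a warning the operator may never read.
final class LeaseConfig {
    final long maxPendingLeases;

    LeaseConfig(long maxPendingLeases) {
        if (maxPendingLeases <= 0) {   // what Preconditions.checkArgument would enforce
            throw new IllegalArgumentException(
                "max pending full-block-report leases must be positive, got "
                + maxPendingLeases);
        }
        this.maxPendingLeases = maxPendingLeases;
    }
}
```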


bq. requestLease has a check for isTraceEnabled, then logs at debug level


bq. In offerService, we ignore the new leaseID if we already have one. On the NN though, a
new request wipes out the old leaseID, and processReport checks based on leaseID rather than
node. This kind of bug makes me wonder why we really need the leaseID at all, why not just
attach a boolean to the node? Or if it's in the deferred vs. pending list?

It's safer for the NameNode to wipe the old lease ID every time there is a new request.  It
avoids problems where the DN went down while holding a lease, and then came back up.  We could
potentially also avoid those problems by being very careful with node (un)registration, but
why make things more complicated than they need to be?  I do think that the DN should overwrite
its old lease ID if the NN gives it a new one, for the same reason.  Let me change it to do
that... Of course this code path should never happen since the NN should never give a new
lease ID when none was requested.  So calling this a "bug" seems like a bit of a stretch.

I prefer IDs to simply checking against the datanode UUID, because lease IDs allow us to match
up the NN granting a lease with the DN accepting and using it, which is very useful for debugging
or understanding what is happening in production.  It also makes it very obvious whether a
DN is "cheating" by sending a block report with leaseID = 0 to disable rate-limiting.  This
is a use-case we want to support but we also want to know when it is going on.
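The lease-matching semantics above can be sketched like this (class and method names are invented for illustration; this is not the actual {{BlockReportLeaseManager}} API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of the lease semantics described above.
final class LeaseSketch {
    private final Map<String, Long> leases = new HashMap<>(); // dnUuid -> leaseId
    private final AtomicLong nextId = new AtomicLong(1);

    // NN side: a new request always replaces any old lease for that DN, so a
    // DN that went down while holding a lease can't wedge itself on restart.
    synchronized long requestLease(String dnUuid) {
        long id = nextId.getAndIncrement();
        leases.put(dnUuid, id);
        return id;
    }

    // NN side: leaseId == 0 bypasses rate-limiting -- supported, but visible.
    synchronized boolean checkLease(String dnUuid, long leaseId) {
        if (leaseId == 0) {
            System.out.println("DN " + dnUuid + " sent an unsolicited FBR");
            return true;
        }
        Long cur = leases.get(dnUuid);
        return cur != null && cur == leaseId;
    }
}
```

Matching on an ID rather than a per-node boolean is what lets the NN tell a stale lease from a current one in the sketch above.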

bq. Can we fix the javadoc for scheduleBlockReport to mention randomness, and not "send...at
the next heartbeat?" Incorrect right now.

I looked pretty far back into the history of this code.  It seems to go back to at least 2009.
The underlying ideas seem to be:
1. the first full block report can have a configurable delay in seconds expressed by {{dfs.blockreport.initialDelay}}
2. the second full block report gets a random delay between 0 and {{dfs.blockreport.intervalMsec}}
3. all other block reports get an interval of {{dfs.blockreport.intervalMsec}} *unless* the
previous block report had a longer interval than expected... if the previous one had a longer
interval than expected, the next one gets a shorter interval.

We can keep behavior #1... it's simple to implement and may be useful for testing (although
I think this patch makes it no longer necessary).

Behavior #2 seems like a workaround for the lack of congestion control in the past.  In a
world where the NN rate-limits full block reports, we don't need this behavior to prevent
FBRs from "clumping".  They will naturally avoid clumping because the NN is rate-limiting them.

Behavior #3 just seems incorrect, even without this patch.  By definition, a full block report
contains all the information the NN needs to understand the DN state.  Just because block
report interval N was longer than expected, seems no reason to shorten block report interval
N+1.  In fact, this behavior seems like it could lead to congestion collapse... if the NN
gets overloaded and can't handle block reports for some time, a bunch of DNs will shorten
the time in between the current block report and the next one, further increasing total NN
load.  Not good.  Not good at all.

I replaced this with a simple "randomize the first block report time within 0 and {{dfs.blockreport.initialDelay}},
then try to do all other block reports after {{dfs.blockreport.intervalMsec}} ms" scheme.  If the
full block report interval was more than 2x what was configured, we whine about it in the
log file (this should only happen if the NN is under extreme load).
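The simplified schedule described above can be sketched as follows (field and method names are illustrative, not the DataNode's actual code): jitter only the first report, use the fixed interval afterwards, and only warn, never shorten, when a report ran long.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the simplified full-block-report schedule.
final class BrSchedule {
    final long intervalMs;   // dfs.blockreport.intervalMsec
    long lastBrMs;           // 0 until the first full block report is sent
    long nextBrMs;           // time at which the next FBR is due

    BrSchedule(long nowMs, long initialDelayMs, long intervalMs) {
        this.intervalMs = intervalMs;
        // First FBR: uniform random delay in [0, initialDelay) to spread DNs out.
        long jitter = initialDelayMs > 0
            ? ThreadLocalRandom.current().nextLong(initialDelayMs) : 0;
        this.nextBrMs = nowMs + jitter;
    }

    // Called once a full block report has been sent.
    void onBlockReportSent(long nowMs) {
        if (lastBrMs != 0 && nowMs - lastBrMs > 2 * intervalMs) {
            // Lagging is only logged; shortening the next interval here is
            // exactly the congestion-collapse behavior being removed.
            System.out.println("FBR interval exceeded 2x the configured "
                + intervalMs + " ms; NN may be under heavy load");
        }
        lastBrMs = nowMs;
        nextBrMs = nowMs + intervalMs;   // fixed interval, never shortened
    }
}
```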

bq. Have you thought about moving the BR scheduler to the NN side? We still rely on the DNs
to jitter themselves and do the initial delay, but we could have the NN handle all this. This
would also let the NN trigger FBRs whenever it wants. We could also do better than random
scheduling, i.e. stride it rather than jitter. Incompatible, so we probably won't, but fun
to think about 

Yeah, we could do more on the NN to ensure fairness and so forth.  I think it's not really
a big problem since datanodes aren't evil, and the existing method of configuring BR period
on the datanode side seems to be working well.  We also tend to assume uniform cluster load
in HDFS, an assumption which makes complex BR scheduling less interesting.  But maybe some

bq. Could we do the BRLManager register/unregister in addDatanode and removeDatanode? I think
this is safe, since on a DN restart it'll provide a lease ID of 0 and FBR, even without a

Seems reasonable.  I moved the BRLM register/unregister calls from registerDatanode into addDatanode
/ removeDatanode.

> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat
> -----------------------------------------------------------------------------------------------
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, HDFS-7923.002.patch, HDFS-7923.003.patch
> The DataNodes should rate-limit their full block reports.  They can do this by first
sending a heartbeat message to the NN with an optional boolean set which requests permission
to send a full block report.  If the NN responds with another optional boolean set, the DN
will send an FBR... if not, it will wait until later.  This can be done compatibly with optional
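The optional-field handshake in the description might look something like this in protobuf (field names and numbers are invented for illustration; the real fields live in the DatanodeProtocol definitions):

```protobuf
// Illustrative only: optional fields keep old DNs and NNs compatible, since
// an absent field simply reads as "not requested" / "no lease granted".
message HeartbeatRequestProto {
  // ... existing fields ...
  optional bool requestFullBlockReportLease = 9 [default = false];
}

message HeartbeatResponseProto {
  // ... existing fields ...
  optional uint64 fullBlockReportLeaseId = 5 [default = 0];
}
```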

This message was sent by Atlassian JIRA
