hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7923) The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
Date Tue, 28 Apr 2015 00:36:08 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516053#comment-14516053

Colin Patrick McCabe commented on HDFS-7923:

Thanks, [~clamb].  I like this approach.  It avoids sending the block report until the NN
requests it.  So we don't have to throw away a whole block report to achieve backpressure.

  public static final String  DFS_NAMENODE_MAX_CONCURRENT_BLOCK_REPORTS_KEY = "dfs.namenode.max.concurrent.block.reports";
It seems like this should default to something less than the default number of RPC handler
threads, not to MAX_INT.  Given that dfs.namenode.handler.count = 10, it seems like this should
be no more than 5 or 6, right?  The main point here to avoid having the NN handler threads
completely choked with block reports, and that is defeated if the value is MAX_INT.  I realize
that you probably intended this to be configured.  But it seems like we should have a reasonable
default that works for most people.

--- hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
@@ -195,6 +195,7 @@ message HeartbeatRequestProto {
   optional uint64 cacheCapacity = 6 [ default = 0 ];
   optional uint64 cacheUsed = 7 [default = 0 ];
   optional VolumeFailureSummaryProto volumeFailureSummary = 8;
+  optional bool requestSendFullBlockReport = 9;
Let's have a {{\[default = false\]}} here so that we don't have to add a bunch of clunky {{HasFoo}}
checks.  Unless there is something we'd like to do differently in the "false" and "not present"
cases, but I can't think of what that would be.

+  /* Number of block reports currently being processed. */
+  private final AtomicInteger blockReportProcessingCount = new AtomicInteger(0);
I'm not sure an {{AtomicInteger}} makes sense here.  We only modify this variable (write to
it) when holding the FSN lock in write mode, right?  And we only read from it when holding
the FSN in read mode.  So, there isn't any need to add atomic ops.

+      boolean okToSendFullBlockReport = true;
+      if (requestSendFullBlockReport &&
+          blockManager.getBlockReportProcessingCount() >=
+              maxConcurrentBlockReports) {
+        /* See if we should tell DN to back off for a bit. */
+        final long lastBlockReportTime = blockManager.getDatanodeManager().
+            getDatanode(nodeReg).getLastBlockReportTime();
+        if (lastBlockReportTime > 0) {
+          /* We've received at least one block report. */
+          final long msSinceLastBlockReport = now() - lastBlockReportTime;
+          if (msSinceLastBlockReport < maxBlockReportDeferralMsec) {
+            /* It hasn't been long enough to allow a BR to pass through. */
+            okToSendFullBlockReport = false;
+          }
+        }
+      }
+      return new HeartbeatResponse(cmds, haState, rollingUpgradeInfo,
+          okToSendFullBlockReport);
There is a TOCTOU (time of check, time of use) race condition here, right?  1000 datanodes
come in and ask me whether it's ok to send an FBR.  In each case, I check the number of ongoing
FBRs, which is 0, and say "yes."  Then 1000 FBRs arrive all at once and the NN melts down.

I think we need to track which datanodes we gave the "green light" to, and not decrement the
counter until they either send that report, or some timeout expires.  (We need the timeout
in case datanodes go away after requesting permission-to-send.)  The timeout can probably
be as short as a few minutes.  If you can't manage to send an FBR in a few minutes, there's
more problems going on.

  public static final String  DFS_BLOCKREPORT_MAX_DEFER_MSEC_KEY = "dfs.blockreport.max.deferMsec";
  public static final long    DFS_BLOCKREPORT_MAX_DEFER_MSEC_DEFAULT = Long.MAX_VALUE;
Do we really need this config key?  It seems like we added it because we wanted to avoid starvation
(i.e. the case where a given DN never gets given the green light).  But we are maintaining
the last FBR time for each DN anyway.  Surely we can just have a TreeMap or something and
just tell the guys with the oldest {{lastSentTime}} to go.  There aren't an infinite number
of datanodes in the cluster, so eventually everyone will get the green light.

I really would prefer not to have this tunable at all, since I think it's unnecessary.  In
any case, it's certainly doing us no good as MAX_U64.

> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat
> -----------------------------------------------------------------------------------------------
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch
> The DataNodes should rate-limit their full block reports.  They can do this by first
sending a heartbeat message to the NN with an optional boolean set which requests permission
to send a full block report.  If the NN responds with another optional boolean set, the DN
will send an FBR... if not, it will wait until later.  This can be done compatibly with optional

This message was sent by Atlassian JIRA

View raw message