hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7923) The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
Date Tue, 16 Jun 2015 20:15:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588663#comment-14588663 ]

Colin Patrick McCabe edited comment on HDFS-7923 at 6/16/15 8:14 PM:
---------------------------------------------------------------------

Starvation is not a real concern here.  Imagine a 1000-node cluster where full block reports are 6 hours apart.  Then the NN needs to be able to handle 2.7 full block reports a minute.  If each one takes 500 ms (we'll be pessimistic), then 1.35 out of every 60 seconds is FBR time, or 2.3% of the time.  If you want to be even more pessimistic and assume block reports are 1 hour apart rather than 6, just multiply that number by 6 to get 13.8% of the time.
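
For the record, here's that arithmetic as a tiny self-contained sketch (same assumptions as above: 1000 DNs, a 6-hour FBR interval, 500 ms per report):

{code:java}
// Back-of-envelope estimate: what fraction of NN time goes to FBRs?
// Assumptions from above: 1000 DataNodes, one FBR per DN every 6 hours,
// a pessimistic 500 ms of NN time per report.
public class FbrLoadEstimate {
  public static void main(String[] args) {
    int dataNodes = 1000;
    double intervalMinutes = 6 * 60;   // 6 hours between FBRs per DN
    double fbrSeconds = 0.5;           // pessimistic cost per report

    double reportsPerMinute = dataNodes / intervalMinutes;        // ~2.78 (rounded to 2.7 above)
    double busySecondsPerMinute = reportsPerMinute * fbrSeconds;  // ~1.39
    double fraction = busySecondsPerMinute / 60.0;                // ~0.023

    System.out.printf("%.2f FBRs/min => %.1f%% of NN time on FBRs%n",
        reportsPerMinute, fraction * 100);
  }
}
{code}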

For starvation to happen, you'd have to be spending close to 100% of the time on full block
reports.  That's just not going to happen.  And if it does happen, you have bigger problems,
like not being able to actually do anything on the NameNode (since you're spending all your
time on FBRs, which hold the FSN write lock).

Even if you were spending close to 100% of the time on full block reports, the existing code
doesn't enforce fairness... I can configure one DN to send full block reports every 30 minutes,
and configure everyone else to send every 10 hours.  The FBR period is a datanode-side configuration,
not a NN-side one.
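
To make that concrete, here's a sketch of how the period is read from the local node's configuration (the config keys are the real ones; the surrounding code is illustrative):

{code:java}
// Sketch: the FBR period comes from the *DataNode's own* config files, so
// nothing stops two DNs in the same cluster from using different intervals.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ShowBlockReportInterval {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();  // loads this node's hdfs-site.xml
    long intervalMs = conf.getLong(
        DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY,       // "dfs.blockreport.intervalMsec"
        DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_DEFAULT);  // 6 hours by default
    System.out.println("This DN will send an FBR every " + intervalMs + " ms");
  }
}
{code}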

This change is really helpful during startup on big clusters.  In the past we have seen restarting all the DNs at once on a several-hundred-node cluster bring the NN to its knees.  All of the RPC handlers get flooded with FBRs, but only one can make progress at a time.  The flood of FBRs also triggers full GCs, since we can't handle them in a timely fashion and they enter the old gen.  I realize that {{dfs.blockreport.initialDelay}} was designed as a workaround, but it is difficult to know what value to set it to, it results in slower startup, and it is often overlooked in real-world deployments.
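
For context, all {{dfs.blockreport.initialDelay}} buys you is a randomized offset for each DN's first report.  Simplified (this is a sketch, not the actual BPServiceActor code, and the real key is interpreted in seconds):

{code:java}
// Simplified sketch of the initialDelay workaround: each DN picks a random
// offset in [0, initialDelay) for its *first* FBR, smearing the startup burst
// across that window.  Subsequent FBRs follow the normal interval.
import java.util.concurrent.ThreadLocalRandom;

public class InitialDelaySketch {
  public static void main(String[] args) {
    long initialDelayMs = 120_000L;  // operator-chosen window (illustrative value)
    long firstFbrDelayMs = initialDelayMs > 0
        ? ThreadLocalRandom.current().nextLong(initialDelayMs)
        : 0L;
    // The bind: too small a window and the NN is still swamped at startup;
    // too large and every restart gets needlessly slower.
    System.out.println("First FBR in " + firstFbrDelayMs + " ms");
  }
}
{code}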

If we want to work on enforcing fairness on the NN-side, we can do that, but it seems unrelated
to this change to me.  It's also not something we currently do, so it would be nice to see
data showing that it was helpful.


> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 2.8.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.8.0
>
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, HDFS-7923.002.patch, HDFS-7923.003.patch, HDFS-7923.004.patch, HDFS-7923.006.patch, HDFS-7923.007.patch
>
>
> The DataNodes should rate-limit their full block reports.  They can do this by first sending a heartbeat message to the NN with an optional boolean set which requests permission to send a full block report.  If the NN responds with another optional boolean set, the DN will send an FBR... if not, it will wait until later.  This can be done compatibly with optional fields.
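
A minimal sketch of the handshake described in that summary, with hypothetical names (this illustrates the described design, not the actual patch's API):

{code:java}
// Hypothetical sketch of the heartbeat-gated FBR handshake described above.
// All names are illustrative; the real patch's RPC fields may differ.
interface NameNodeRpc {
  // The DN piggybacks an optional "may I send an FBR?" flag on its heartbeat.
  HeartbeatResponse sendHeartbeat(String dnId, boolean requestFullBlockReport);
  void blockReport(String dnId, long[] blockIds);
}

class HeartbeatResponse {
  final boolean fullBlockReportPermitted;  // optional field in the NN's reply
  HeartbeatResponse(boolean permitted) { this.fullBlockReportPermitted = permitted; }
}

class DataNodeHeartbeatLoop {
  void onHeartbeat(NameNodeRpc nn, String dnId, long[] blockIds, boolean fbrDue) {
    HeartbeatResponse resp = nn.sendHeartbeat(dnId, fbrDue);
    if (fbrDue && resp.fullBlockReportPermitted) {
      nn.blockReport(dnId, blockIds);  // granted: send the FBR now
    }
    // Denied: do nothing; the flag goes out again on the next heartbeat.
    // Because both flags are optional fields, old DNs and NNs interoperate.
  }
}
{code}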



