hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
Date Sat, 08 Apr 2017 01:11:41 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-11384:
---------------------------------------
    Attachment: HDFS-11384.003.patch

Here is a relatively simple patch that limits the Balancer to 20 RPC calls per second to the NN.
The limit of 20 calls per second is a constant for now. I chose it based on metrics from a large
cluster I was observing, so that Balancer calls cannot saturate the NN's RPC queue. Let me know
if people prefer it to be configurable.
On a large cluster with 200 (default) dispatcher threads and, e.g., 500 underutilized DNs (sources),
the initial 200 RPCs will be dispersed over 200 / 20 = 10 seconds. The remaining 300 RPCs
should disperse organically as they subsequently reuse the same 200 threads from the pool.
The patch includes a unit test that triggers the dispersion logic.
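
To make the 200 / 20 = 10 seconds arithmetic concrete, here is a minimal Python sketch of one way to spread calls so that no one-second slot holds more than 20 of them. It is not the code from the attached patch: the function name `dispersed_delays`, the constant name `MAX_CALLS_PER_SECOND`, and the choice of whole-second slots are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical name; the value 20 is the constant quoted in the message.
MAX_CALLS_PER_SECOND = 20


def dispersed_delays(num_calls, calls_per_second=MAX_CALLS_PER_SECOND):
    """Return a start delay in whole seconds for each call, so that no
    one-second slot contains more than calls_per_second calls."""
    if calls_per_second <= 0:
        raise ValueError("calls_per_second must be positive")
    # Call i lands in slot i // calls_per_second: calls 0..19 start
    # immediately, calls 20..39 start after 1 second, and so on.
    return [i // calls_per_second for i in range(num_calls)]


# The initial burst: 200 dispatcher threads, one getBlocks call each.
delays = dispersed_delays(200)
calls_per_slot = Counter(delays)
print(len(calls_per_slot), max(calls_per_slot.values()))  # prints "10 20"
```

With 200 initial calls the sketch fills 10 one-second slots with 20 calls each, which matches the 10 seconds estimated above. It models only the initial burst; the later calls that reuse threads from the pool are not simulated.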

> Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11384
>                 URL: https://issues.apache.org/jira/browse/HDFS-11384
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>    Affects Versions: 2.7.3
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes the NameNode's rpc.CallQueueLength to spike. We observed that this can cause HBase cluster failure due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

