hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
Date Sat, 22 Apr 2017 02:15:04 GMT

     [ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Konstantin Shvachko updated HDFS-11384:
    Attachment: HDFS-11384.006.patch

* You are right, the rate of {{getBlocks}} RPCs is not guaranteed. The Balancer can only do its
best. The actual rate can only be guaranteed on the NameNode, but we don't want to go there.
I made this clear in the comment for {{BALANCER_NUM_RPC_PER_SEC}}.
* Added a description for the delay.
* It is pretty hard to measure the rate of operations on the NN. Here is what I did.
I created a spy FSNamesystem. The spy calls a modified {{getBlocks()}} when the corresponding
RPC is invoked.
The modified {{getBlocks()}} first calls the original method, then records the number of calls
and the times of the first and the last call to {{getBlocks()}}. Given the number of calls
and the interval, we can estimate the rate later on.
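
For readers who want to picture the measurement idea, here is a minimal sketch. It is not the code from the patch. The class name CallRateProbe and its methods are hypothetical, the clock is injected so the logic can be checked without sleeping, and the wiring into a spied FSNamesystem is left out. The sketch divides the number of gaps between calls (calls minus one) by the interval; the actual test may compute the estimate differently.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of HDFS): counts calls and remembers
 * the time of the first and the last call, so that a rate can be
 * estimated afterwards.
 */
class CallRateProbe {
  private final LongSupplier clockMillis; // assumed time source, in milliseconds
  private long count;
  private long firstCallMillis;
  private long lastCallMillis;

  CallRateProbe(LongSupplier clockMillis) {
    this.clockMillis = clockMillis;
  }

  /** Meant to be invoked by the spy right after it delegates to the original method. */
  synchronized void recordCall() {
    long now = clockMillis.getAsLong();
    if (count == 0) {
      firstCallMillis = now;
    }
    lastCallMillis = now;
    count++;
  }

  synchronized long getCount() {
    return count;
  }

  /**
   * Estimated calls per second between the first and the last call.
   * Returns NaN when fewer than two calls were seen or the interval is zero.
   */
  synchronized double estimatedRatePerSec() {
    long intervalMillis = lastCallMillis - firstCallMillis;
    if (count < 2 || intervalMillis <= 0) {
      return Double.NaN;
    }
    // N calls delimit N - 1 gaps over the measured interval.
    return (count - 1) * 1000.0 / intervalMillis;
  }
}
```

In a test, the probe could be created with System::currentTimeMillis as the clock, and the estimated rate compared against the configured limit with some tolerance, since the Balancer only throttles on a best-effort basis.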

> Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength
> -------------------------------------------------------------------------------------------------
>                 Key: HDFS-11384
>                 URL: https://issues.apache.org/jira/browse/HDFS-11384
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>    Affects Versions: 2.7.3
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch, HDFS-11384.002.patch,
>                      HDFS-11384.003.patch, HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch
> Running the balancer on a Hadoop cluster that has more than 3000 DataNodes causes the
> NameNode's rpc.CallQueueLength to spike. We observed that this situation can cause HBase cluster
> failure due to RegionServer WAL timeouts.
