hadoop-hdfs-issues mailing list archives

From "Ming Ma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-6425) Large postponedMisreplicatedBlocks has impact on blockReport latency
Date Tue, 16 Dec 2014 00:28:14 GMT

     [ https://issues.apache.org/jira/browse/HDFS-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-6425:
    Attachment: HDFS-6425-3.patch

Thanks, Kihwal.

Here is the updated patch for trunk, based on a slightly different version. In rescanPostponedMisreplicatedBlocks,
instead of always picking the first blocksPerRescan blocks, the new version picks a random starting
point and rescans blocksPerRescan consecutive blocks from there. This handles the case where, for some
reason, some datanodes remain in the content-stale state for a long time and affect only the first
blocksPerRescan blocks; always starting from the front would keep rescanning those same blocks.
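
To illustrate the selection described above, here is a minimal sketch in Java. It is not the code from the attached patch: the class name PostponedRescanSketch, the method pickRescanWindow, and the use of a plain List are assumptions made for illustration. It only shows the idea of choosing a random start index and taking blocksPerRescan consecutive entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PostponedRescanSketch {
    /**
     * Hypothetical helper: returns up to blocksPerRescan consecutive entries
     * of the postponed list, starting at a randomly chosen index.
     */
    static <T> List<T> pickRescanWindow(List<T> postponed, int blocksPerRescan, Random rng) {
        int size = postponed.size();
        if (size == 0 || blocksPerRescan <= 0) {
            return new ArrayList<>();
        }
        // Never ask for more blocks than are currently postponed.
        int count = Math.min(blocksPerRescan, size);
        // Pick the start so that the window fits without wrapping around.
        int start = rng.nextInt(size - count + 1);
        return new ArrayList<>(postponed.subList(start, start + count));
    }

    public static void main(String[] args) {
        List<Long> postponed = new ArrayList<>();
        for (long id = 0; id < 10; id++) {
            postponed.add(id);
        }
        List<Long> window = pickRescanWindow(postponed, 3, new Random());
        System.out.println("Rescanning block ids: " + window);
    }
}
```

Because the start index changes on each rescan, a group of blocks stuck behind stale datanodes at the front of the collection no longer consumes every rescan.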

This new version has been running on our production clusters for a couple of months.

Regarding the root cause of the over replication: we did some analysis a while back. It could
be due to the IBR scenario you mentioned, but there are other sources as well.

1. The load balancer could create a spike of over replication in our clusters.
2. As part of the machine repair process, we used to bring "unformatted" machines back into the cluster.
3. Right after the NN starts up and leaves safe mode, but before all DNs have sent their block reports,
the NN appears to consider some blocks under replicated and starts replicating them. Later, after the
remaining DNs send their block reports, the NN ends up with over replicated blocks.
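
The third scenario can be walked through with a toy model. The sketch below is only an illustration: the class StartupOverReplicationSketch, the replication factor of 3, and the datanode names are all hypothetical, and it does not model the real NameNode logic. It just counts the replicas the NN knows about at each step.

```java
import java.util.HashSet;
import java.util.Set;

public class StartupOverReplicationSketch {
    // Assumed target replication for the example block.
    static final int REPLICATION_FACTOR = 3;

    /** Known replicas minus the target; negative means under replicated. */
    static int excess(Set<String> knownReplicas) {
        return knownReplicas.size() - REPLICATION_FACTOR;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>();
        // dn3 also holds a replica, but has not sent its block report yet.
        known.add("dn1");
        known.add("dn2");
        System.out.println(excess(known)); // -1: looks under replicated

        // The NN schedules one more replica on another datanode.
        known.add("dn4");
        System.out.println(excess(known)); // 0: looks fine

        // The late block report from dn3 arrives.
        known.add("dn3");
        System.out.println(excess(known)); // 1: now over replicated
    }
}
```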

> Large postponedMisreplicatedBlocks has impact on blockReport latency
> --------------------------------------------------------------------
>                 Key: HDFS-6425
>                 URL: https://issues.apache.org/jira/browse/HDFS-6425
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-6425-2.patch, HDFS-6425-3.patch, HDFS-6425-Test-Case.pdf, HDFS-6425.patch
> Sometimes we have a large number of over replicated blocks when the NN fails over. When the new active
> NN takes over, over replicated blocks are put into postponedMisreplicatedBlocks until none of the
> DNs for that block are stale anymore.
> We have a case where the NNs flip-flop. Before postponedMisreplicatedBlocks became empty,
> the NN failed over again and again, so postponedMisreplicatedBlocks just kept growing until the
> cluster became stable.
> In addition, a large postponedMisreplicatedBlocks could make rescanPostponedMisreplicatedBlocks
> slow. rescanPostponedMisreplicatedBlocks takes the write lock, so it could slow down block
> report processing.

This message was sent by Atlassian JIRA
