hadoop-hdfs-issues mailing list archives

From "Chen Liang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-12043) Add counters for block re-replication
Date Thu, 29 Jun 2017 00:02:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-12043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Chen Liang updated HDFS-12043:
    Attachment: HDFS-12043.002.patch

Thanks [~arpitagarwal] for the comments! Posted the v002 patch.

The other comments are addressed. Regarding the {{if (pendingNum > 0)}} branch, my understanding
is that it means some replications are still in progress, not necessarily that re-replication has failed.
A pending replication could still finish successfully, or it may time out and be counted by the timeout counter.
What do you think?

Also, the v002 patch moves the increment of the re-replication timeout counter to the place
where the timeout is detected, in {{PendingReconstructionBlocks}}'s thread. The v001 patch delayed
the increment by doing it in {{BlockManager}}'s thread.
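
To illustrate the difference, here is a minimal sketch of counting a timeout at detection time. It is not the code from either patch: the class {{PendingTrackerSketch}}, its methods, and the string block ids are hypothetical, and it uses only JDK classes. The scan pass increments the counter as soon as it sees an expired entry, so the metric does not wait for a consumer to drain the timed-out queue later.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical stand-in for a pending-replication tracker; not the HDFS class. */
class PendingTrackerSketch {
    // Block id -> time (ms) at which its re-replication was scheduled.
    private final Map<String, Long> startTimeMs = new HashMap<>();
    // Timed-out blocks waiting to be picked up by a separate consumer.
    private final Queue<String> timedOut = new ArrayDeque<>();
    // Cumulative number of timeouts detected so far.
    private final LongAdder timeoutCount = new LongAdder();
    private final long timeoutMs;

    PendingTrackerSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    synchronized void schedule(String blockId, long nowMs) {
        startTimeMs.put(blockId, nowMs);
    }

    /** One monitor pass: the counter is incremented here, at detection time. */
    synchronized void scanForTimeouts(long nowMs) {
        Iterator<Map.Entry<String, Long>> it = startTimeMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > timeoutMs) {
                String blockId = entry.getKey();
                it.remove();
                timedOut.add(blockId);
                timeoutCount.increment();
            }
        }
    }

    /** Called later by the consumer; draining does not change the counter. */
    synchronized String pollTimedOut() {
        return timedOut.poll();
    }

    long getTimeoutCount() {
        return timeoutCount.sum();
    }
}
```

With this arrangement the counter reflects a timeout even if the consumer has not yet called {{pollTimedOut()}}, which is the delay described for the v001 patch.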

> Add counters for block re-replication
> -------------------------------------
>                 Key: HDFS-12043
>                 URL: https://issues.apache.org/jira/browse/HDFS-12043
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Chen Liang
>            Assignee: Chen Liang
>         Attachments: HDFS-12043.001.patch, HDFS-12043.002.patch
> We occasionally see that the under-replicated block count is not going down quickly enough.
We've made at least one fix to speed up block replications (HDFS-9205) but we need better
insight into the current state and activity of the block re-replication logic. For example,
we need to understand whether re-replication is not making forward progress
at all, or whether new under-replicated blocks are being added faster than they are replicated.
> We should include additional metrics:
> # Cumulative number of blocks that were successfully replicated. 
> # Cumulative number of re-replications that timed out.
> # Cumulative number of blocks that were dequeued for re-replication but not scheduled
e.g. because they were invalid, or under-construction or replication was postponed.
> The growth rate of the above metrics will make it clear whether block replication
is making forward progress and, if not, provide potential clues about why it is stalled.
> Thanks [~arpitagarwal] for the offline discussions.
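
As a rough illustration of the three proposed metrics and of reading their growth rate, here is a minimal sketch. The class {{ReReplicationCountersSketch}} and all of its method names are hypothetical, and it uses only JDK classes rather than Hadoop's metrics library, so it should not be read as the patch's implementation.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical holder for the three proposed cumulative counters. */
class ReReplicationCountersSketch {
    private final LongAdder successful = new LongAdder();   // blocks successfully replicated
    private final LongAdder timedOut = new LongAdder();     // re-replications that timed out
    private final LongAdder notScheduled = new LongAdder(); // dequeued but not scheduled

    void incSuccessful() { successful.increment(); }
    void incTimedOut() { timedOut.increment(); }
    void incNotScheduled() { notScheduled.increment(); }

    long getSuccessful() { return successful.sum(); }
    long getTimedOut() { return timedOut.sum(); }
    long getNotScheduled() { return notScheduled.sum(); }

    /** Growth rate per second between two samples of a cumulative counter. */
    static double ratePerSecond(long earlier, long later, long elapsedMs) {
        if (elapsedMs <= 0) {
            throw new IllegalArgumentException("elapsedMs must be positive");
        }
        return (later - earlier) * 1000.0 / elapsedMs;
    }
}
```

Sampling each counter periodically and comparing rates gives the signal described above: a successful-replication rate near zero suggests re-replication is stalled, while a healthy rate alongside a growing under-replicated count suggests new blocks are arriving faster than they are repaired.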
