hadoop-hdfs-issues mailing list archives

From "Yi Liu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-6682) Add a metric to expose the timestamp of the oldest under-replicated block
Date Sat, 01 Aug 2015 00:33:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650070#comment-14650070
] 

Yi Liu edited comment on HDFS-6682 at 8/1/15 12:32 AM:
-------------------------------------------------------

Thanks Allen, Andrew and Akira for the discussion.

The original intention of solving the issue is good; thank you for working on it.  On the
discussion itself: Andrew's suggestion is good. Another option is to record the latest time
{{UnderReplicatedBlocks#chooseUnderReplicatedBlocks}} was called. We already have metrics for
{{underReplicatedBlocksCount/pendingReplicationBlocksCount/scheduledReplicationBlocksCount}},
so, if we really want to, we can tell whether, and how long ago, the under-replicated list was
last handled.  My point is that it is not worth recording the whole under-replicated list for
this metric.
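
A minimal sketch of the "record the latest time" option, in plain Java. {{LastScanGauge}} and its methods are hypothetical names for illustration, not existing Hadoop classes; the assumption is that {{markScanned()}} would be called from wherever the under-replicated queue is scanned, and that a single timestamp replaces any per-block bookkeeping.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Hadoop): remembers when the
 * under-replicated queue was last scanned, using one long instead of
 * a timestamp per block.
 */
final class LastScanGauge {
    private final LongSupplier clock;                 // injectable clock, in milliseconds
    private final AtomicLong lastScanMillis = new AtomicLong(-1L);

    LastScanGauge(LongSupplier clock) {
        this.clock = clock;
    }

    /** Call at the point where the under-replicated queue is scanned. */
    void markScanned() {
        lastScanMillis.set(clock.getAsLong());
    }

    /** Milliseconds since the last scan, or -1 if no scan has happened yet. */
    long millisSinceLastScan() {
        long last = lastScanMillis.get();
        return last < 0 ? -1L : clock.getAsLong() - last;
    }
}
```

In production the clock would typically be {{System::currentTimeMillis}}; passing it in keeps the gauge testable. Read together with the existing count metrics, a growing value would suggest the queue is not being handled.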

On the other hand, we have {{UnderReplicatedBlocks}} and {{PendingReplicationBlocks}}, right?
The replication monitor thread periodically picks up some under-replicated blocks. Unless the
NN stalls (e.g., full GC), the compute-replication work will always happen in some CPU time
slice. Of course it could be slow, since the NN may have many other things to handle (e.g.,
many requests), but if the NN is slow, we have many ways to know it.  Regarding Akira's comment
that the metric is also about the entire HDFS cluster: we are talking about DataNodes here. If
we want to judge cluster health from the replication point of view, I think the more correct
thing is to record the number of pending replication blocks ({{PendingReplicationBlocks}}) that
time out, which happens if the network is very busy or the target DNs are corrupted.
{{UnderReplicatedBlocks}} can't stand for that.
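
To illustrate the idea of counting timed-out pending replications, here is a small self-contained sketch. {{PendingTimeoutCounter}} is a hypothetical class, not the real {{PendingReplicationBlocks}} implementation; block IDs as {{long}} values and a fixed timeout in milliseconds are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/**
 * Hypothetical sketch (not Hadoop code): tracks when each replication was
 * scheduled and counts how many exceeded a timeout before completing.
 */
final class PendingTimeoutCounter {
    private final long timeoutMillis;
    private final Map<Long, Long> startedAt = new HashMap<>(); // blockId -> schedule time
    private long timedOutTotal = 0;

    PendingTimeoutCounter(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** A replication for this block was handed to a target DataNode. */
    synchronized void scheduled(long blockId, long nowMillis) {
        startedAt.put(blockId, nowMillis);
    }

    /** The target reported the replica; the block is no longer pending. */
    synchronized void completed(long blockId) {
        startedAt.remove(blockId);
    }

    /** Drops entries older than the timeout and returns how many were dropped. */
    synchronized int expire(long nowMillis) {
        int expired = 0;
        Iterator<Map.Entry<Long, Long>> it = startedAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > timeoutMillis) {
                it.remove();
                expired++;
            }
        }
        timedOutTotal += expired;
        return expired;
    }

    /** Cumulative number of timed-out replications: the value a metric would expose. */
    synchronized long timedOutTotal() {
        return timedOutTotal;
    }
}
```

The cumulative counter is cheap to maintain because it piggybacks on a periodic expiry pass, and a rising value would point at slow networks or unhealthy targets rather than at the NN itself.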

So if we want some metrics about replicated blocks in the NN, let's find a lightweight
way, as suggested. Thanks.





> Add a metric to expose the timestamp of the oldest under-replicated block
> -------------------------------------------------------------------------
>
>                 Key: HDFS-6682
>                 URL: https://issues.apache.org/jira/browse/HDFS-6682
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Akira AJISAKA
>            Assignee: Akira AJISAKA
>              Labels: metrics
>         Attachments: HDFS-6682.002.patch, HDFS-6682.003.patch, HDFS-6682.004.patch, HDFS-6682.005.patch,
HDFS-6682.006.patch, HDFS-6682.patch
>
>
> In the following case, the data in HDFS is lost and a client needs to put the same file again.
> # A Client puts a file to HDFS
> # A DataNode crashes before replicating a block of the file to other DataNodes
> I propose a metric to expose the timestamp of the oldest under-replicated/corrupt block. That way the client can know which file to retain for the retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
