hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3323) Name node should notify administrator if when struggling with replication
Date Tue, 15 Jul 2008 22:33:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613759#action_12613759 ]

Chris Douglas commented on HADOOP-3323:

After some discussion, it's become clear that this may be completed in two parts:

# A brief health check the namenode can perform itself
# A metrics-based solution tracking namenode throughput over time, capable of inferring more
complex and nuanced forms of distress

Work on (2) will fall out of a generalized metrics reporting and alerting mechanism to be
completed in concert with HADOOP-3719. The particular set of metrics and implementation will
remain in this JIRA. Specifically, the implementation will likely correlate the size of the
replication queue (FSNamesystemMetrics::pendingReplicationBlocks) with Datanode metrics tracking
replicated blocks (DataNodeMetrics::blocksReplicated) aggregated across the cluster. The intent
would be to track replication throughput, presuming that slow replication at the datanodes,
a slow-draining replication queue, and low storage capacity would accurately capture the conditions
called out here.
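
The correlation described above could look roughly like the following. This is only a sketch under stated assumptions, not the planned implementation: the class, the `Sample` type, and the `minBlocksPerSecond` threshold are hypothetical, and it assumes the cluster-wide sum of DataNodeMetrics::blocksReplicated is available as a cumulative counter alongside FSNamesystemMetrics::pendingReplicationBlocks.

```java
// Hypothetical sketch: none of these types exist in Hadoop.
class ReplicationHealthSketch {

    /** One observation of the two metrics, taken at the same moment. */
    static final class Sample {
        final long timeMillis;
        final long pendingReplicationBlocks; // size of the namenode replication queue
        final long blocksReplicatedTotal;    // assumed: cumulative sum over all datanodes

        Sample(long timeMillis, long pendingReplicationBlocks, long blocksReplicatedTotal) {
            this.timeMillis = timeMillis;
            this.pendingReplicationBlocks = pendingReplicationBlocks;
            this.blocksReplicatedTotal = blocksReplicatedTotal;
        }
    }

    /** Cluster-wide replication throughput between two samples, in blocks per second. */
    static double throughput(Sample earlier, Sample later) {
        double seconds = (later.timeMillis - earlier.timeMillis) / 1000.0;
        if (seconds <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        return (later.blocksReplicatedTotal - earlier.blocksReplicatedTotal) / seconds;
    }

    /**
     * Flags the condition where the queue is non-empty, is not shrinking,
     * and the datanodes replicate more slowly than the given threshold.
     */
    static boolean isStruggling(Sample earlier, Sample later, double minBlocksPerSecond) {
        boolean queueNotDraining =
            later.pendingReplicationBlocks > 0
                && later.pendingReplicationBlocks >= earlier.pendingReplicationBlocks;
        return queueNotDraining && throughput(earlier, later) < minBlocksPerSecond;
    }
}
```

A real version would also need to fold in storage capacity and smooth over more than two samples; both are left out here to keep the correlation itself visible.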

In a separate JIRA, (1) will track a ping-like facility for querying the baseline health of
the Namenode. In particular, it will verify that all expected threads are alive, perform inexpensive
sanity checks on data structures, etc. Administrators periodically running this check can
configure/attach to the notification scheme used in their deployment.
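
As a rough illustration of the thread-liveness part of such a check, the sketch below reports which expected threads are not running. The class name, the method, and the idea of passing the threads in as a name-to-thread map are assumptions made for the example; the actual facility is left to the separate JIRA.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not an existing Hadoop class.
class NamenodeHealthSketch {

    /**
     * Returns the names of expected threads that are missing or not alive.
     * An empty result means every expected thread is running.
     */
    static List<String> deadThreads(Map<String, Thread> expectedThreads) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Thread> entry : expectedThreads.entrySet()) {
            Thread t = entry.getValue();
            // isAlive() is false for threads never started or already terminated.
            if (t == null || !t.isAlive()) {
                dead.add(entry.getKey());
            }
        }
        return dead;
    }
}
```

An administrator's periodic job could call something like this and forward any non-empty result to whatever notification scheme the deployment already uses.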

> Name node should notify administrator if when struggling with replication
> -------------------------------------------------------------------------
>                 Key: HADOOP-3323
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3323
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Robert Chansler
> Name node performance suffers if either the replication queue is too big or the available
> space at data nodes is too small. In either case, the administrator should be notified.
> If the situation is really desperate, the name node perhaps should enter safe mode.

