hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
Date Wed, 01 Apr 2015 20:13:53 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391353#comment-14391353 ]

Andrew Wang commented on HDFS-7725:
-----------------------------------

Thanks for working on this, Ming. Nice find, and the patch looks basically good. Just a few comments:

I agree with Zhe's original review comment above: I think we should move the liveness check
for both start and stop into the heartbeat manager. This way the caller doesn't have to worry
about it.

It would also be good to add "alive" or "dead" to the first log in stopDecommission too, just
to give admins some more information about node state.

Do we also need assert checks in the test after recommissioning the dead node?

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
>                 Key: HDFS-7725
>                 URL: https://issues.apache.org/jira/browse/HDFS-7725
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7725-2.patch, HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1
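
For illustration, the two scenarios above can be replayed with a toy counter. This is only a sketch, not the actual HeartbeatManager code: the class, the methods, and the two flags (checkOnStart, checkOnStop) are hypothetical names made up here to model whether the liveness check is applied when starting and when stopping decommission.

```java
// Illustrative model only; NOT the real Hadoop HeartbeatManager.
// All names below are hypothetical and exist only for this sketch.
final class NodesInServiceSketch {

    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    static final class Stats {
        private final boolean checkOnStart; // liveness check in start-decommission path
        private final boolean checkOnStop;  // liveness check in stop-decommission path
        int nodesInService = 0;

        Stats(boolean checkOnStart, boolean checkOnStop) {
            this.checkOnStart = checkOnStart;
            this.checkOnStop = checkOnStop;
        }

        void register(Node n) {
            if (n.alive && !n.decommissioned) nodesInService++;
        }

        void markDead(Node n) {
            // A node leaving the live set no longer counts as in service.
            if (n.alive && !n.decommissioned) nodesInService--;
            n.alive = false;
        }

        void startDecommission(Node n) {
            if (n.decommissioned) return;
            // Without the liveness check, a dead node is subtracted a second time.
            if (!checkOnStart || n.alive) nodesInService--;
            n.decommissioned = true;
        }

        void stopDecommission(Node n) {
            if (!n.decommissioned) return;
            n.decommissioned = false;
            // Without the liveness check, a dead node is counted as in service again.
            if (!checkOnStop || n.alive) nodesInService++;
        }
    }

    /** Replays: one live node, node dies, decomm, recomm. Returns the counter after each step. */
    static int[] run(boolean checkOnStart, boolean checkOnStop) {
        Stats stats = new Stats(checkOnStart, checkOnStop);
        Node node = new Node();
        int[] trace = new int[4];
        stats.register(node);          trace[0] = stats.nodesInService;
        stats.markDead(node);          trace[1] = stats.nodesInService;
        stats.startDecommission(node); trace[2] = stats.nodesInService;
        stats.stopDecommission(node);  trace[3] = stats.nodesInService;
        return trace;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(false, false)));
        System.out.println(java.util.Arrays.toString(run(true, false)));
        System.out.println(java.util.Arrays.toString(run(true, true)));
    }
}
```

In this model, run(false, false) reaches -1 after the decomm step (the first scenario), run(true, false) ends at 1 after recomm even though the node is still dead (the second scenario), and run(true, true), with the check applied on both paths, stays at 0.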



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
