hadoop-hdfs-issues mailing list archives

From "Esteban Gutierrez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6184) Better health check from ZKFC
Date Wed, 02 Apr 2014 07:13:15 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957419#comment-13957419 ]

Esteban Gutierrez commented on HDFS-6184:
-----------------------------------------

{quote}
1. Have ZKFC make its decision based on an NN thread dump.
{quote}
I think if you can get a thread dump of the NN, that should be fine. My suggestion below is for 2.

{quote}
2. Have a dedicated RPC pool for ZKFC-to-NN calls. The health check doesn't need to acquire the NN global lock, so it can go through even if the NN is checkpointing or very busy.
{quote}
Have you tried setting {{dfs.namenode.servicerpc-address}} to a different port and bumping {{dfs.namenode.service.handler.count}} to a higher number? I have seen that work well to avoid these issues.
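
As a rough illustration of that suggestion, a dedicated service RPC endpoint might be configured in hdfs-site.xml along these lines. The nameservice ID (mycluster), NameNode IDs (nn1, nn2), hostnames, port 8022, and the handler count of 20 are placeholder assumptions, not values from this issue. In an HA setup the address key is normally suffixed with the nameservice and NameNode IDs, as sketched here.

```xml
<!-- hdfs-site.xml: sketch only; IDs, hosts, port, and count are assumed values -->
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8022</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8022</value>
</property>
<property>
  <!-- Handler threads serving the service RPC port -->
  <name>dfs.namenode.service.handler.count</name>
  <value>20</value>
</property>
```

With a service RPC address set, DataNode and ZKFC calls should go to that port instead of competing with client requests on the main RPC port. A suitable handler count depends on cluster size and would need tuning.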


> Better health check from ZKFC
> -----------------------------
>
>                 Key: HDFS-6184
>                 URL: https://issues.apache.org/jira/browse/HDFS-6184
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Ming Ma
>
> We have seen several false positives where ZKFC considers the NN to be unhealthy. Some of these trigger unnecessary failovers. Examples:
> 1. An SBN checkpoint caused ZKFC's RPC call into the NN to time out. The consequence isn't bad; the SBN just quits ZK membership and rejoins it later. But it is unnecessary. The reason is that the checkpoint acquires the NN global write lock and all RPC requests are blocked. Even though HAServiceProtocol.monitorHealth doesn't need to acquire the NN lock, it still needs to use the service RPC queue.
> 2. When the ANN is busy, the global lock can sometimes block other requests, so ZKFC's RPC call times out. This triggers a failover. The problem is that even after the failover, the new ANN might run into a similar issue.
> We can increase the ZKFC-to-NN timeout value to mitigate this to some degree. It would be useful if ZKFC could judge more accurately whether the NN is healthy and predict whether a failover will help. For example, we can:
> 1. Have ZKFC make its decision based on an NN thread dump.
> 2. Have a dedicated RPC pool for ZKFC-to-NN calls. The health check doesn't need to acquire the NN global lock, so it can go through even if the NN is checkpointing or very busy.
> Any comments?
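
The timeout mitigation mentioned in the description could look roughly like the fragment below, assuming the ZKFC health monitor reads {{ha.health-monitor.rpc-timeout.ms}} from core-site.xml. The value of 90000 ms is an arbitrary example rather than a recommendation from this issue; the stock default is commonly 45000 ms.

```xml
<!-- core-site.xml: sketch only; 90000 is an assumed example value -->
<property>
  <!-- Timeout for ZKFC's monitorHealth() RPC calls to the NN -->
  <name>ha.health-monitor.rpc-timeout.ms</name>
  <value>90000</value>
</property>
```

A longer timeout reduces false positives during checkpoints but also delays detection of a NameNode that is genuinely hung, so it only mitigates the problem.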



--
This message was sent by Atlassian JIRA
(v6.2#6252)
