hadoop-hdfs-issues mailing list archives

From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-9311) Support optional offload of NameNode HA service health checks to a separate RPC server.
Date Tue, 09 Feb 2016 22:10:18 GMT

     [ https://issues.apache.org/jira/browse/HDFS-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Nauroth updated HDFS-9311:
--------------------------------
    Release Note: There is now support for offloading HA health check RPC activity to a separate
RPC server endpoint running within the NameNode process.  This may improve reliability of
HA health checks and prevent spurious failovers in highly overloaded conditions.  For more
details, please refer to the hdfs-default.xml documentation for properties dfs.namenode.lifeline.rpc-address,
dfs.namenode.lifeline.rpc-bind-host and dfs.namenode.lifeline.handler.count.

> Support optional offload of NameNode HA service health checks to a separate RPC server.
> ---------------------------------------------------------------------------------------
>
>                 Key: HDFS-9311
>                 URL: https://issues.apache.org/jira/browse/HDFS-9311
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ha, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>             Fix For: 2.8.0
>
>         Attachments: HDFS-9311.001.patch, HDFS-9311.002.patch, HDFS-9311.003.patch
>
>
> When a NameNode is overwhelmed with load, its RPC handler pools (both client-facing and
service-facing) can be exhausted.  Eventually, this blocks the health check RPC issued from
ZKFC, which triggers a failover.  Depending on fencing configuration, the former active
NameNode may be killed.  In an overloaded situation, the new active NameNode is likely to
suffer the same fate, because client load patterns don't change after the failover.  This
can degenerate into flapping between the two NameNodes without real recovery.  If a NameNode
has been killed by fencing, it must then transition through safe mode, further delaying
recovery.
> This issue proposes a separate, optional RPC server at the NameNode for isolating the
HA health checks.  These health checks are lightweight operations that do not suffer from
contention issues on the namesystem lock or other shared resources.  Isolating the RPC handlers
is sufficient to avoid this situation.
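
The isolation argument above can be illustrated with a small, self-contained simulation.
This is only a sketch under stated assumptions and is not the code from the HDFS-9311
patches: plain java.util.concurrent thread pools stand in for the NameNode RPC handler
pools, and the names LifelineSketch, healthCheck and simulateOverload, along with the pool
sizes and the timeout, are hypothetical. It saturates a "main" pool with stuck requests
and then issues the same lightweight health check against the saturated pool and against
a separate one-thread "lifeline" pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LifelineSketch {

    /** Submits a trivial health check; returns true if it is answered within timeoutMs. */
    static boolean healthCheck(ExecutorService pool, long timeoutMs) throws Exception {
        Future<String> reply = pool.submit(() -> "HEALTHY");
        try {
            reply.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // No handler picked the check up in time; a monitor would treat this as a failure.
            reply.cancel(true);
            return false;
        }
    }

    /** Returns {answeredOnSharedPool, answeredOnLifelinePool} under simulated overload. */
    static boolean[] simulateOverload() throws Exception {
        int handlers = 2; // hypothetical size of the main handler pool
        ExecutorService mainPool = Executors.newFixedThreadPool(handlers);
        ExecutorService lifelinePool = Executors.newFixedThreadPool(1);
        CountDownLatch started = new CountDownLatch(handlers);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Occupy every main handler with a request that does not finish.
            for (int i = 0; i < handlers; i++) {
                mainPool.submit(() -> {
                    started.countDown();
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            started.await(); // all main handlers are now busy

            boolean shared = healthCheck(mainPool, 300);       // queues behind stuck requests
            boolean isolated = healthCheck(lifelinePool, 300); // has its own idle handler
            return new boolean[] {shared, isolated};
        } finally {
            release.countDown();
            mainPool.shutdownNow();
            lifelinePool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = simulateOverload();
        System.out.println("health check answered on shared pool:   " + result[0]);
        System.out.println("health check answered on lifeline pool: " + result[1]);
    }
}
```

In this simulation the check sent to the saturated pool times out while the check sent to
the separate pool is answered, which mirrors the reasoning that isolating the handlers is
enough when the health check itself does not contend for shared resources.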



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
