hadoop-common-issues mailing list archives

From "Zhe Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-13657) IPC Reader thread could silently die and leave NameNode unresponsive
Date Mon, 26 Sep 2016 16:33:20 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-13657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang updated HADOOP-13657:
-------------------------------
    Description: 
For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving
{{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}.

We experienced an incident in which the {{Reader}} thread for the HDFS NameNode died from a runtime
exception. The {{pendingConnections}} queue then filled up and the NameNode port became
inaccessible.

In our particular case, what killed the {{Reader}} was an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883.
In general, though, other types of runtime exceptions could cause the same problem.

We should add logic either to make the {{Reader}} more robust against runtime exceptions,
or at least to treat such an exception as FATAL so that the NameNode can fail over to the standby and admins
are alerted to the real issue.
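
The two options above can be sketched roughly as follows. This is an illustrative sketch only, not Hadoop's actual {{Reader}} code: {{ReaderSketch}}, {{Policy}}, {{processor}} and {{onFatal}} are hypothetical names, connections are modelled as plain strings, and the loop uses {{poll()}} so that it ends when the queue is empty, whereas a real reader would block waiting for work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical stand-in for an IPC reader loop; not Hadoop's real code. */
class ReaderSketch {
    /** What to do when handling one connection throws a RuntimeException. */
    enum Policy { CONTINUE, FATAL }

    private final BlockingQueue<String> pendingConnections;
    private final Consumer<String> processor; // hypothetical per-connection work
    private final Policy policy;
    private final Runnable onFatal;           // assumed hook, e.g. terminate so HA can fail over
    final List<String> errors = new ArrayList<>();

    ReaderSketch(BlockingQueue<String> pendingConnections,
                 Consumer<String> processor,
                 Policy policy,
                 Runnable onFatal) {
        this.pendingConnections = pendingConnections;
        this.processor = processor;
        this.policy = policy;
        this.onFatal = onFatal;
    }

    /** Handles queued connections until the queue is empty; returns how many succeeded. */
    int drain() {
        int processed = 0;
        String conn;
        while ((conn = pendingConnections.poll()) != null) {
            try {
                processor.accept(conn);
                processed++;
            } catch (RuntimeException e) {
                // Record the failure instead of letting the thread die silently.
                errors.add(conn + ": " + e);
                if (policy == Policy.FATAL) {
                    // Option 2: surface the failure loudly and stop reading.
                    onFatal.run();
                    return processed;
                }
                // Option 1: drop the bad connection and keep the reader alive.
            }
        }
        return processed;
    }
}
```

With {{Policy.CONTINUE}}, a connection that triggers a runtime exception is dropped and later connections are still served. With {{Policy.FATAL}}, the loop stops and {{onFatal}} runs; in a real deployment that hook might terminate the process so that the standby takes over.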

  was:
For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge of moving
{{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}.

We have experienced an incident where the {{Reader}} thread for HDFS NameNode died from run
time exception. Then the {{pendingConnections}} queue became full and the NameNode port became
inaccessible.

In our particular case, what killed {{Reader}} was a NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883.
But in general, other types of runtime exceptions could cause this issue as well.

We should add logic to either make the {{Reader}} more robust in case of runtime exceptions,
or at least treat it as a FATAL exception so that NameNode can fail over to standby, and admins
get alerted of the real issue.


> IPC Reader thread could silently die and leave NameNode unresponsive
> --------------------------------------------------------------------
>
>                 Key: HADOOP-13657
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13657
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>            Reporter: Zhe Zhang
>            Priority: Critical
>
> For each listening port, IPC {{Server#Listener#Reader}} is a single thread in charge
> of moving {{Connection}} items from {{pendingConnections}} (capacity 100) to the {{callQueue}}.
> We experienced an incident in which the {{Reader}} thread for the HDFS NameNode died from
> a runtime exception. The {{pendingConnections}} queue then filled up and the NameNode port
> became inaccessible.
> In our particular case, what killed the {{Reader}} was an NPE caused by https://bugs.openjdk.java.net/browse/JDK-8024883.
> In general, though, other types of runtime exceptions could cause the same problem.
> We should add logic either to make the {{Reader}} more robust against runtime exceptions,
> or at least to treat such an exception as FATAL so that the NameNode can fail over to the standby and admins
> are alerted to the real issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

