hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5140) Too many safemode monitor threads being created in the standby namenode causing it to fail with out of memory error
Date Thu, 29 Aug 2013 17:58:53 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753877#comment-13753877 ]

Jing Zhao commented on HDFS-5140:
---------------------------------

Looks like the problem is caused by the following code (FSNamesystem#checkMode):
{code}
      reached = now();
      smmthread = new Daemon(new SafeModeMonitor());
      smmthread.start();
      reportStatus("STATE* Safe mode extension entered.", true);
{code}

In the SBN, the block threshold keeps being adjusted while the edit log is tailed, so the
following sequence can repeat:

reach the block threshold and enter the final 30 seconds of safemode --> the block threshold
is adjusted and the number of safe blocks no longer meets the threshold --> reach the block
threshold again ...

Because of the above code, each time the block threshold is met, a new safemode monitor thread
is created while the old one keeps running in the background, so a large number of safemode
monitor threads can accumulate. The code is fine in the active NN (or the NN in a non-HA setup)
because the block threshold is not adjusted there, and once the NN leaves safemode it does not
enter it again.
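
For illustration only, here is a minimal self-contained sketch of the effect described above and of one possible guard (start a monitor only when none exists yet). The class SafeModeMonitorSketch, the guarded flag, and the latch are hypothetical stand-ins and are not FSNamesystem code; only the field name smmthread is taken from the snippet above. It is not necessarily the change that will go into the patch.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the safemode bookkeeping; not the real FSNamesystem code.
class SafeModeMonitorSketch {
    private final boolean guarded;   // true = skip creation when a monitor already exists
    private final CountDownLatch leftSafeMode = new CountDownLatch(1);
    private Thread smmthread;        // same field name as in the quoted snippet
    private int started;             // number of monitor threads created so far

    SafeModeMonitorSketch(boolean guarded) {
        this.guarded = guarded;
    }

    // Called every time the block threshold is (re)reached.
    synchronized void onThresholdReached() {
        if (guarded && smmthread != null) {
            return;                  // a monitor is already running; do not start another
        }
        smmthread = new Thread(() -> {
            try {
                leftSafeMode.await(); // stand-in for the monitor waiting to leave safemode
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        smmthread.setDaemon(true);
        smmthread.start();
        started++;
    }

    synchronized int startedCount() {
        return started;
    }

    // Releases every waiting monitor thread, as if safemode had been left.
    void leaveSafeMode() {
        leftSafeMode.countDown();
    }

    public static void main(String[] args) {
        SafeModeMonitorSketch unguarded = new SafeModeMonitorSketch(false);
        SafeModeMonitorSketch withGuard = new SafeModeMonitorSketch(true);
        // Simulate the threshold being met five times while tailing edits.
        for (int i = 0; i < 5; i++) {
            unguarded.onThresholdReached();
            withGuard.onThresholdReached();
        }
        System.out.println("unguarded monitors started: " + unguarded.startedCount());
        System.out.println("guarded monitors started: " + withGuard.startedCount());
        unguarded.leaveSafeMode();
        withGuard.leaveSafeMode();
    }
}
```

In the unguarded variant every call starts another thread that stays alive until safemode is left, which matches the growth seen in the standby log; with the guard, repeated calls reuse the single existing monitor.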
                
> Too many safemode monitor threads being created in the standby namenode causing it to fail with out of memory error
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5140
>                 URL: https://issues.apache.org/jira/browse/HDFS-5140
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.1.0-beta
>            Reporter: Arpit Gupta
>            Assignee: Jing Zhao
>            Priority: Blocker
>
> While running the namenode load generator with 100 threads for 10 minutes, the namenode was being failed over every 2 minutes.
> The standby namenode shut itself down as it ran out of memory and was not able to create another thread.
> When we searched for 'Safe mode extension entered' in the standby log, it was present 55000+ times.
