hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4233) NN keeps serving even after no journals started while rolling edit
Date Wed, 28 Nov 2012 22:25:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505972#comment-13505972 ]

Kihwal Lee commented on HDFS-4233:

bq. I am unsure of what fd exhaustion means (is it hitting nofile limits?),....

Yes. In a very big cluster, we've seen the NN run out of its 64K file descriptors. I was told
that the limit can be raised further (e.g. 1M) without much negative impact on performance, at
least on Linux. So there are ways to avoid it or minimize the possibility, but the NN still needs
to be able to deal with the situation.
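
As a rough illustration of the nofile limit discussed above, the sketch below reads a process's own RLIMIT_NOFILE values and counts its open descriptors. It is a generic Python example, not NameNode code; the helper names are made up for this sketch, and the descriptor count assumes a Linux-style /proc/self/fd (falling back to /dev/fd).

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Count descriptors currently open in this process.

    Assumption: a per-process fd directory exists (Linux /proc/self/fd,
    or /dev/fd on some other Unix systems).
    """
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            return len(os.listdir(fd_dir))
        except OSError:
            continue
    raise RuntimeError("no per-process fd directory available")


def fd_usage_ratio():
    """Fraction of the soft nofile limit currently in use (0.0 if unlimited)."""
    soft, _hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return 0.0
    return open_fd_count() / soft


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft limit:", soft, "hard limit:", hard)
    print("open fds:", open_fd_count())
    print("usage ratio: %.4f" % fd_usage_ratio())
```

A monitor built along these lines could warn well before the soft limit is reached, which complements raising the limit rather than replacing it.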

Monitoring and limiting the number of connections can be tricky. Ideally we want the average number
to be reasonable, but we also want the NN to absorb a short burst of requests instead of rejecting
them. The client-side retry mechanism will require some changes if IPC starts actively rejecting
requests. Things get very nasty if IPC connections get "reset" or fall into the SYN backlog
and stay there for long. Massive lease renewal failures will likely occur, and those will cause
block recoveries and so on. In short, protecting the namenode might be simple, but it can sometimes
actually hurt cluster availability.
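
One common way to bound the average rate while still absorbing a short burst is a token bucket. The sketch below is only an illustration of that general technique, not something Hadoop IPC is known to do from this thread; the class name, parameters, and the explicit clock argument are all assumptions made for the example.

```python
class TokenBucket:
    """Admit requests at a bounded average rate while allowing short bursts.

    Hypothetical helper for illustration; the caller passes the current
    time explicitly so the behaviour is deterministic.
    """

    def __init__(self, rate_per_sec, burst, now=0.0):
        self.rate = float(rate_per_sec)   # long-run average admissions per second
        self.capacity = float(burst)      # largest burst absorbed at once
        self.tokens = float(burst)        # start full so an initial burst passes
        self.last = float(now)

    def try_admit(self, now):
        """Return True if a request arriving at time `now` is admitted."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2, burst=5)
    # A burst of six requests at t=0: the bucket holds only five tokens.
    print([bucket.try_admit(0.0) for _ in range(6)])
    # One second later, two tokens have been refilled.
    print([bucket.try_admit(1.0) for _ in range(3)])
```

Even with such a limiter, the concern raised above still applies: rejected or stalled clients need a sensible retry path, or lease renewals can fail in bulk.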

> NN keeps serving even after no journals started while rolling edit
> ------------------------------------------------------------------
>                 Key: HDFS-4233
>                 URL: https://issues.apache.org/jira/browse/HDFS-4233
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.23.5
>            Reporter: Kihwal Lee
>            Priority: Critical
> We've seen the namenode keep serving even after a rollEditLog() failure. Instead of taking
a corrective action or regarding this condition as FATAL, it keeps on serving and modifying its
file system state. No logs are written from this point, so if the namenode is restarted, there
will be data loss.
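
The behaviour the description asks for can be sketched as a fail-fast check: if too few journals start a new segment during a roll, raise a fatal error instead of continuing to serve. This is a simplified Python illustration, not the actual HDFS implementation; `FakeJournal`, `FatalJournalError`, `roll_edit_log`, and `min_required` are hypothetical names introduced for the example.

```python
class FatalJournalError(RuntimeError):
    """Raised when too few journals are usable to keep serving safely."""


class FakeJournal:
    """Stand-in journal for illustration; `fail=True` simulates a bad disk."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.active = False

    def start_segment(self):
        if self.fail:
            raise OSError("cannot start log segment on " + self.name)
        self.active = True


def roll_edit_log(journals, min_required=1):
    """Start a new segment on every journal; fail hard if too few succeed."""
    started = []
    for journal in journals:
        try:
            journal.start_segment()
            started.append(journal)
        except OSError:
            # A single bad journal is tolerated; it is simply left out.
            continue
    if len(started) < min_required:
        # Continuing to modify state with no journal would lose edits on
        # restart, so treat this as fatal rather than keep serving.
        raise FatalJournalError(
            "only %d of %d journals started" % (len(started), len(journals))
        )
    return started


if __name__ == "__main__":
    healthy = FakeJournal("disk-a")
    broken = FakeJournal("disk-b", fail=True)
    print(len(roll_edit_log([healthy, broken])))
    try:
        roll_edit_log([broken])
    except FatalJournalError as err:
        print("fatal:", err)
```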

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
