hadoop-hdfs-issues mailing list archives

From "Suresh Srinivas (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2911) Gracefully handle OutOfMemoryErrors
Date Thu, 09 Feb 2012 04:07:59 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204235#comment-13204235 ]

Suresh Srinivas commented on HDFS-2911:
---------------------------------------

bq. @Eli ... as Todd points out not all OOMs are unrecoverable ...
bq. On the NN I'd rather see the critical threads all get uncaughtExceptionHandlers attached
which abort the NN if they fail. So if an individual rpc handler OOMEs (eg by an invalid request
making it try to allocate a 4G array or something) it won't take down the NN, whereas if the
LeaseManager OOMEs it should.
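
A minimal sketch of the quoted proposal, for illustration only: a critical thread is created with an uncaught exception handler that calls an abort hook when the thread dies. The names CriticalThreads, Aborter and newCriticalThread are hypothetical and are not taken from the Hadoop code base. A real daemon would presumably make the hook call Runtime.getRuntime().halt(1); here it only records the event so the sketch can run without killing the JVM.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CriticalThreads {
    /** Hypothetical abort hook; a real daemon would likely halt the JVM here. */
    interface Aborter {
        void abort(String threadName, Throwable cause);
    }

    /** Creates a thread whose uncaught Throwable (including OOME) triggers the abort hook. */
    static Thread newCriticalThread(String name, Runnable body, Aborter aborter) {
        Thread t = new Thread(body, name);
        // The handler runs on the dying thread, before the thread terminates.
        t.setUncaughtExceptionHandler((thread, err) -> aborter.abort(thread.getName(), err));
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean aborted = new AtomicBoolean(false);
        Thread t = newCriticalThread(
            "critical-thread",
            () -> { throw new OutOfMemoryError("simulated"); }, // stand-in for a real OOME
            (name, cause) -> {
                aborted.set(true);
                System.out.println("would abort: " + name + " died with " + cause);
            });
        t.start();
        t.join();
        System.out.println("abort requested: " + aborted.get());
    }
}
```

Threads created without such a handler (for example individual RPC handlers, under this proposal) would not trigger the abort hook.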

I think this may not be a good idea. In fact, I would say it is more important to shut down
the NN when an RPC handler gets an OOME. Let's say an RPC handler updated the in-memory namespace and
was about to add the change to the editlog. The system was indeed running out of memory, and before the
editlog could be written the handler got an OOME. If we do not shut down at this point, the in-memory
namespace would contain a change that was never logged, and we could end up with interesting data corruption issues.

Instead of trying to categorize which OOMEs are safe and which are not, we should use the kill -9 option.
In cases where the OOME is caused by the system trying to create a large object, we could add
appropriate size/limit checks.
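
The size/limit check mentioned above could look roughly like the following sketch. It assumes a length-prefixed payload and rejects a declared length above a limit before allocating the buffer. BoundedRead, readPayload and the 64 MB default are hypothetical names and values chosen for illustration; they are not part of the Hadoop RPC code.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class BoundedRead {
    /** Hypothetical default limit; a real value would come from configuration. */
    static final int MAX_PAYLOAD_BYTES = 64 * 1024 * 1024;

    /** Reads a 4-byte length prefix, validates it, and only then allocates the buffer. */
    static byte[] readPayload(DataInputStream in, int maxBytes) throws IOException {
        int len = in.readInt();
        if (len < 0 || len > maxBytes) {
            // Fail the request instead of attempting a huge allocation.
            throw new IOException("Rejecting payload of declared length " + len
                + " (limit " + maxBytes + ")");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] ok = {0, 0, 0, 3, 10, 20, 30};
        byte[] payload = readPayload(
            new DataInputStream(new ByteArrayInputStream(ok)), MAX_PAYLOAD_BYTES);
        System.out.println("accepted " + payload.length + " bytes");

        // Declared length is Integer.MAX_VALUE, far above the limit.
        byte[] huge = {0x7f, (byte) 0xff, (byte) 0xff, (byte) 0xff};
        try {
            readPayload(new DataInputStream(new ByteArrayInputStream(huge)), MAX_PAYLOAD_BYTES);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

As for the "kill -9 option", one way to get that behavior on a HotSpot JVM is the flag -XX:OnOutOfMemoryError="kill -9 %p", which runs the given command when an OutOfMemoryError is first thrown. That this is the option meant here is an assumption.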
                
> Gracefully handle OutOfMemoryErrors
> -----------------------------------
>
>                 Key: HDFS-2911
>                 URL: https://issues.apache.org/jira/browse/HDFS-2911
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node, name-node
>    Affects Versions: 0.23.0, 1.0.0
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>
> We should gracefully handle j.l.OutOfMemoryError exceptions in the NN or DN. We should
catch them in a high-level handler, cleanly fail the RPC (vs sending back the OOM stack trace)
or background thread, and shut down the NN or DN. Currently the process is left in a not
well-tested state (continuously fails RPCs and internal threads, may or may not recover and doesn't
shut down gracefully).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
