hadoop-hdfs-issues mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4475) OutOfMemory by BPServiceActor.offerService() takes down DataNode
Date Fri, 08 Feb 2013 01:49:13 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13574152#comment-13574152 ]

Konstantin Shvachko commented on HDFS-4475:
-------------------------------------------

I understand we are not catching OOM. But the problem still remains: people starting the cluster
with the default configuration end up with a dysfunctional cluster and dead DataNodes.

I propose adjusting the default configuration to avoid the problem. There is a clear imbalance
between the default heap size (128 MB) and the number of threads we allow. Either the default
heap size should increase or the number of threads should go down.
Plamen, you have a reproducible configuration that crashes the cluster. Could you investigate how
many BP threads 128 MB can hold? You could reduce the thread count gradually until the cluster
no longer crashes.
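
A hedged sketch of the heap-size half of that proposal: in conf/hadoop-env.sh, the DataNode heap
can be raised via HADOOP_DATANODE_OPTS. The 1024m value below is an arbitrary illustration, not a
tested recommendation.

    # conf/hadoop-env.sh -- raise the DataNode heap above the 128 MB default.
    # 1024m is an illustrative value; tune it against the BP thread count.
    export HADOOP_DATANODE_OPTS="-Xmx1024m $HADOOP_DATANODE_OPTS"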
                
> OutOfMemory by BPServiceActor.offerService() takes down DataNode
> ----------------------------------------------------------------
>
>                 Key: HDFS-4475
>                 URL: https://issues.apache.org/jira/browse/HDFS-4475
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.0.3-alpha
>            Reporter: Plamen Jeliazkov
>            Assignee: Plamen Jeliazkov
>             Fix For: 3.0.0, 2.0.3-alpha
>
>
> In DataNode, there are catches around the BPServiceActor.offerService() call, but no catch for OutOfMemoryError as there is for DataXceiver (introduced in 0.22.0).
> The issue can be replicated like this:
> 1) Create a cluster of X DataNodes and 1 NameNode with low memory settings (-Xmx128M or similar).
> 2) Flood HDFS with small file creations (any operation should work, actually); see the sketch after this list.
> 3) DataNodes will hit OOM, stop the block pool service, and shut down.
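>
> A minimal sketch of step 2 (illustrative, not from the original report), assuming the client's default Configuration points at the test cluster; the /flood path and file count are arbitrary:
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FSDataOutputStream;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     public class SmallFileFlood {
>       public static void main(String[] args) throws Exception {
>         FileSystem fs = FileSystem.get(new Configuration());
>         // Each tiny file adds block and replica bookkeeping on the DataNodes.
>         for (int i = 0; i < 1000000; i++) {
>           FSDataOutputStream out = fs.create(new Path("/flood/file-" + i));
>           out.writeBytes("x"); // one byte of payload is enough
>           out.close();
>         }
>       }
>     }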
> The resolution is to catch the OutOfMemoryError and handle it properly when calling BPServiceActor.offerService() in DataNode.java, as is done in Hadoop 0.22.0. DataNodes should not shut down or crash, but should remain in a sort of frozen state until memory pressure is resolved by GC.
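>
> A minimal sketch of that handling (illustrative; the loop and member names below are not the actual BPServiceActor code), wrapping the offerService() call the way 0.22.0 wraps DataXceiver:
>
>     // Inside the actor's run loop; LOG, shouldRun, and offerService()
>     // stand in for the real BPServiceActor members.
>     while (shouldRun) {
>       try {
>         offerService();
>       } catch (OutOfMemoryError oom) {
>         // Don't shut down; log, back off, and let GC reclaim memory.
>         LOG.error("OutOfMemoryError in offerService(), backing off", oom);
>         try {
>           Thread.sleep(30000); // arbitrary back-off interval
>         } catch (InterruptedException ie) {
>           return;
>         }
>       } catch (Exception e) {
>         LOG.warn("Unexpected exception in block pool service", e);
>       }
>     }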
> LOG ERROR:
> 2013-02-04 11:46:01,854 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected exception in block pool Block pool BP-1105714849-10.10.10.110-1360005776467 (storage id DS-1952316202-10.10.10.112-50010-1360005820993) service to vmhost2-vm0/10.10.10.110:8020
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> 2013-02-04 11:46:01,854 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1105714849-10.10.10.110-1360005776467 (storage id DS-1952316202-10.10.10.112-50010-1360005820993) service to vmhost2-vm0/10.10.10.110:8020

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
