hadoop-hdfs-issues mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-13393) Improve OOM logging
Date Tue, 03 Apr 2018 21:38:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-13393:
-----------------------------------
    Description: 
It is not uncommon to find "java.lang.OutOfMemoryError: unable to create new native thread"
errors in an HDFS cluster. Most often this happens when a DataNode is creating DataXceiver
threads, or when the balancer creates threads for moving blocks around.

In most cases, the "OOM" is a symptom of the number of threads reaching a system limit, rather
than of actually running out of memory, and the current logging of this message is usually
misleading (it suggests the cause is insufficient memory).

How about catching the OOM and, if it is due to "unable to create new native thread", printing
a more helpful message like "bump your ulimit" or "take a jstack of the process"?

Even better, surface this error to make it more visible. It usually takes a while before an
in-depth investigation starts after users notice that some job failed, and by that time the
evidence (like jstack output) may already be gone.
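
To make the proposal concrete, here is a minimal sketch of what the check could look like.
It is not taken from the HDFS code base: the class name OomLoggingSketch, the helper names,
and the wording of the hint are all hypothetical, and it matches on the message text quoted
above, which may differ across JVM versions.

```java
/**
 * Illustrative sketch only; not HDFS code. All names here are hypothetical.
 */
class OomLoggingSketch {
    // Message text as quoted in this issue; assumption: other JVM versions
    // may word it differently, so a real patch might need a looser match.
    static final String NATIVE_THREAD_MSG = "unable to create new native thread";

    /** Returns true if the error looks like a thread-limit OOM rather than heap exhaustion. */
    static boolean isThreadLimitOom(Throwable t) {
        return t instanceof OutOfMemoryError
            && t.getMessage() != null
            && t.getMessage().contains(NATIVE_THREAD_MSG);
    }

    /** Builds the log line, adding a hint when the OOM is about thread creation. */
    static String describe(OutOfMemoryError e) {
        if (isThreadLimitOom(e)) {
            return "OutOfMemoryError: " + e.getMessage()
                + ". This usually means a thread limit was reached, not that memory ran out."
                + " Consider bumping your ulimit and taking a jstack of the process now.";
        }
        return "OutOfMemoryError: " + e.getMessage();
    }

    /** Starts a thread and logs a more helpful message if thread creation fails. */
    static Thread startWorker(Runnable task, String name) {
        Thread t = new Thread(task, name);
        try {
            t.start();
        } catch (OutOfMemoryError e) {
            System.err.println(describe(e));
            throw e; // still propagate; the sketch only improves the logging
        }
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate both kinds of OOM instead of actually exhausting threads.
        System.out.println(describe(new OutOfMemoryError(NATIVE_THREAD_MSG)));
        System.out.println(describe(new OutOfMemoryError("Java heap space")));

        Thread t = startWorker(() -> System.out.println("worker ran"), "sketch-worker");
        t.join();
    }
}
```

The error is rethrown after logging, so existing error handling is unchanged and only the
message becomes more actionable. Making the error more visible (for example via a metric)
would be a separate step that this sketch does not cover.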

  was:
It is not uncommon to find "java.lang.OutOfMemoryError: unable to create new native thread"
errors in an HDFS cluster. Most often this happens when a DataNode is creating DataXceiver
threads, or when the balancer creates threads for moving blocks around.

In most cases, the "OOM" is a symptom of the number of threads reaching a system limit, rather
than of actually running out of memory, and the current logging of this message is usually
misleading (it suggests the cause is insufficient memory).

How about catching the OOM and, if it is due to "unable to create new native thread", printing
a more helpful message like "bump your ulimit" or "take a jstack of the process"?


> Improve OOM logging
> -------------------
>
>                 Key: HDFS-13393
>                 URL: https://issues.apache.org/jira/browse/HDFS-13393
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover, datanode
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>
> It is not uncommon to find "java.lang.OutOfMemoryError: unable to create new native thread"
> errors in an HDFS cluster. Most often this happens when a DataNode is creating DataXceiver
> threads, or when the balancer creates threads for moving blocks around.
> In most cases, the "OOM" is a symptom of the number of threads reaching a system limit,
> rather than of actually running out of memory, and the current logging of this message is
> usually misleading (it suggests the cause is insufficient memory).
> How about catching the OOM and, if it is due to "unable to create new native thread",
> printing a more helpful message like "bump your ulimit" or "take a jstack of the process"?
> Even better, surface this error to make it more visible. It usually takes a while before
> an in-depth investigation starts after users notice that some job failed, and by that time
> the evidence (like jstack output) may already be gone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

