Date: Wed, 22 Oct 2014 15:27:34 +0000 (UTC)
From: "Ming Ma (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2714) Localizer thread might stuck if NM is OOM

    [ https://issues.apache.org/jira/browse/YARN-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180042#comment-14180042 ]

Ming Ma commented on YARN-2714:
-------------------------------

Thanks, Zhihai, for the information. Yes, setting the RPC timeout at the hadoop-common layer will address the issue. The other suggestions might still be good to have even with an RPC timeout; we can open separate JIRAs if necessary.

> Localizer thread might stuck if NM is OOM
> -----------------------------------------
>
>                 Key: YARN-2714
>                 URL: https://issues.apache.org/jira/browse/YARN-2714
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ming Ma
>
> When the NM JVM runs out of memory, this is normally an uncaught exception and the process exits. But the RPC server used by the node manager catches OutOfMemoryError to give GC a chance to catch up, so the NM doesn't need to exit and can recover from the OutOfMemoryError situation.
> However, in some rare situations when this happens, one of the NM localizer threads never got the RPC response from the node manager and just waited there. The node manager's RPC server doesn't respond because its responder thread swallowed the OutOfMemoryError and never processed the outstanding RPC response. On the RPC client side, the RPC timeout is set to 0, so the client relies only on ping to detect RPC server availability.
> {noformat}
> Thread 481 (LocalizerRunner for container_1413487737702_2948_01_013383):
>   State: WAITING
>   Blocked count: 27
>   Waited count: 84
>   Waiting on org.apache.hadoop.ipc.Client$Call@6be5add3
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:503)
>     org.apache.hadoop.ipc.Client.call(Client.java:1396)
>     org.apache.hadoop.ipc.Client.call(Client.java:1363)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>     com.sun.proxy.$Proxy36.heartbeat(Unknown Source)
>     org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
>     org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:235)
>     org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
>     org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:107)
>     org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:995)
> {noformat}
> The consequence depends on which ContainerExecutor the NM uses. With DefaultContainerExecutor, whose startLocalizer method is synchronized, the stuck thread blocks all other localizer threads. With LinuxContainerExecutor, other localizer threads can at least still proceed, but in theory the hang can slowly drain all available localizer threads.
> There are a couple of ways to fix it; some of these fixes are complementary.
> 1. Fix it at the hadoop-common layer. The RPC server hosted by worker services such as the NM doesn't really need to catch OutOfMemoryError; the service JVM can just exit. Even for the NN and RM, given that we have HA, it might be OK to do so.
> 2. Set an RPC timeout at the HadoopYarnProtoRPC layer so that all YARN clients time out if the RPC server drops the response (see the first sketch below).
> 3. Fix it in the YARN localization service. For example:
> a) Fix DefaultContainerExecutor so that synchronization isn't required for the startLocalizer method.
> b) The download executor thread used by ContainerLocalizer currently catches all exceptions. We can fix ContainerLocalizer so that when the download executor thread catches OutOfMemoryError, it exits its host process (see the second sketch below).
> IMHO, fixing it at the RPC server layer is better, as it also addresses other scenarios. Appreciate any input others might have.
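
A minimal sketch of the client-side timeout idea behind the comment above and fix #2, assuming only the standard hadoop-common configuration keys ipc.client.rpc-timeout.ms and ipc.client.ping; the class and method names are made up for illustration, and this is not the actual YARN-2714 patch:

{noformat}
import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative only: gives the IPC client a finite RPC timeout so a call such as
 * LocalizationProtocol.heartbeat() fails with a timeout instead of waiting forever
 * when the server's responder thread has swallowed an OutOfMemoryError and never
 * sends the response.
 */
public class LocalizerRpcTimeoutSketch {

  /** Returns the given Configuration with a 60s client-side RPC timeout set. */
  public static Configuration withRpcTimeout(Configuration conf) {
    // The default is 0, which disables the timeout; the client then relies on
    // ping alone, and ping only proves the connection is alive, not that the
    // server will ever answer the outstanding call.
    conf.setInt("ipc.client.rpc-timeout.ms", 60 * 1000);
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = withRpcTimeout(new Configuration());
    System.out.println("ipc.client.rpc-timeout.ms = "
        + conf.getInt("ipc.client.rpc-timeout.ms", 0));
  }
}
{noformat}

Note that the timeout and ping settings interact differently across Hadoop releases (on older versions the timeout may only be honored when ipc.client.ping is disabled), so the exact combination would need to be verified for the target version.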
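
Fixes #1 and #3b share the same fail-fast idea: a worker-side thread should not swallow OutOfMemoryError but should terminate the JVM so the service can be restarted. A minimal sketch, assuming hadoop-common's ExitUtil; the wrapper class and the wrapped Runnable are hypothetical and not the actual ContainerLocalizer code:

{noformat}
import org.apache.hadoop.util.ExitUtil;

/**
 * Illustrative only: wraps a unit of work (e.g. a download task submitted by a
 * localizer) so that an OutOfMemoryError terminates the process instead of being
 * swallowed, which is what leaves the RPC peer waiting on a response forever.
 */
public class OomFailFast implements Runnable {
  private final Runnable work;

  public OomFailFast(Runnable work) {
    this.work = work;
  }

  @Override
  public void run() {
    try {
      work.run();
    } catch (OutOfMemoryError oom) {
      // Fail fast: let the surrounding service be restarted cleanly rather than
      // limp along with threads that never reply to their peers.
      ExitUtil.terminate(-1, "OutOfMemoryError in worker thread: " + oom);
    }
  }
}
{noformat}

Whether this belongs in the RPC server itself (fix #1) or only in the localizer's download threads (fix #3b) is exactly the trade-off discussed in the list above.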