Date: Wed, 22 Mar 2017 20:10:41 +0000 (UTC)
From: "Mingliang Liu (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-14214) DomainSocketWatcher::add()/delete() should not self interrupt while looping await()

    [ https://issues.apache.org/jira/browse/HADOOP-14214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937039#comment-15937039 ]

Mingliang Liu commented on HADOOP-14214:
----------------------------------------

Ping [~cmccabe], [~jnp], [~arpitagarwal] for discussion.

> DomainSocketWatcher::add()/delete() should not self interrupt while looping await()
> -----------------------------------------------------------------------------------
>
>                 Key: HADOOP-14214
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14214
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Mingliang Liu
>            Assignee: Mingliang Liu
>
> Our Hive team found a TPCDS job whose queries running on LLAP seemed to be getting stuck.
> Dozens of threads were waiting for the {{DfsClientShmManager::lock}}, as the following jstack shows:
> {code}
> Thread 251 (IO-Elevator-Thread-5):
>   State: WAITING
>   Blocked count: 3871
>   Waited count: 4565
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@16ead198
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:255)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
>     org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
>     org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
>     org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
>     org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
>     org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
>     org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
>     org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:111)
>     org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)
>     org.apache.hadoop.hive.llap.io.metadata.OrcStripeMetadata.<init>(OrcStripeMetadata.java:64)
>     org.apache.hadoop.hive.llap.io.encoded.OrcEncodedDataReader.readStripesMetadata(OrcEncodedDataReader.java:622)
> {code}
> The thread that is expected to signal those threads is calling the {{DomainSocketWatcher::add()}} method, but it gets stuck there dealing with an InterruptedException infinitely. Its jstack looks like:
> {code}
> Thread 44417 (TezTR-257387_2840_12_10_52_0):
>   State: RUNNABLE
>   Blocked count: 3
>   Waited count: 5
>   Stack:
>     java.lang.Throwable.fillInStackTrace(Native Method)
>     java.lang.Throwable.fillInStackTrace(Throwable.java:783)
>     java.lang.Throwable.<init>(Throwable.java:250)
>     java.lang.Exception.<init>(Exception.java:54)
>     java.lang.InterruptedException.<init>(InterruptedException.java:57)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2034)
>     org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:325)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager$EndpointShmManager.allocSlot(DfsClientShmManager.java:266)
>     org.apache.hadoop.hdfs.shortcircuit.DfsClientShmManager.allocSlot(DfsClientShmManager.java:434)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.allocShmSlot(ShortCircuitCache.java:1017)
>     org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:476)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:784)
>     org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:718)
>     org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
>     org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
>     org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1181)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1118)
>     org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1478)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1441)
>     org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> {code}
> The whole job makes no progress because of this.
> The thread in {{DomainSocketWatcher::add()}} is expected to eventually break out of the while loop, where it waits for the newly added entry to be deleted by another thread. However, if this thread is ever interrupted, chances are that it will spin forever without ever releasing the lock, so {{!toAdd.contains(entry)}} will always be false.
> {code:title=DomainSocketWatcher::add()}
> public void add(DomainSocket sock, Handler handler) {
>   lock.lock();
>   try {
>     ......
>     toAdd.add(entry);
>     kick();
>     while (true) {
>       try {
>         processedCond.await();
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>       if (!toAdd.contains(entry)) {
>         break;
>       }
>     }
>   } finally {
>     lock.unlock();
>   }
> }
> {code}
> The reason is that this method catches the InterruptedException and self-interrupts during await(). The await() method internally calls {{AbstractQueuedSynchronizer::await()}}, which throws a new InterruptedException immediately if the thread's interrupt status is set:
> {code:title=AbstractQueuedSynchronizer::await()}
> public final void await() throws InterruptedException {
>   if (Thread.interrupted())
>     throw new InterruptedException();
>   Node node = addConditionWaiter();
>   ...
> {code}
> Our code in {{DomainSocketWatcher::add()}} catches this exception (again) and self-interrupts (again). Note that throughout this process the associated lock is never released, so the other thread, which is supposed to make {{!toAdd.contains(entry)}} true, is still pending on the lock.
> {{DomainSocketWatcher::delete()}} has similar code logic and should suffer from the same problem.
> Thanks [~jdere] for testing and reporting this.
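To see why the loop spins rather than blocks, here is a minimal stand-alone demonstration of the interaction described above (the class and method names are hypothetical, not from the Hadoop source): once the catch block re-arms the interrupt flag, every subsequent await() throws before parking, so the thread never releases the lock and never waits.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class AwaitSpinDemo {
    // Counts how many times Condition.await() throws InterruptedException
    // without ever parking, when the catch block re-interrupts the thread
    // each time -- the same pattern as DomainSocketWatcher::add().
    static int spinCount(int iterations) {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        int thrown = 0;
        lock.lock();
        try {
            Thread.currentThread().interrupt(); // simulate a pending interrupt
            for (int i = 0; i < iterations; i++) {
                try {
                    cond.await(); // sees the interrupt status and throws immediately
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // the self-interrupt re-arms the flag
                    thrown++;
                }
            }
        } finally {
            lock.unlock();
            Thread.interrupted(); // clear the flag so callers are unaffected
        }
        return thrown;
    }

    public static void main(String[] args) {
        // Every iteration throws instantly; the loop never blocks, and the
        // lock is never available to the thread that would signal us.
        System.out.println(AwaitSpinDemo.spinCount(5)); // prints 5
    }
}
```

Each call to await() clears the interrupt status and throws, the catch block sets it again, and the cycle repeats with the lock held throughout, which matches the RUNNABLE state and the InterruptedException constructor frames in the jstack above.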
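For reference, one common way to structure such a loop (a sketch only, not necessarily the patch ultimately committed for this issue; the helper name awaitWhileHolding is hypothetical) is to remember the interrupt and restore the status once on exit, so that await() can actually park and release the lock while waiting:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;

public class DeferredInterrupt {
    // Waits on cond (whose lock the caller must hold) until done holds.
    // Interrupts are remembered rather than re-armed inside the loop, so
    // await() can park and release the lock; the interrupt status is
    // restored once, on the way out, for the caller to observe.
    static void awaitWhileHolding(Condition cond, BooleanSupplier done) {
        boolean interrupted = false;
        while (!done.getAsBoolean()) {
            try {
                cond.await(); // parks, releasing the lock while waiting
            } catch (InterruptedException e) {
                interrupted = true; // remember, but do not re-arm the flag here
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore status for the caller
        }
    }

    public static void main(String[] args) throws Exception {
        ReentrantLock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        boolean[] flag = {false};
        // A second thread flips the flag and signals, just as the watcher
        // thread would remove the entry and signal processedCond.
        Thread signaller = new Thread(() -> {
            lock.lock();
            try {
                flag[0] = true;
                cond.signalAll();
            } finally {
                lock.unlock();
            }
        });
        lock.lock();
        try {
            Thread.currentThread().interrupt(); // pending interrupt, as in the bug
            signaller.start();                  // blocks on lock.lock() until we park
            awaitWhileHolding(cond, () -> flag[0]);
        } finally {
            lock.unlock();
        }
        boolean restored = Thread.interrupted(); // observe (and clear) the restored status
        signaller.join();
        System.out.println("interrupt restored: " + restored); // prints true
    }
}
```

The first await() still throws because of the pending interrupt, but since the flag is not re-armed, the second iteration parks normally, the signalling thread can acquire the lock and make progress, and the waiter exits the loop with its interrupt status intact. Condition also offers awaitUninterruptibly() for the cases where the wait must not be interruptible at all.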
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org