Date: Mon, 13 Apr 2015 22:11:13 +0000 (UTC)
From: "Eric Payne (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-11802) DomainSocketWatcher#watcherThread can encounter IllegalStateException in finally block when calling sendCallback

    [ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493182#comment-14493182 ]

Eric Payne commented on HADOOP-11802:
-------------------------------------

Thanks again, [~cmccabe], for your comments and for taking the time on this issue. One thing to note is that just prior to these problems, a 195-second GC pause was taking place on the DN. I added a catch of {{Throwable}} in the main thread of the {{DomainSocketWatcher}} and reproduced the problem.
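For reference, the diagnostic change amounts to wrapping the watcher loop's body so that any {{Throwable}} is logged instead of silently killing the thread. Below is a minimal, stand-alone sketch of that pattern; {{runOnce}} and the class name are illustrative, not the actual Hadoop code.

```java
// Stand-alone model of the diagnostic change: catch Throwable around the
// watcher thread's loop body so errors such as IllegalArgumentException
// are surfaced rather than terminating the thread silently.
public class WatcherDiagnostics {
    // Runs one iteration of a watcher-style loop body; returns the
    // Throwable it died on, or null if it completed normally.
    static Throwable runOnce(Runnable loopBody) {
        try {
            loopBody.run();
            return null;
        } catch (Throwable t) {
            // In the real DataNode this would go through the configured logger.
            System.err.println("watcherThread terminating on Throwable: " + t);
            return t;
        }
    }

    public static void main(String[] args) {
        Throwable t = runOnce(() -> {
            throw new IllegalArgumentException(
                "file descriptor was closed while still in the poll(2) loop");
        });
        System.out.println(t.getClass().getSimpleName());
    }
}
```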
AFAICT, the following represents what is happening:

- A request for a short-circuit read is received.
- {{DataXceiver#requestShortCircuitShm}} calls {{ShortCircuitRegistry#createNewMemorySegment}}, which creates a shared memory segment and associates it with the passed domain socket in the {{DomainSocketWatcher}}. Then, in that thread, {{createNewMemorySegment}} waits on that socket/shm entry in {{DomainSocketWatcher#add}}.
{code}
  public NewShmInfo createNewMemorySegment(String clientName, ...
      watcher.add(sock, shm);
      ...
{code}
- It's at this point that things get confusing, and I'm still working out why this happens. The wait wakes up, but things are not normal, even though it was not woken up by an exception. You can tell that no exception was thrown inside {{createNewMemorySegment}} to wake it up, because the following code goes on to call {{sendShmSuccessResponse}}, which is where the next bad thing happens:
{code}
  public void requestShortCircuitShm(String clientName) throws IOException {
    ...
    try {
      shmInfo = datanode.shortCircuitRegistry.
          createNewMemorySegment(clientName, sock);
      // After calling #{ShortCircuitRegistry#createNewMemorySegment}, the
      // socket is managed by the DomainSocketWatcher, not the DataXceiver.
      releaseSocket();
    } catch (UnsupportedOperationException e) {
      sendShmErrorResponse(ERROR_UNSUPPORTED,
          "This datanode has not been configured to support " +
          "short-circuit shared memory segments.");
      return;
    } catch (IOException e) {
      sendShmErrorResponse(ERROR,
          "Failed to create shared file descriptor: " + e.getMessage());
      return;
    }
    sendShmSuccessResponse(sock, shmInfo);
    ...
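To make the handoff in the second step concrete, here is a simplified, self-contained model of the wait inside {{DomainSocketWatcher#add}}: the requesting thread registers an entry and blocks on a condition until the watcher thread acknowledges it. The class and method names are illustrative stand-ins, not the real Hadoop types.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified model of the handoff in DomainSocketWatcher#add: the caller
// (the DataXceiver thread) blocks until the watcher thread has picked up
// the new socket/shm entry. Names are illustrative, not Hadoop's.
public class AddHandoff {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition processed = lock.newCondition();
    private boolean added = false;

    // Called by the requesting thread; blocks until the watcher acknowledges.
    // Corresponds to the wait inside watcher.add(sock, shm).
    public void add() {
        lock.lock();
        try {
            while (!added) {
                processed.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }

    // Called by the watcher thread once it has registered the fd in its poll set.
    public void markAdded() {
        lock.lock();
        try {
            added = true;
            processed.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public boolean isAdded() {
        lock.lock();
        try {
            return added;
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the model is that {{add}} can only return once the watcher has signalled; if the waiting thread wakes up in any other state, something has gone wrong with the handoff, which is what the analysis above suspects.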
{code}
- At this point, the call to {{sendShmSuccessResponse}} gets an exception:
{noformat}
2015-04-04 13:12:30,973 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_attempt_1427231924849_569269_m_002116_0_-161414780_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: a2d3bac0-e98b-4b73-a5a1-82c7eb557a7a, success: false
2015-04-04 13:12:30,984 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] ERROR datanode.DataNode: host.domain.com:1004:DataXceiver error processing REQUEST_SHORT_CIRCUIT_SHM operation src: unix:/home/gs/var/run/hdfs/dn_socket dst:
java.net.SocketException: write(2) error: Broken pipe
	at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
	at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
	at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:601)
	at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
	at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
	at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:380)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:418)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
	at java.lang.Thread.run(Thread.java:722)
{noformat}
- The exception bubbles back up to {{DataXceiver#requestShortCircuitShm}}, which cleans up, closing the socket:
{code}
    ...
    if ((!success) && (peer == null)) {
      // If we failed to pass the shared memory segment to the client,
      // close the UNIX domain socket now. This will trigger the
      // DomainSocketWatcher callback, cleaning up the segment.
      IOUtils.cleanup(null, sock);
    }
{code}
- Then, the main {{DomainSocketWatcher}} thread wakes up (after the regular timeout interval has expired) and tries to call {{sendCallbackAndRemove}}, which encounters the following {{IllegalArgumentException}}:
{code}
  final Thread watcherThread = new Thread(new Runnable() {
    ...
      while (true) {
        lock.lock();
        try {
          for (int fd : fdSet.getAndClearReadableFds()) {
            sendCallbackAndRemove("getAndClearReadableFds", entries,
                fdSet, fd);
          }
    ...
{code}
{noformat}
ERROR unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@76845081 terminating on Throwable
java.lang.IllegalArgumentException: DomainSocketWatcher(103231254): file descriptor 249 was closed while still in the poll(2) loop.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:421)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallbackAndRemove(DomainSocketWatcher.java:448)
	at org.apache.hadoop.net.unix.DomainSocketWatcher.access$500(DomainSocketWatcher.java:52)
	at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:470)
	at java.lang.Thread.run(Thread.java:745)
{noformat}
I welcome your suggestions and look forward to your feedback.
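One direction this suggests (purely a sketch on my part, with stand-in types rather than the real {{DomainSocketWatcher}} code): if a per-entry callback failure is caught and recorded instead of propagating, the loop finishes and the trailing cleanup always runs, so a single bad fd cannot leave the rest of the entries and the fd set behind.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of hardening the cleanup loop: catch per-entry failures from the
// close callback so one failing fd does not abort the remaining cleanup.
// CloseCallback, closeAll, and the map of fds are illustrative stand-ins,
// not the actual DomainSocketWatcher types.
public class RobustCleanup {
    interface CloseCallback {
        void close(int fd);
    }

    // Invokes every callback, collecting the fds whose callbacks threw,
    // and always clears the entry map afterwards.
    static List<Integer> closeAll(Map<Integer, CloseCallback> entries) {
        List<Integer> failed = new ArrayList<>();
        for (Map.Entry<Integer, CloseCallback> e : entries.entrySet()) {
            try {
                e.getValue().close(e.getKey());
            } catch (RuntimeException ex) {
                // Mirrors the failure mode in this JIRA: an
                // IllegalStateException/IllegalArgumentException from one
                // callback must not skip the cleanup of the others.
                failed.add(e.getKey());
            }
        }
        entries.clear(); // always reached, unlike the original finally block
        return failed;
    }
}
```

Usage would be to call {{closeAll}} where the shutdown path currently iterates {{entries.values()}}, then log the failed fds; whether that is the right fix for the fd-closed-while-polling race is exactly the open question above.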
Thanks,
Eric

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of {{DomainSocketWatcher#watcherThread}}, the call to {{sendCallback}} can encounter an {{IllegalStateException}} and leave some cleanup tasks undone.
> {code}
>     } finally {
>       lock.lock();
>       try {
>         kick(); // allow the handler for notificationSockets[0] to read a byte
>         for (Entry entry : entries.values()) {
>           // We do not remove from entries as we iterate, because that can
>           // cause a ConcurrentModificationException.
>           sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>         }
>         entries.clear();
>         fdSet.close();
>       } finally {
>         lock.unlock();
>       }
>     }
> {code}
> The exception causes {{watcherThread}} to skip the calls to {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
> 	at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
> 	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
> 	at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
> 	at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
> 	at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
> 	at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
> 	at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or HADOOP-10404. The cluster installation is running code with all of these fixes.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)