Message-ID: <18662271.1202413868166.JavaMail.jira@brutus>
Date: Thu, 7 Feb 2008 11:51:08 -0800 (PST)
From: "Raghu Angadi (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-2789) Race condition in ipc.Server prevents responce being written back to client.
In-Reply-To: <3571230.1202335290995.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566759#action_12566759 ]

Raghu Angadi commented on HADOOP-2789:
--------------------------------------

I am still thinking about how to handle Server channels and selectors correctly. A few of the things involved:

# A channel is polled on different selectors: Listener and Responder.
# A channel could be closed by any of the server threads, including the IPC handlers. Closing a channel immediately cancels its keys with both selectors.
# Listener registers the channel with readSelector once, but Responder might register and cancel the channel with writeSelector multiple times, and this registration might happen from any of the handlers. This registration should synchronize correctly with close() from another thread.

The ideal requirement is that all of these behave correctly (i.e. no unexpected close etc.) and efficiently. Hmm, will see. I might be able to attach a simpler patch that just fixes the bug seen here, if that is enough.

> Race condition in ipc.Server prevents responce being written back to client.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2789
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2789
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 0.16.0
>            Reporter: Clint Morgan
>            Assignee: Raghu Angadi
>            Priority: Critical
>             Fix For: 0.16.1
>
>         Attachments: failure-with-patch.log, failure.log, HADOOP-2789.patch, success.log
>
>
> I encountered a race condition in ipc.Server when writing the response
> back to the socket. Sometimes the write SelectionKey is being canceled
> when it should not be, and thus the full response never gets
> written. This results in clients timing out on the socket while
> waiting for the response.
> I am attaching a unit test that demonstrates the problem. It follows
> closely the TestIPC test, however the socket output buffer is set
> smaller than the result being sent back, so that partial writes
> occur. I also put a random sleep in the client to help provoke the race
> condition.
> On my machine this fails over half of the time.
> Looking at the code in ipc.Server.java, the problem manifests in
> Responder.doAsyncWrite(). If I comment out the key.cancel() line, then
> everything works fine.
> So we need to identify when to safely cancel the key.
> I tried the following:
> {noformat}
>     private void doAsyncWrite(SelectionKey key) throws IOException {
>       Call call = (Call)key.attachment();
>       if (call == null) {
>         return;
>       }
>       if (key.channel() != call.connection.channel) {
>         throw new IOException("doAsyncWrite: bad channel");
>       }
>       if (processResponse(call.connection.responseQueue)) {
>         synchronized(call.connection.responseQueue) {
>           if (call.connection.responseQueue.size() == 0) {
>             LOG.info("Cancelling key for call "+call.toString()+ " key: "+ key.toString());
>             key.cancel();          // remove item from selector.
>           } else {
>             LOG.warn("NOT REALLY DONE: "+call.toString()+ " key: "+ key.toString());
>           }
>         }
>       }
>     }
> {noformat}
> And this does catch some of the cases (e.g., the LOG.warn message gets hit), but I still hit the race condition.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
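
On the question raised in the issue of when it is safe to cancel the key, the following is a minimal sketch of one possible approach, not the HADOOP-2789 patch and not Hadoop's ipc.Server code. It assumes that the cancel decision and the handler-side enqueue-and-register step can both be taken under the same lock on the response queue, so that a response enqueued between "queue looks empty" and "cancel" cannot be stranded. The class ResponderSketch, its methods, and the boolean writeRegistered (standing in for a live write SelectionKey) are all hypothetical names used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical model of one connection's response queue (illustration only). */
final class ResponderSketch {
    private final Deque<String> responseQueue = new ArrayDeque<>();
    // Stands in for "the channel has a live key on the write selector".
    private boolean writeRegistered = false;
    // Stands in for bytes that actually reached the socket.
    private final List<String> written = new ArrayList<>();

    /** Handler side: enqueue and (re)register under the queue lock. */
    void enqueue(String response) {
        synchronized (responseQueue) {
            responseQueue.addLast(response);
            writeRegistered = true;
        }
    }

    /**
     * Responder side: write one queued response, then cancel the registration
     * only if the queue is empty. The check and the cancel happen under the
     * same lock that enqueue() takes, so no enqueue can slip in between them.
     */
    void doAsyncWrite() {
        synchronized (responseQueue) {
            String next = responseQueue.pollFirst();
            if (next != null) {
                written.add(next);
            }
            if (responseQueue.isEmpty()) {
                writeRegistered = false;
            }
        }
    }

    boolean isWriteRegistered() {
        synchronized (responseQueue) {
            return writeRegistered;
        }
    }

    List<String> written() {
        synchronized (responseQueue) {
            return new ArrayList<>(written);
        }
    }

    public static void main(String[] args) {
        ResponderSketch conn = new ResponderSketch();
        conn.enqueue("response-1");
        conn.doAsyncWrite();
        System.out.println("registered after drain: " + conn.isWriteRegistered());
    }
}
```

The sketch holds the lock while "writing", which keeps the invariant simple (the registration is live whenever the queue is non-empty) at the cost of blocking handlers during a write. With real channels, close() from another thread and the selector's own key bookkeeping would also have to be coordinated, which is the harder problem described in the comment above and is not addressed here.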