Subject: [jira] [Commented] (HADOOP-9229) IPC: Retry on connection reset or socket timeout during SASL negotiation
From: "Suresh Srinivas (JIRA)"
To: common-issues@hadoop.apache.org
Date: Fri, 18 Jan 2013 17:16:13 +0000 (UTC)

[ https://issues.apache.org/jira/browse/HADOOP-9229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557360#comment-13557360 ]

Suresh Srinivas commented on HADOOP-9229:
-----------------------------------------

bq. we should allow retry when connection reset or socket timeout happens in this stage.

I know that in large clusters it is possible to hit a condition where too many clients connect to the master servers, such as the namenode, and overload them. The question is how we want to handle this condition. There are two possible ways to look at the solution:
# The overload condition is unexpected, hence the current behavior of degraded service, where clients get disconnected, could be the right behavior.
# If the load is something the namenode should handle, hence not an overload condition, we should look at scaling the number of connections the namenode can accept. There are things that can be tuned here: the number of RPC handlers, the queue depth per RPC handler, etc. (see the sketch below). If that is not sufficient, we may have to make further changes to scale connection handling.

One concern I have with retry: if an overload condition results in clients getting dropped, retry will keep the overload going for a longer duration and make the situation worse.
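To make the second option concrete, here is a minimal sketch of the tuning knobs involved, set through the Hadoop Configuration API. The key names are the usual ipc/dfs ones, but whether each exists, and its default, varies by release; the values below are placeholders, not recommendations.

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RpcTuningSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Number of RPC handler threads serving calls on the namenode.
    conf.setInt("dfs.namenode.handler.count", 64);

    // Call-queue depth per handler; the total call queue is roughly
    // handler count * this value.
    conf.setInt("ipc.server.handler.queue.size", 100);

    // TCP accept backlog of the IPC listener. Connections arriving
    // faster than the server accepts them overflow this listen queue,
    // which is the failure mode described in the issue below.
    conf.setInt("ipc.server.listen.queue.size", 128);
  }
}
{code}

In practice the same keys would be set in hdfs-site.xml and core-site.xml rather than in code; the snippet just names the knobs in one place.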
> IPC: Retry on connection reset or socket timeout during SASL negotiation
> -------------------------------------------------------------------------
>
>                 Key: HADOOP-9229
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9229
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: ipc
>    Affects Versions: 3.0.0, 2.0.3-alpha, 0.23.7
>            Reporter: Kihwal Lee
>
> When an RPC server is overloaded, incoming connections may not get accepted in time, causing the listen queue to overflow. The impact on clients varies with the OS in use. On Linux, connections in this state look fully connected to the clients, but they have no buffers, so any data sent to the server gets dropped.
> This won't be a problem for protocols where the client first waits for the server's greeting. Even for client-speaks-first protocols, it will be fine if the overload is transient and such connections are accepted before the retransmissions of the dropped packets arrive. Otherwise, clients can hit a socket timeout after several retransmissions. In certain situations, the connection will get reset while the client is still waiting for an ack.
> We have seen this happen to IPC clients during SASL negotiation. Since no call has been sent at that point, we should allow retry when a connection reset or socket timeout happens in this stage.
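A minimal sketch of the retry the description asks for, under stated assumptions: the class, method, and Handshake interface are hypothetical stand-ins, not Hadoop's actual IPC client, and detecting a reset by exception message is a heuristic. It illustrates the point the description makes: because the failure happens before any call is sent, redoing the whole connection setup, SASL handshake included, is safe.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SaslConnectRetrySketch {

  /** Hypothetical stand-in for the SASL negotiation step. */
  interface Handshake {
    void run(Socket s) throws IOException;
  }

  static Socket connectWithRetry(InetSocketAddress addr, Handshake sasl,
                                 int maxRetries, int timeoutMs) throws IOException {
    IOException last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      Socket s = new Socket();
      try {
        s.connect(addr, timeoutMs);
        s.setSoTimeout(timeoutMs);
        sasl.run(s);   // no RPC call has been sent yet, so retrying is safe
        return s;
      } catch (SocketTimeoutException e) {
        last = e;      // server too overloaded to respond in time: retry
      } catch (IOException e) {
        if (!isConnectionReset(e)) {
          closeQuietly(s);
          throw e;     // anything other than a reset is not retried
        }
        last = e;      // connection reset during setup/handshake: retry
      }
      closeQuietly(s);
    }
    throw last;
  }

  /** Heuristic: the JDK surfaces resets as IOExceptions with this message. */
  static boolean isConnectionReset(IOException e) {
    String m = e.getMessage();
    return m != null && m.toLowerCase().contains("connection reset");
  }

  static void closeQuietly(Socket s) {
    try { s.close(); } catch (IOException ignored) { }
  }
}
{code}

Per the comment above, a sketch like this would also want a retry cap and backoff between attempts, since blind retry against an already overloaded server prolongs the overload.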