Subject: Re: Socket \ Connection Timeout Values
From: Shawn Heisey
To: solr-user@lucene.apache.org
Date: Thu, 3 Sep 2015 00:35:34 -0600
Message-ID: <55E7EA36.5030207@elyograg.org>

On 9/3/2015 12:06 AM, Arnon Yogev wrote:
> I wanted to ask about the implications of different timeout values one can
> use.
>
> For example:
> From what I see in the code, the default socket timeout value for Solr is
> 0.
> Does that mean Solr nodes will wait to update \ receive update from each
> other without any timeout?

The socket timeout is a property of the TCP connection, which is
ultimately handled by the operating system. Solr uses HTTP, which is a
TCP-based protocol. This is not specific to Solr. A value of zero means
the operating system won't time out and disconnect the TCP session.

Generally you want your servers to have no socket timeout, and depending
on exactly what you are doing, *maybe* you will configure a socket
timeout on the client side. For ZooKeeper, there is no need to have a
socket timeout, as you will see when I continue below.

> In other words, can the following scenario happen:
> 1. One solr node becomes very slow for some reason, but is still
> considered alive in ZK.
> 2. Other servers in the cluster try to update \ receive updates from this
> node, but do not get responses.
> 3. Since there's no timeout defined, all nodes in the cluster will
> eventually become unresponsive (when the thread pool is exhausted).

Even though the socket timeout is generally zero, so the OS won't
terminate idle TCP connections, the application can take care of
timeouts and terminations.

Solr configures a zkClientTimeout. If I remember my last dive into
SolrCloud code correctly, this is transferred pretty much straight
across to the ZooKeeper client as its session timeout. If this timeout
is exceeded on pretty much any inter-server communication, SolrCloud
will generally mark the node down.

Historically there have been a lot of problems with SolrCloud nodes
being marked down due to garbage collection pauses that exceed the
timeout. Since 5.0 this should be less of a problem, because the
included start scripts have aggressive GC tuning.

The zkClientTimeout defaults to 15 seconds internally inside Solr if you
do not have any configuration that sets the value, but most recent Solr
example configurations set it to 30 seconds. In most situations, a 15
second timeout is VERY long ... if that's being exceeded, there is
usually a serious problem that needs fixing.

Thanks,
Shawn
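
To make the "zero means no timeout" point concrete, here is a minimal sketch in plain Java using only java.net. It is not Solr code, and the class and method names (ReadTimeoutDemo, readTimesOut) are made up for the illustration. It connects to a local listener that never sends anything and shows that a read with a nonzero SO_TIMEOUT gives up, whereas the default of 0 would block indefinitely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; nothing here is taken from Solr or SolrJ.
class ReadTimeoutDemo {

    /**
     * Connects to a loopback listener that never writes any data, then
     * attempts a read with the given SO_TIMEOUT (milliseconds).
     * Returns true if the read gave up with a SocketTimeoutException.
     * Do not pass 0 here: 0 means "wait forever", so the read would hang.
     */
    static boolean readTimesOut(int soTimeoutMillis) {
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(),
                                        silentServer.getLocalPort())) {
            // A freshly created java.net.Socket reports SO_TIMEOUT = 0 (no timeout).
            if (client.getSoTimeout() != 0) {
                throw new IllegalStateException("expected default SO_TIMEOUT of 0");
            }
            client.setSoTimeout(soTimeoutMillis);
            try {
                client.getInputStream().read(); // blocks: the peer never sends data
                return false;
            } catch (SocketTimeoutException e) {
                return true;                    // the client-side timeout fired
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

Whether a client-side timeout like this is appropriate depends on what the client is doing; a value shorter than the slowest legitimate request will cut off requests that would otherwise have succeeded.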