Date: Tue, 3 Dec 2019 17:32:00 +0000 (UTC)
From: "Andrzej Bialecki (Jira)"
To: issues@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] [Comment Edited] (SOLR-13975) ConcurrentUpdateSolrClient connection stall prevention

    [ https://issues.apache.org/jira/browse/SOLR-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987097#comment-16987097 ]

Andrzej Bialecki edited comment on SOLR-13975 at 12/3/19 5:31 PM:
------------------------------------------------------------------

This patch adds stall detection logic to the key methods in {{ConcurrentUpdateSolrClient}} as well as {{ConcurrentUpdateHttp2SolrClient}}. I wasn't sure what timeout value to use: in CUSC I used {{connectionTimeout}} and in CUH2SC I used {{client.getIdleTime()}}.


> ConcurrentUpdateSolrClient connection stall prevention
> ------------------------------------------------------
>
>                 Key: SOLR-13975
>                 URL: https://issues.apache.org/jira/browse/SOLR-13975
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 8.3, 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.4
>
>         Attachments: SOLR-13975.patch
>
>
> When a Solr process that hosts replicas of a collection is suspended (that is, the OS process is suspended using e.g. {{kill -STOP }}), a long stall may occur in CUSC until a socket timeout is reached.
> During this stall, updates from the leader are not forwarded to any replica, even though other replicas are still active and can receive updates. If the sender uses CUSC (e.g. via {{CloudSolrClient}}), it stalls as well, because the leader stops processing updates.
> This situation is caused by several issues:
> * When a process is suspended its sockets remain open, so there is no immediate disconnect as there would be if the process had died, but the process becomes unresponsive. Eventually a socket timeout is reached ({{distribUpdateSoTimeout}}), but in the default version of {{solr.xml}} this is set to 10 min. During this time all indexing to that shard is stuck.
> * There are several infinite {{for}} loops in CUSC (e.g. in {{blockUntilFinished}}, {{waitForEmptyQueue}} and even in {{request}}) that rely either on the call succeeding relatively quickly or on an exception being thrown. In this situation neither happens quickly: the call is stuck waiting for the remote end until soTimeout expires.
> This issue proposes to add stall prevention logic that would break these infinite loops long before the socket timeout occurs, based on the progress of the queue processing.
> This is a follow-up to SOLR-13896.
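
For illustration only, here is a minimal sketch of how a wait loop could break out on lack of queue progress rather than waiting for the socket timeout. It is not the code in SOLR-13975.patch. The class {{StallDetector}}, its methods, and the choice of "queue size unchanged for a given time" as the progress measure are all assumptions made for this example.

```java
import java.util.function.IntSupplier;

/**
 * Hypothetical helper, not taken from SOLR-13975.patch.
 * Assumption: the queue counts as stalled when it is non-empty and its
 * size has not changed for at least the configured stall time.
 */
public class StallDetector {
    private final long stallNanos;
    private final IntSupplier queueSize;
    private int lastSize = -1;      // -1 means no observation yet
    private long lastChangeNanos;

    public StallDetector(long stallMillis, IntSupplier queueSize) {
        this.stallNanos = stallMillis * 1_000_000L;
        this.queueSize = queueSize;
    }

    /** Call once per loop iteration, passing the current time in nanoseconds. */
    public boolean isStalled(long nowNanos) {
        int size = queueSize.getAsInt();
        if (size == 0 || size != lastSize) {
            // An empty queue or a changed size counts as progress: restart the clock.
            lastSize = size;
            lastChangeNanos = nowNanos;
            return false;
        }
        return nowNanos - lastChangeNanos >= stallNanos;
    }

    /** Example wait loop that gives up on a stall instead of spinning indefinitely. */
    public static void waitForEmptyQueue(IntSupplier queueSize, long stallMillis)
            throws InterruptedException {
        StallDetector detector = new StallDetector(stallMillis, queueSize);
        while (queueSize.getAsInt() > 0) {
            if (detector.isStalled(System.nanoTime())) {
                throw new IllegalStateException(
                    "Queue made no progress for " + stallMillis + " ms");
            }
            Thread.sleep(10);
        }
    }
}
```

Note that a queue that is filled and drained at the same rate would look stalled to this size-based check, so a real implementation would probably track a processed-item counter instead.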