From: Shawn Heisey
Date: Tue, 19 Nov 2013 08:35:23 -0700
To: solr-user@lucene.apache.org
Subject: Re: Question regarding possibility of data loss

On 11/19/2013 6:18 AM, adfel70 wrote:
> Hi, we plan to establish an ensemble of solr with zookeeper.
> We gonna have 6 solr servers with 2 instances on each server, also we'll
> have 6 shards with replication factor 2, in addition we'll have 3
> zookeepers.

You'll want to do one Solr instance per machine.  Each Solr instance can
house many cores (shard replicas).  Running more than one instance per
machine will:

1) Add memory/CPU overhead.
2) Easily and accidentally lead to multiple replicas of a single shard
being located on the same machine.

> Our concern is that we will send documents to index and solr won't index
> them but won't send any error message and we will suffer a data loss
>
> 1. Is there any situation that can cause this kind of problem?
> 2. Can it happen if some of ZKs are down? or some of the solr instances?
> 3. How can we monitor them? Can we do something to prevent these kind of
> errors?
1) If it ever does become possible for data loss to occur without
notifying your application, it will be considered a very serious bug,
and top priority will be given to fixing it.  A release with the fix
will be made as quickly as possible.  Of course I cannot guarantee that
such bugs don't exist, but I am not aware of any at the moment.

2) You must have a majority (floor(n/2) + 1) of zookeepers operational.
If you have three or four zookeepers, one zookeeper can be down and
SolrCloud will continue to function perfectly.  With five or six
zookeepers, two can be down.  With seven or eight, three can be down.

As far as Solr itself, if one replica of each shard in a collection is
working, then the entire collection will work.  That means you'll want
at least replicationFactor=2, so there are two copies of each shard.

3) There are MANY options for monitoring.  Many of them are completely
free, and it is always possible to write your own.  One high-level thing
you can do is make sure the hosts are up and that they are running the
proper number of java processes.  Solr offers a number of API entry
points that will tell you how things are working, and more are added
over time.  I don't think there are any zookeeper-specific informational
capabilities at the moment, but I did file a bug report asking for the
feature.  When I have some time, I will work on a fix for it.  One of
the other committers may decide to work on it as well.

If you want out-of-the-box Solr-specific monitoring and are willing to
pay for it, Sematext offers SPM.  One of Sematext's employees is very
active on this list, and they just added ZooKeeper monitoring to their
capabilities.  They do have a free version, but it has extremely limited
monitoring history.

http://sematext.com/

Thanks,
Shawn
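[Editor's note: the quorum arithmetic in answer 2 can be sketched as a
small helper.  This is a minimal illustration of the standard ZooKeeper
majority rule (floor(n/2) + 1 nodes must be up); the function name is
ours, not part of any Solr or ZooKeeper API.]

```python
def tolerated_failures(ensemble_size: int) -> int:
    """How many ZooKeeper nodes can fail before quorum is lost.

    ZooKeeper needs a strict majority of the ensemble running:
    quorum = floor(n/2) + 1, so n - quorum nodes may be down.
    """
    quorum = ensemble_size // 2 + 1
    return ensemble_size - quorum

# Matches the answer above: 3 or 4 nodes tolerate 1 failure,
# 5 or 6 tolerate 2, 7 or 8 tolerate 3.
for n in range(3, 9):
    print(f"{n} zookeepers: quorum={n // 2 + 1}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Note that an even-sized ensemble tolerates no more failures than the
odd size below it, which is why odd ensemble sizes (3, 5, 7) are the
usual recommendation.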
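[Editor's note: as one concrete form of the "write your own monitoring"
suggested in answer 3, here is a sketch that polls each Solr node's
per-core ping handler (`/solr/<core>/admin/ping`).  The host list and
core name are hypothetical placeholders; substitute your own.]

```python
import json
import urllib.request

# Hypothetical hosts and core name -- replace with your own deployment.
SOLR_HOSTS = ["solr1:8983", "solr2:8983", "solr3:8983"]
CORE = "collection1"


def ping_ok(body: bytes) -> bool:
    """Parse a ping-handler JSON response; "status": "OK" means healthy."""
    try:
        return json.loads(body).get("status") == "OK"
    except ValueError:
        return False


def ping(host: str, core: str, timeout: float = 5.0) -> bool:
    """Hit the core's ping handler; False on any network or HTTP error."""
    url = f"http://{host}/solr/{core}/admin/ping?wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ping_ok(resp.read())
    except OSError:
        return False


def check_cluster() -> dict:
    """Map each host to True/False depending on ping success."""
    return {host: ping(host, CORE) for host in SOLR_HOSTS}
```

A cron job or monitoring agent could call check_cluster() and alert when
any host reports False; combined with replicationFactor=2, a single
False is degraded but not an outage.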