From: Webster Homer
Date: Thu, 15 Dec 2016 12:36:55 -0600
Subject: Re: Solr Cloud Replica Cores Give different Results for the Same query
To: solr-user@lucene.apache.org
I am trying to find the reported inconsistencies now. The timestamp I have was created by our ETL process, so it may not be in exactly the same order in which the indexing occurred.

When I tried to sort the results by _docid_ desc, Solr threw a 500 error:

{
  "responseHeader":{
    "zkConnected":true,
    "status":500,
    "QTime":7,
    "params":{
      "q":"*:*",
      "indent":"on",
      "fl":"record_spec,s_id,pid,search_concat_pno, search_pno, search_user_term, search_lform, search_eform, search_acronym, search_synonyms, root_name, search_s_pri_name, search_p_pri_name, search_keywords, lookahead_search_terms, sortkey, search_rtecs, search_chem_comp, cas_number, search_component_cas, search_beilstein, search_color_idx, search_ecnumber, search_femanumber, search_isbn, search_mdl_number, search_descriptions, page_title, search_xref_comparable_pno, search_xref_comparable_sku, search_xref_equivalent_pno, search_xref_exact_pno create_date search_xref_exact_sku, score",
      "sort":"_docid_ desc",
      "rows":"20",
      "wt":"json",
      "_":"1481821047026"}},
  "error":{
    "msg":"Index: 1, Size: 0",
    "trace":"java.lang.IndexOutOfBoundsException: Index: 1, Size: 0\n\tat java.util.ArrayList.rangeCheck(ArrayList.java:653)\n\tat java.util.ArrayList.get(ArrayList.java:429)\n\tat org.apache.solr.common.util.NamedList.getVal(NamedList.java:174)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardFieldSortedHitQueue.java:146)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:167)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:159)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:91)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:33)\n\tat org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)\n\tat org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:1098)\n\tat org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:758)\n\tat org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:737)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:428)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat java.lang.Thread.run(Thread.java:745)\n",
    "code":500}}
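For context: _docid_ is Lucene's internal, per-core document id, so it is not comparable across shards, and the trace above shows the failure in the distributed merge step (ShardFieldSortedHitQueue in QueryComponent.mergeIds). A per-core query with distrib=false should sidestep that merge; something like the following, where host, port and core name are placeholders in the style of Erick's example further down:

http://host:port/solr/collection_shard1_replica1/select?q=*:*&sort=_docid_+desc&rows=20&distrib=false&wt=json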
On Wed, Dec 14, 2016 at 7:41 PM, Erick Erickson wrote:

> Let's back up a bit. You say "This seems to cause two replicas to return different hits depending upon which one is queried."
>
> OK, _how_ are they different? I've been assuming different numbers of hits. If you're getting the same number of hits but different document ordering, that's a completely different issue and may be easily explainable. If this is true, skip the rest of this message. I only realized we may be using a different definition of "different hits" part way through writing this reply.
>
> ------------------------
>
> Having the timestamp as a string isn't a problem; you can do something very similar with wildcards and the like if it's a string that sorts the same way the timestamp would. And it's best if it's created upstream anyway; that way it's guaranteed to be the same for the doc on all replicas.
>
> If the date is in canonical form (YYYY-MM-DDTHH:MM:SSZ) then a simple copyField to a date field would do the trick.
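A schema sketch of that copyField approach, assuming the string field is the create_date seen in the fl list above; the derived field and type names here are illustrative, not taken from the actual schema:

  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
  <field name="create_date_dt" type="tdate" indexed="true" stored="true" docValues="true"/>
  <copyField source="create_date" dest="create_date_dt"/>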
> But there's no real reason to do any of that. Given that you see this when there's no indexing going on, there's no point to those tests; those were just a way to examine your nodes while there was active indexing.
>
> How do you fix this problem when you see it? If it goes away by itself, that would give at least a start on where to look. If you have to manually intervene, it would be good to know what you do.
>
> The CDCR pattern is that docs go from the leader on the source cluster to the leader on the target cluster. Once the target leader gets the docs, it's supposed to send the doc to all the replicas.
>
> To try to narrow down the issue, next time it occurs can you look at _both_ the source and target clusters and see if they _both_ show the same discrepancy? What I'm looking for is whether both are self-consistent. That is, all the replicas for shardN on the source cluster show the same documents (M), and all the replicas for shardN on the target cluster show the same number of docs (N). I'm not as concerned if M != N at this point. Note I'm looking at the number of hits here, not, say, the document ordering.
>
> To do this you'll have to do the trick I mentioned where you query each replica separately.
>
> And are you absolutely sure that your different results are coming from the _same_ cluster? If you're comparing a query from the source cluster with a query from the target cluster, that's different than if the queries come from the same cluster.
>
> Best,
> Erick
>
> On Wed, Dec 14, 2016 at 2:48 PM, Webster Homer wrote:
> > Thanks for the quick feedback.
> >
> > We are not doing continuous indexing; we do a complete load once a week and then have a daily partial load for any documents that have changed since the load. These partial loads take only a few minutes every morning.
> >
> > The problem is we see this discrepancy long after the data load completes.
> >
> > We have a source collection that uses CDCR to replicate to the target. I see the current=false setting in both the source and target collections. Only the target collection is being heavily searched, so that is where my concern is. So what could cause this kind of issue? Do we have a configuration problem?
> >
> > It doesn't happen all the time, so I don't currently have a reproducible test case, yet.
> >
> > I will see about adding the timestamp; we have one, but it was created as a string, and was generated by our ETL job.
> >
> > On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson wrote:
> >> The commit points on different replicas will trip at different wall clock times, so the leader and replica may return slightly different results depending on whether doc X was included in the commit on one replica but not on the second. After the _next_ commit interval (2 seconds in your case), doc X will be committed on the second replica: that is, it's not lost.
> >>
> >> Here's a couple of ways to verify:
> >>
> >> 1> Turn off indexing and wait a few seconds. The replicas should have the exact same documents. "A few seconds" is your autocommit (soft in your case) interval + autowarm time. This last is unknown, but you can check your admin/plugins-stats search handler times; it's reported there. Now issue your queries. If the replicas don't report the same docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft commit interval, which is really aggressive, you _better not_ have very large autowarm intervals!
> >>
> >> 2> Include a timestamp in your docs when they are indexed. There's an automatic way to do that BTW.... now do your queries and append an FQ clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas should have the same counts unless you are deleting documents. I mention deletes on the off chance that you're deleting documents that fall in the interval and then the same as above could theoretically occur. Updates should be fine.
> >>
> >> BTW, I've seen continuous monitoring of this done by automated scripts. The key is to get the shard URL and ping that with &distrib=false. It'll look something like http://host:port/solr/collection_shard1_replica1.... People usually just use *:* and compare numFound.
> >>
> >> Best,
> >> Erick
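A minimal sketch of the kind of automated check described above, assuming Python 3; the replica core URLs are placeholders and would come from the Cloud view of the admin UI. It compares numFound for *:* across the replicas of one shard, using distrib=false so each core answers only for itself:

import json
import urllib.request

# Replace with the actual replica core URLs for one shard.
REPLICAS = [
    "http://host1:8983/solr/collection_shard1_replica1",
    "http://host2:8983/solr/collection_shard1_replica2",
]

def num_found(core_url):
    # distrib=false keeps the query on this core only, so each replica
    # reports its own count instead of a distributed result.
    url = core_url + "/select?q=*:*&rows=0&wt=json&distrib=false"
    with urllib.request.urlopen(url) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]["numFound"]

counts = {url: num_found(url) for url in REPLICAS}
for url, count in counts.items():
    print(count, url)
if len(set(counts.values())) > 1:
    print("WARNING: replicas of this shard disagree on numFound")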
> >> On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer wrote:
> >> > We are using Solr Cloud 6.2.
> >> >
> >> > We have been noticing an issue where the index in a core shows as current = false.
> >> >
> >> > We have autocommit set for 15 seconds, and soft commit at 2 seconds.
> >> >
> >> > This seems to cause two replicas to return different hits depending upon which one is queried.
> >> >
> >> > What would lead to the indexes not being "current"? The documentation on the meaning of current is vague.
> >> >
> >> > The collections in our cloud have two shards, each with two replicas. I see this with several of the collections.
> >> >
> >> > We don't know how they get like this, but it's troubling.
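For reference, commit settings like the ones described in the original message would normally live under <updateHandler> in solrconfig.xml along these lines. This is only a sketch with the intervals from the thread; openSearcher=false is an assumption, not something stated here:

  <autoCommit>
    <maxTime>15000</maxTime>   <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>2000</maxTime>    <!-- soft commit (searcher visibility) every 2 seconds -->
  </autoSoftCommit>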