From: Webster Homer
Date: Thu, 15 Dec 2016 12:36:55 -0600
Subject: Re: Solr Cloud Replica Cores Give different Results for the Same query
To: solr-user@lucene.apache.org
I am trying to find the reported inconsistencies now. The timestamp I have was created by our ETL process, so it may not be in exactly the same order in which the indexing occurred.

When I tried to sort the results by _docid_ desc, Solr threw a 500 error:

{
  "responseHeader":{
    "zkConnected":true,
    "status":500,
    "QTime":7,
    "params":{
      "q":"*:*",
      "indent":"on",
      "fl":"record_spec,s_id,pid,search_concat_pno, search_pno, search_user_term, search_lform, search_eform, search_acronym, search_synonyms, root_name, search_s_pri_name, search_p_pri_name, search_keywords, lookahead_search_terms, sortkey, search_rtecs, search_chem_comp, cas_number, search_component_cas, search_beilstein, search_color_idx, search_ecnumber, search_femanumber, search_isbn, search_mdl_number, search_descriptions, page_title, search_xref_comparable_pno, search_xref_comparable_sku, search_xref_equivalent_pno, search_xref_exact_pno create_date search_xref_exact_sku, score",
      "sort":"_docid_ desc",
      "rows":"20",
      "wt":"json",
      "_":"1481821047026"}},
  "error":{
    "msg":"Index: 1, Size: 0",
    "trace":"java.lang.IndexOutOfBoundsException: Index: 1, Size: 0\n\tat java.util.ArrayList.rangeCheck(ArrayList.java:653)\n\tat java.util.ArrayList.get(ArrayList.java:429)\n\tat org.apache.solr.common.util.NamedList.getVal(NamedList.java:174)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$ShardComparator.sortVal(ShardFieldSortedHitQueue.java:146)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:167)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:159)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:91)\n\tat org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:33)\n\tat org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)\n\tat org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:1098)\n\tat org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:758)\n\tat org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:737)\n\tat org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:428)\n\tat org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:154)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:2089)\n\tat org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:652)\n\tat org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:459)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:208)\n\tat org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)\n\tat org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)\n\tat org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)\n\tat org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)\n\tat org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)\n\tat org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)\n\tat org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)\n\tat org.eclipse.jetty.server.Server.handle(Server.java:518)\n\tat org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)\n\tat org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)\n\tat org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)\n\tat org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)\n\tat org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)\n\tat org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)\n\tat org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)\n\tat java.lang.Thread.run(Thread.java:745)\n",
    "code":500}}
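For context: _docid_ is Lucene's internal, per-core document id, so it is not comparable across shards, and the trace above shows the failure in the distributed merge step (ShardFieldSortedHitQueue in QueryComponent.mergeIds). A per-core query with distrib=false should sidestep that merge; something like the following, where host, port and core name are placeholders in the style of Erick's example further down:

http://host:port/solr/collection_shard1_replica1/select?q=*:*&sort=_docid_+desc&rows=20&distrib=false&wt=json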
On Wed, Dec 14, 2016 at 7:41 PM, Erick Erickson wrote:

> Let's back up a bit. You say "This seems to cause two replicas to return different hits depending upon which one is queried."
>
> OK, _how_ are they different? I've been assuming different numbers of hits. If you're getting the same number of hits but different document ordering, that's a completely different issue and may be easily explainable. If this is true, skip the rest of this message. I only realized we may be using a different definition of "different hits" part way through writing this reply.
>
> ------------------------
>
> Having the timestamp as a string isn't a problem; you can do something very similar with wildcards and the like if it's a string that sorts the same way the timestamp would. And it's best if it's created upstream anyway; that way it's guaranteed to be the same for the doc on all replicas.
>
> If the date is in canonical form (YYYY-MM-DDTHH:MM:SSZ) then a simple copyField to a date field would do the trick.
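A schema sketch of that copyField approach, assuming the string field is the create_date seen in the fl list above; the derived field and type names here are illustrative, not taken from the actual schema:

  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
  <field name="create_date_dt" type="tdate" indexed="true" stored="true" docValues="true"/>
  <copyField source="create_date" dest="create_date_dt"/>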
> But there's no real reason to do any of that. Given that you see this when there's no indexing going on, there's no point to those tests; those were just a way to examine your nodes while there was active indexing.
>
> How do you fix this problem when you see it? If it goes away by itself, that would give at least a start on where to look. If you have to manually intervene, it would be good to know what you do.
>
> The CDCR pattern is that docs go from the leader on the source cluster to the leader on the target cluster. Once the target leader gets the docs, it's supposed to send the doc to all the replicas.
>
> To try to narrow down the issue, next time it occurs can you look at _both_ the source and target clusters and see if they _both_ show the same discrepancy? What I'm looking for is whether both are self-consistent. That is, all the replicas for shardN on the source cluster show the same documents (M), and all the replicas for shardN on the target cluster show the same number of docs (N). I'm not as concerned if M != N at this point. Note I'm looking at the number of hits here, not, say, the document ordering.
>
> To do this you'll have to do the trick I mentioned where you query each replica separately.
>
> And are you absolutely sure that your different results are coming from the _same_ cluster? If you're comparing a query from the source cluster with a query from the target cluster, that's different than if the queries come from the same cluster.
>
> Best,
> Erick
>
> On Wed, Dec 14, 2016 at 2:48 PM, Webster Homer wrote:
> > Thanks for the quick feedback.
> >
> > We are not doing continuous indexing; we do a complete load once a week and then have a daily partial load for any documents that have changed since the load. These partial loads take only a few minutes every morning.
> >
> > The problem is we see this discrepancy long after the data load completes.
> >
> > We have a source collection that uses CDCR to replicate to the target. I see the current=false setting in both the source and target collections. Only the target collection is being heavily searched, so that is where my concern is. So what could cause this kind of issue? Do we have a configuration problem?
> >
> > It doesn't happen all the time, so I don't currently have a reproducible test case, yet.
> >
> > I will see about adding the timestamp; we have one, but it was created as a string, and was generated by our ETL job.
> >
> > On Wed, Dec 14, 2016 at 3:42 PM, Erick Erickson wrote:
> >> The commit points on different replicas will trip at different wall clock times, so the leader and replica may return slightly different results depending on whether doc X was included in the commit on one replica but not on the second. After the _next_ commit interval (2 seconds in your case), doc X will be committed on the second replica: that is, it's not lost.
> >>
> >> Here's a couple of ways to verify:
> >>
> >> 1> Turn off indexing and wait a few seconds. The replicas should have the exact same documents. "A few seconds" is your autocommit (soft in your case) interval + autowarm time. This last is unknown, but you can check your admin/plugins-stats search handler times; it's reported there. Now issue your queries. If the replicas don't report the same docs, that's A Bad Thing and should be worrying. BTW, with a 2 second soft commit interval, which is really aggressive, you _better not_ have very large autowarm intervals!
> >>
> >> 2> Include a timestamp in your docs when they are indexed. There's an automatic way to do that BTW.... now do your queries and append an FQ clause like &fq=timestamp:[* TO some_point_in_the_past]. The replicas should have the same counts unless you are deleting documents. I mention deletes on the off chance that you're deleting documents that fall in the interval and then the same as above could theoretically occur. Updates should be fine.
> >>
> >> BTW, I've seen continuous monitoring of this done by automated scripts. The key is to get the shard URL and ping that with &distrib=false. It'll look something like http://host:port/solr/collection_shard1_replica1.... People usually just use *:* and compare numFound.
> >>
> >> Best,
> >> Erick
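A minimal sketch of the kind of automated check described above, assuming Python 3; the replica core URLs are placeholders and would come from the Cloud view of the admin UI. It compares numFound for *:* across the replicas of one shard, using distrib=false so each core answers only for itself:

import json
import urllib.request

# Replace with the actual replica core URLs for one shard.
REPLICAS = [
    "http://host1:8983/solr/collection_shard1_replica1",
    "http://host2:8983/solr/collection_shard1_replica2",
]

def num_found(core_url):
    # distrib=false keeps the query on this core only, so each replica
    # reports its own count instead of a distributed result.
    url = core_url + "/select?q=*:*&rows=0&wt=json&distrib=false"
    with urllib.request.urlopen(url) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]["numFound"]

counts = {url: num_found(url) for url in REPLICAS}
for url, count in counts.items():
    print(count, url)
if len(set(counts.values())) > 1:
    print("WARNING: replicas of this shard disagree on numFound")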
> >> On Wed, Dec 14, 2016 at 1:10 PM, Webster Homer wrote:
> >> > We are using Solr Cloud 6.2.
> >> >
> >> > We have been noticing an issue where the index in a core shows as current = false.
> >> >
> >> > We have autocommit set for 15 seconds, and soft commit at 2 seconds.
> >> >
> >> > This seems to cause two replicas to return different hits depending upon which one is queried.
> >> >
> >> > What would lead to the indexes not being "current"? The documentation on the meaning of current is vague.
> >> >
> >> > The collections in our cloud have two shards, each with two replicas. I see this with several of the collections.
> >> >
> >> > We don't know how they get like this, but it's troubling.
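For reference, commit settings like the ones described in the original message would normally live under <updateHandler> in solrconfig.xml along these lines. This is only a sketch with the intervals from the thread; openSearcher=false is an assumption, not something stated here:

  <autoCommit>
    <maxTime>15000</maxTime>   <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>2000</maxTime>    <!-- soft commit (searcher visibility) every 2 seconds -->
  </autoSoftCommit>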