lucene-solr-user mailing list archives

From Elaine Cario <etca...@gmail.com>
Subject Re: solrcloud Auto-commit doesn't seem reliable
Date Wed, 21 Mar 2018 22:35:19 GMT
I'm just catching up on reading solr emails, so forgive me for being late
to this dance....

I've just gone through a project to enable CDCR on our Solr, and I also
experienced a small period of time where the commits on the source server
just seemed to stop.  This was during a period of intense experimentation
where I was mucking around with configurations, turning CDCR on/off, etc.
At some point the commits stopped occurring, and it drove me nuts for a
couple of days - tried everything - restarting Solr, reloading, turned
buffering on, turned buffering off, etc.  I finally threw up my hands and
rebooted the server out of desperation (it was a physical Linux box).
Commits worked fine after that.  I don't know what caused the commits to
stop, or why rebooting (and not just restarting Solr) caused them to work
fine.

Wondering if you ever found a solution to your situation?



On Fri, Feb 16, 2018 at 2:44 PM, Webster Homer <webster.homer@sial.com>
wrote:

> I meant to get back to this sooner.
>
> When I say I issued a commit I do issue it as collection/update?commit=true
>
> The soft commit interval is set to 3000, but I don't have a problem with
> soft commits (I think). I was responding
>
> I am concerned that some hard commits don't seem to happen, but I think
> many commits do occur. I'd like suggestions on how to diagnose this, and
> perhaps an idea of where to look. Typically I believe that issues like this
> are from our configuration.
>
> Our indexing job is pretty simple: we send blocks of JSON to
> <collection>/update/json. We either reindex the whole collection or just
> apply updates. Typically we reindex the data once a week and delete any
> records that are older than the last full index. This does lead to a fair
> number of deleted records in the index, especially if commits fail. Most
> of our collections are not large: between 2 and 3 million records.
>
> The collections are hosted in Google Cloud.
>
> On Mon, Feb 12, 2018 at 5:00 PM, Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > bq: But if 3 seconds is aggressive what would be a  good value for soft
> > commit?
> >
> > The usual answer is "as long as you can stand". All top-level caches are
> > invalidated, autowarming is done etc. on each soft commit. That can be a
> > lot of work, and if your users are comfortable with docs not showing up
> > for, say, 10 minutes then use 10 minutes. As always "it depends" here;
> > the point is not to do unnecessary work if possible.
> >
> > bq: If a commit doesn't happen how would there ever be an index merge
> > that would remove the deleted documents.
> >
> > Right, it wouldn't. It's a little more subtle than that though. Segments
> > on various replicas will contain different docs, thus the term/doc
> > statistics can be a bit different between multiple replicas. None of the
> > stats will change until the commit though. You might try turning on
> > distributed doc/term stats though.
> >
> > Your comments about PULL or TLOG replicas are well taken. However, even
> > those won't be absolutely in sync since they'll replicate from the master
> > at slightly different times and _could_ get slightly different segments
> > _if_ there's indexing going on. But let's say you stop indexing. After
> > the next poll interval all the replicas will have identical
> > characteristics and will score the docs the same.
> >
> > I don't have any significant wisdom to offer here, except this is really
> > the first time I've heard of this behavior. About all I can imagine is
> > that _somehow_ the soft commit interval is -1. When you say you "issue a
> > commit" I'm assuming it's via ....collection/update?commit=true or some
> > such which issues a hard commit with openSearcher=true. And it's on a
> > _collection_ basis, right?
> >
> > Sorry I can't be more help
> > Erick
> >
> >
> >
> >
> > On Mon, Feb 12, 2018 at 10:44 AM, Webster Homer <webster.homer@sial.com>
> > wrote:
> > > Erick, I am aware of the CDCR buffering problem causing tlog retention;
> > > we always turn buffering off in our CDCR configurations.
> > >
> > > My post was precipitated by seeing that we had uncommitted data in
> > > collections more than 24 hours after it was loaded. The collections I
> > > was looking at are in our development environment, where we do not use
> > > CDCR. However, I'm pretty sure that I've seen situations in production
> > > where commits were also long overdue.
> > >
> > > The "autoSoftcommit" was a typo. The soft commit logic seems to be
> > > fine; I don't see an issue with data visibility. But if 3 seconds is
> > > aggressive, what would be a good value for soft commit? We have a
> > > couple of collections that are updated every minute, although most of
> > > them are updated much less frequently.
> > >
> > > My reason for raising this commit issue is that we see problems with
> > > the relevancy of SolrCloud searches, and the NRT replica type.
> > > Sometimes the results flip, where the best hit varies by what replica
> > > serviced the search. This is hard to explain to management. Doing an
> > > optimize does address the problem for a while. I try to avoid
> > > optimizing for the reasons you and Sean list. If a commit doesn't
> > > happen, how would there ever be an index merge that would remove the
> > > deleted documents?
> > >
> > > The problems with deletes and relevancy don't seem to occur when we
> > > use TLOG replicas, probably because they don't do their own indexing
> > > but get copies from their leader. We are testing them now; eventually
> > > we may abandon the use of NRT replicas for most of our collections.
> > >
> > > I am quite concerned about this commit issue. What kinds of things
> > > would influence whether a commit occurs? One commonality for our
> > > systems is that they are hosted in a Google cloud. We have a number of
> > > collections that share configurations, but others that do not. I think
> > > commits do happen, but I don't trust that autoCommit is reliable. What
> > > can we do to make it reliable?
> > >
> > > Most of our collections are reindexed weekly with partial updates
> > > applied daily. That at least is what happens in production; our
> > > development clouds are not as regular.
> > >
> > > Our solr startup script sets the following values:
> > > -Dsolr.autoCommit.maxDocs=35000
> > > -Dsolr.autoCommit.maxTime=60000
> > > -Dsolr.autoSoftCommit.maxTime=3000
> > >
> > > I don't think we reference solr.autoCommit.maxDocs in our
> > > solrconfig.xml files.
> > >
> > > Here are our settings for autoCommit and autoSoftCommit.
> > >
> > > We had a lot of issues with missing commits when we didn't set
> > > solr.autoCommit.maxTime:
> > >      <autoCommit>
> > >        <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
> > >        <openSearcher>false</openSearcher>
> > >      </autoCommit>
> > >
> > >      <autoSoftCommit>
> > >        <maxTime>${solr.autoSoftCommit.maxTime:5000}</maxTime>
> > >      </autoSoftCommit>
> > >
> > >
> > >
> > > On Fri, Feb 9, 2018 at 3:49 PM, Shawn Heisey <apache@elyograg.org>
> > > wrote:
> > >
> > >> On 2/9/2018 9:29 AM, Webster Homer wrote:
> > >>
> > >>> A little more background. Our production SolrClouds are populated via
> > >>> CDCR. CDCR does not replicate commits; commits to the target clouds
> > >>> happen via autoCommit settings.
> > >>>
> > >>> We see relevancy scores get inconsistent when there are too many
> > >>> deletes, which seems to happen when hard commits don't happen.
> > >>>
> > >>> On Fri, Feb 9, 2018 at 10:25 AM, Webster Homer <webster.homer@sial.com>
> > >>> wrote:
> > >>>
> > >>>> We do have autoSoftcommit set to 3 seconds. It is NOT the visibility
> > >>>> of the records that is my primary concern. What I am concerned about
> > >>>> is the accumulation of uncommitted tlog files and the large number of
> > >>>> deleted documents.
> > >>>>
> > >>>
> > >> For the deleted documents:  Have you ever done an optimize on the
> > >> collection?  If so, you're going to need to re-do the optimize
> > >> regularly to keep deleted documents from growing out of control.  See
> > >> this issue for a very technical discussion about it:
> > >>
> > >> https://issues.apache.org/jira/browse/LUCENE-7976
> > >>
> > >> Deleted documents probably aren't really related to what we've been
> > >> discussing.  That shouldn't really be strongly affected by commit
> > >> settings.
> > >>
> > >> -----
> > >>
> > >> A 3 second autoSoftCommit is VERY aggressive.  If your soft commits
> > >> are taking longer than 3 seconds to complete, which is often what
> > >> happens, then that will lead to problems.  I wouldn't expect it to
> > >> cause the kinds of problems you describe, though.  It would manifest
> > >> as Solr working too hard, logging warnings or errors, and changes
> > >> taking too long to show up.
> > >>
> > >> Assuming that the config for autoSoftCommit doesn't have the typo that
> > >> Erick mentioned.
> > >>
> > >> ----
> > >>
> > >> I have never used CDCR, so I know very little about it.  But I have
> > >> seen reports on this mailing list saying that transaction logs never
> > >> get deleted when CDCR is configured.
> > >>
> > >> Below is a link to a mailing list discussion related to CDCR not
> > >> deleting transaction logs.  Looks like for it to work right a buffer
> > >> needs to be disabled, and there may also be problems caused by not
> > >> having a complete zkHost string in the CDCR config:
> > >>
> > >> http://lucene.472066.n3.nabble.com/CDCR-how-to-deal-with-the-transaction-log-files-td4345062.html
> > >>
> > >> Erick also mentioned this.
> > >>
> > >> Thanks,
> > >> Shawn
> > >>
> > >
> > > --
> > >
> > >
> > > This message and any attachment are confidential and may be privileged
> > > or otherwise protected from disclosure. If you are not the intended
> > > recipient, you must not copy this message or attachment or disclose the
> > > contents to any other person. If you have received this transmission in
> > > error, please notify the sender immediately and delete the message and
> > > any attachment from your system. Merck KGaA, Darmstadt, Germany and any
> > > of its subsidiaries do not accept liability for any omissions or errors
> > > in this message which may arise as a result of E-Mail-transmission or
> > > for damages resulting from any unauthorized changes of the content of
> > > this message and any attachment thereto. Merck KGaA, Darmstadt, Germany
> > > and any of its subsidiaries do not guarantee that this message is free
> > > of viruses and does not accept liability for any damages caused by any
> > > virus transmitted therewith.
> > >
> > > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > > Spanish and Portuguese versions of this disclaimer.
> >
>
>
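
For anyone checking which commit intervals are actually in effect: the
solrconfig.xml quoted above uses ${property:default} placeholders, so the
value depends on whether the startup script sets the property.  Below is a
minimal Python sketch of that substitution rule, not Solr's own code.  The
resolve() helper is hypothetical, and raising an error when neither a
property nor a default exists is an assumption made for illustration.  It
uses only the property names and values quoted in the thread.

```python
import re

# Matches ${property.name:default} placeholders; the ":default" part is optional.
_PLACEHOLDER = re.compile(r"\$\{([^:}]+)(?::([^}]*))?\}")


def resolve(text, props):
    """Replace each placeholder with props[name], else with its default.

    Hypothetical helper for illustration only; it is not part of Solr.
    """
    def _sub(match):
        name, default = match.group(1), match.group(2)
        if name in props:
            return props[name]
        if default is None:
            # Assumption of this sketch: a missing value is treated as an error.
            raise KeyError(f"no value and no default for {name}")
        return default

    return _PLACEHOLDER.sub(_sub, text)


# The -D options from the startup script quoted in the thread.
startup_props = {
    "solr.autoCommit.maxDocs": "35000",
    "solr.autoCommit.maxTime": "60000",
    "solr.autoSoftCommit.maxTime": "3000",
}

hard = resolve("${solr.autoCommit.maxTime:60000}", startup_props)
soft = resolve("${solr.autoSoftCommit.maxTime:5000}", startup_props)
soft_default = resolve("${solr.autoSoftCommit.maxTime:5000}", {})

print(hard, soft, soft_default)  # prints: 60000 3000 5000
```

With the quoted startup options the soft commit interval resolves to 3000 ms
rather than the 5000 ms default in the config, and the hard commit interval
is 60000 ms either way.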
