lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: BufferedUpdateStreams breaks high performance indexing
Date Fri, 29 Jul 2016 13:04:58 GMT
The deleted terms accumulate whenever you use updateDocument(Term, Doc), or
when you do deleteDocuments(Term).
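
As a rough illustration of that difference, here is a toy model in plain Java. It is not Lucene code: the class BufferedDeletesSketch and its fields are hypothetical stand-ins that only mimic the buffering described above (updateDocument and deleteDocuments each record a delete term, addDocument records none).

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only, NOT Lucene's implementation. All names here are hypothetical.
class BufferedDeletesSketch {
    // Stand-in for the writer's buffer of pending delete terms.
    private final Map<String, Integer> bufferedDeleteTerms = new HashMap<>();
    private int bufferedDocs = 0;

    // Models addDocument(doc): only the document itself is buffered.
    void addDocument(String doc) {
        bufferedDocs++;
    }

    // Models updateDocument(term, doc): a delete-by-term is buffered first,
    // then the new document is buffered.
    void updateDocument(String idTerm, String doc) {
        bufferedDeleteTerms.merge(idTerm, 1, Integer::sum);
        bufferedDocs++;
    }

    // Models deleteDocuments(term): only a delete term is buffered.
    void deleteDocuments(String idTerm) {
        bufferedDeleteTerms.merge(idTerm, 1, Integer::sum);
    }

    int pendingDeleteTerms() {
        return bufferedDeleteTerms.size();
    }

    int bufferedDocs() {
        return bufferedDocs;
    }

    public static void main(String[] args) {
        BufferedDeletesSketch adds = new BufferedDeletesSketch();
        BufferedDeletesSketch updates = new BufferedDeletesSketch();
        for (int i = 0; i < 1000; i++) {
            String id = "id-" + i; // every id is unique, as in an add-only load
            adds.addDocument("doc " + i);
            updates.updateDocument(id, "doc " + i);
        }
        System.out.println("addDocument    -> pending delete terms: " + adds.pendingDeleteTerms());
        System.out.println("updateDocument -> pending delete terms: " + updates.pendingDeleteTerms());
    }
}
```

In this model an add-only load through updateDocument still leaves one pending delete term per unique id, while the same load through addDocument leaves none.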

Deleted queries are when you delete by query, but I don't think DIH would
be doing that unless you asked it to ... maybe a Solr user/dev knows better?
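
To check how many deleted terms and deleted queries each packet carries, the counts can be pulled out of the "push deletes" infoStream lines quoted below. This is only a sketch: the regular expression is inferred from the single sample line in this thread (with its mail line-wraps joined), the class name PushDeletesLine is made up, and other Lucene versions may log a different format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an assumption based on one sample line.
class PushDeletesLine {
    private static final Pattern PUSH = Pattern.compile(
        "push deletes\\s+(\\d+) deleted terms \\(unique count=(\\d+)\\)"
        + "\\s+(\\d+) deleted queries\\s+bytesUsed=(\\d+)");

    // Returns {deletedTerms, uniqueTerms, deletedQueries, bytesUsed},
    // or null if the line is not a "push deletes" line in the assumed format.
    static long[] parse(String line) {
        Matcher m = PUSH.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new long[] {
            Long.parseLong(m.group(1)),
            Long.parseLong(m.group(2)),
            Long.parseLong(m.group(3)),
            Long.parseLong(m.group(4))
        };
    }

    public static void main(String[] args) {
        // Sample taken from the quoted message, re-joined onto one line.
        String line = "BD 0 [...] push deletes 24345 deleted terms (unique count=24345)"
            + " 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69"
            + " totBytesUsed=262526720";
        long[] c = parse(line);
        System.out.println("deleted terms=" + c[0] + " unique=" + c[1]
            + " deleted queries=" + c[2] + " bytesUsed=" + c[3]);
    }
}
```

A non-zero deleted-queries count on these lines would suggest that something is issuing delete-by-query calls alongside the updates.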

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 29, 2016 at 3:21 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Yes, with the default of 10 it performs very much better.
> I didn't take into account that DIH uses updateDocument for adding new
> documents, but after thinking about the "why" I assume that
> this might be because you don't know if a document already exists in the
> index.
> Conclusion: using DIH and setting segmentsPerTier to a high value is a
> killer.
>
> One question still remains about messages in INFOSTREAM, I have lines
> saying
> BD 0 [...] push deletes 24345 deleted terms (unique count=24345) 24345 deleted queries bytesUsed=2313024 delGen=2265 packetCount=69 totBytesUsed=262526720
> ...
> BD 0 [...] seg=_xt(4.10.4):C50486 segGen=2370 segDeletes=[ 97145 deleted terms (unique count=0) 97142 deleted queries bytesUsed=3108576]; coalesced deletes=[CoalescedUpdates(termSets1,queries=75721,numericDVUpdates=0,binaryDVUpdates=0)] newDelCount=0
>
> Do you know what these deleted terms and deleted queries are?
>
> Best regards,
> Bernd
>
>
> On 28.07.2016 at 17:34, Michael McCandless wrote:
> > Hmm, your merge policy changes are dangerous: they will cause too many
> > segments in the index, which makes applying deletes take longer.
> >
> > Can you revert that and re-test?
> >
> > I'm not sure why DIH is using updateDocument instead of addDocument ...
> > maybe ask on the solr-user list?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
> > bernd.fehling@uni-bielefeld.de> wrote:
> >
> >> Currently I use concurrent DIH but will write some SolrJ for testing
> >> or even as a replacement for DIH.
> >> I don't know what DIH does behind the scenes if only documents are added.
> >>
> >> I haven't tried any newer release yet, but after reading LUCENE-6161 I
> >> really should.
> >> At least a version > 5.1.
> >> Maybe before writing any SolrJ.
> >>
> >>
> >> Yes, IndexWriterConfig is changed from the defaults:
> >> <indexConfig>
> >>     <maxIndexingThreads>8</maxIndexingThreads>
> >>     <ramBufferSizeMB>1024</ramBufferSizeMB>
> >>     <maxBufferedDocs>-1</maxBufferedDocs>
> >>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
> >>       <int name="maxMergeAtOnce">8</int>
> >>       <int name="segmentsPerTier">100</int>
> >>       <int name="maxMergedSegmentMB">512</int>
> >>     </mergePolicy>
> >>     <mergeFactor>8</mergeFactor>
> >>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
> >>     <lockType>${solr.lock.type:native}</lockType>
> >>     ...
> >> </indexConfig>
> >>
> >> An example of a unique id: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> >> They are somewhere between 20 and 50 characters in length.
> >>
> >> Thanks for your help,
> >> Bernd
> >>
> >>
> >> On 28.07.2016 at 15:35, Michael McCandless wrote:
> >>> Hmm not good.
> >>>
> >>> If you are really only adding documents, you should be using
> >>> IndexWriter.addDocument, which won't buffer any deleted terms, so that
> >>> method call (applyDeletesAndUpdates) should be a no-op.  It also makes
> >>> flushes more efficient since all of your indexing buffer goes to the
> >>> added documents, not buffered delete terms.  Are you using updateDocument?
> >>>
> >>> Can you reproduce this slowness on a newer release?  There have been
> >>> performance issues fixed in newer releases in this method, e.g.
> >>> https://issues.apache.org/jira/browse/LUCENE-6161
> >>>
> >>> Have you changed any IndexWriterConfig settings from defaults?
> >>>
> >>> What are your unique id fields like?  How many bytes in length?
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> >>> bernd.fehling@uni-bielefeld.de> wrote:
> >>>
> >>>> While trying to get higher performance for indexing it turned out that
> >>>> BufferedUpdateStreams is breaking indexing performance.
> >>>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>>>
> >>>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4
> >>>> API states:
> >>>> "Determines the amount of RAM that may be used for buffering added
> >>>> documents and deletions before they are flushed to the Directory.
> >>>> Generally for faster indexing performance it's best to flush by RAM
> >>>> usage instead of document count and use as large a RAM buffer as you can."
> >>>>
> >>>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>>>
> >>>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >>>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>>>
> >>>> That is about 56 minutes of no indexing, only applying deletes.
> >>>> What is it deleting?
> >>>>
> >>>> As the index gets bigger the time gets longer; currently it is 2.5 hours
> >>>> of waiting.
> >>>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> >>>> deletes.
> >>>>
> >>>> Any suggestions as to which config is _really_ geared toward
> >>>> high-performance indexing?
> >>>>
> >>>> Best regards,
> >>>> Bernd
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>
> >>>>
> >>>
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                    Bielefeld University Library
> Dipl.-Inform. (FH)                LibTec - Library Technology
> Universitätsstr. 25                  and Knowledge Management
> 33615 Bielefeld
> Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
