lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: BufferedUpdateStreams breaks high performance indexing
Date Thu, 28 Jul 2016 15:34:31 GMT
Hmm, your merge policy changes are dangerous: they will leave too many
segments in the index, which makes applying deletes take longer.
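
To illustrate why the segment count matters, here is a minimal sketch in
Java. It is a toy cost model, not Lucene code: it assumes that applying
deletes looks up every buffered delete term in every segment, and the class
name, method name, and the figure of one million buffered terms are all made
up for illustration. TieredMergePolicy defaults segmentsPerTier to 10, versus
100 in the config quoted below.

```java
// Toy cost model, NOT Lucene code. Assumption: each buffered delete term
// is looked up once in every segment when deletes are applied.
final class ApplyDeletesCostModel {

    /** Estimated number of term lookups for one applyDeletes pass. */
    static long termLookups(long bufferedDeleteTerms, int segmentCount) {
        return bufferedDeleteTerms * segmentCount;
    }

    public static void main(String[] args) {
        // Hypothetical: one delete term buffered per updateDocument call.
        long bufferedTerms = 1_000_000L;
        // Roughly one tier of segments with segmentsPerTier=10 vs. 100.
        int[] segmentCounts = {10, 100};
        for (int segments : segmentCounts) {
            System.out.printf("%d segments -> %d term lookups%n",
                    segments, termLookups(bufferedTerms, segments));
        }
    }
}
```

Under that assumption, going from 10 to 100 segments makes each applyDeletes
pass do ten times as many term lookups.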

Can you revert that and re-test?

I'm not sure why DIH is using updateDocument instead of addDocument ...
maybe ask on the solr-user list?
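
For reference, the difference between the two calls can be sketched as
follows. This is a toy stand-in written in Java, not the Lucene API: the
ToyWriter class and its String-based methods are hypothetical and only track
what each call would have to buffer. The one fact it relies on is that
IndexWriter.updateDocument first deletes documents matching the given term
and then adds the new document, whereas addDocument only adds.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for an index writer, NOT the Lucene API: it only records
// what each kind of call has to buffer until the next flush.
final class ToyWriter {
    final List<String> bufferedDocs = new ArrayList<>();
    final List<String> bufferedDeleteTerms = new ArrayList<>();

    /** Pure add: only the document itself is buffered. */
    void addDocument(String id) {
        bufferedDocs.add(id);
    }

    /** Update = delete-by-id, then add: the id is also buffered as a delete term. */
    void updateDocument(String id) {
        bufferedDeleteTerms.add(id);
        bufferedDocs.add(id);
    }

    public static void main(String[] args) {
        ToyWriter adds = new ToyWriter();
        ToyWriter updates = new ToyWriter();
        for (int i = 0; i < 3; i++) {
            adds.addDocument("doc-" + i);
            updates.updateDocument("doc-" + i);
        }
        System.out.println("addDocument:    delete terms = " + adds.bufferedDeleteTerms.size());
        System.out.println("updateDocument: delete terms = " + updates.bufferedDeleteTerms.size());
    }
}
```

In this model an add-only load leaves nothing to apply as deletes, while the
same load sent through updateDocument buffers one delete term per document,
even though none of them matches an existing document.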

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 28, 2016 at 10:07 AM, Bernd Fehling <
bernd.fehling@uni-bielefeld.de> wrote:

> Currently I use concurrent DIH but will write some SolrJ for testing
> or even as a replacement for DIH.
> I don't know what DIH does internally when only documents are added.
>
> I have not tried any newer release yet, but after reading LUCENE-6161 I
> really should.
> At least a version > 5.1,
> maybe before writing some SolrJ.
>
>
> Yes, IndexWriterConfig is changed from the defaults:
> <indexConfig>
>     <maxIndexingThreads>8</maxIndexingThreads>
>     <ramBufferSizeMB>1024</ramBufferSizeMB>
>     <maxBufferedDocs>-1</maxBufferedDocs>
>     <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>       <int name="maxMergeAtOnce">8</int>
>       <int name="segmentsPerTier">100</int>
>       <int name="maxMergedSegmentMB">512</int>
>     </mergePolicy>
>     <mergeFactor>8</mergeFactor>
>     <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
>     <lockType>${solr.lock.type:native}</lockType>
>     ...
> </indexConfig>
>
> A unique id as example: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
> Somewhere between 20 and 50 characters in length.
>
> Thanks for your help,
> Bernd
>
>
> On 28.07.2016 at 15:35, Michael McCandless wrote:
> > Hmm not good.
> >
> > If you are really only adding documents, you should be using
> > IndexWriter.addDocument, which won't buffer any delete terms, so that
> > method call (applyDeletesAndUpdates) should be a no-op.  It also makes
> > flushes more efficient, since all of your indexing buffer goes to the
> > added documents, not buffered delete terms.  Are you using updateDocument?
> >
> > Can you reproduce this slowness on a newer release?  There have been
> > performance issues in this method that were fixed in newer releases, e.g.
> > https://issues.apache.org/jira/browse/LUCENE-6161
> >
> > Have you changed any IndexWriterConfig settings from defaults?
> >
> > What are your unique id fields like?  How many bytes in length?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling <
> > bernd.fehling@uni-bielefeld.de> wrote:
> >
> >> While trying to get higher performance for indexing it turned out that
> >> BufferedUpdateStreams is breaking indexing performance.
> >> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
> >>
> >> In IndexWriterConfig I have setRAMBufferSizeMB=1024, and the Lucene
> >> 4.10.4 API states:
> >> "Determines the amount of RAM that may be used for buffering added
> >> documents and deletions before they are flushed to the Directory.
> >> Generally for faster indexing performance it's best to flush by RAM
> >> usage instead of document count and use as large a RAM buffer as you
> >> can."
> >>
> >> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
> >>
> >> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
> >> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
> >>
> >> About 57 minutes of no indexing, only applying deletes.
> >> What is it deleting?
> >>
> >> As the index gets bigger the time gets longer; currently it is 2.5 hours
> >> of waiting.
> >> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
> >> deletes.
> >>
> >> Any suggestions on which config is _really_ geared toward
> >> high-performance indexing?
> >>
> >> Best regards,
> >> Bernd
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
> >
>
