lucene-java-user mailing list archives

From Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
Subject Re: BufferedUpdateStreams breaks high performance indexing
Date Thu, 28 Jul 2016 14:07:09 GMT
Currently I use concurrent DIH, but I will write some SolrJ for testing,
or even as a replacement for DIH.
I don't know what DIH does under the hood when only documents are added.

I haven't tried any newer release yet, but after reading LUCENE-6161 I
really should, at least a version > 5.1.
Maybe before writing some SolrJ.
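Such an add-only SolrJ indexer could look roughly like the sketch below
(the URL, queue size, thread count, and field names are placeholder
assumptions, not values from this thread):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AddOnlyIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL and sizing; tune queueSize and threadCount for the hardware.
        try (ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient(
                "http://localhost:8983/solr/mycore", 10000, 8)) {
            for (int i = 0; i < 1000; i++) {           // stand-in for the real record source
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "sample:" + i);     // hypothetical unique id
                doc.addField("title_s", "document " + i);
                client.add(doc);                       // pure adds, no deletes
            }
            client.commit();                           // single commit at the end
        }
    }
}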


Yes, IndexWriterConfig is changed from the defaults:
<indexConfig>
    <maxIndexingThreads>8</maxIndexingThreads>
    <ramBufferSizeMB>1024</ramBufferSizeMB>
    <maxBufferedDocs>-1</maxBufferedDocs>
    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">8</int>
      <int name="segmentsPerTier">100</int>
      <int name="maxMergedSegmentMB">512</int>
    </mergePolicy>
    <mergeFactor>8</mergeFactor>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
    <lockType>${solr.lock.type:native}</lockType>
    ...
</indexConfig>
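For comparison, a rough plain-Lucene sketch of what Solr builds from the
<indexConfig> above (the analyzer choice and the 5.x-style constructor are
assumptions; <maxIndexingThreads> has no one-to-one setter in newer Lucene
releases, so it is omitted):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class IndexConfigSketch {
    static IndexWriterConfig highThroughputConfig() {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setRAMBufferSizeMB(1024);                                 // <ramBufferSizeMB>
        iwc.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // the -1 above

        TieredMergePolicy tmp = new TieredMergePolicy();              // <mergePolicy>
        tmp.setMaxMergeAtOnce(8);
        tmp.setSegmentsPerTier(100);
        tmp.setMaxMergedSegmentMB(512);
        iwc.setMergePolicy(tmp);

        iwc.setMergeScheduler(new ConcurrentMergeScheduler());        // <mergeScheduler>
        return iwc;
    }
}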

An example unique id: "ftoxfordilej:ar.1770.x.x.13.x.x.u1"
They are somewhere between 20 and 50 characters in length.

Thanks for your help,
Bernd


Am 28.07.2016 um 15:35 schrieb Michael McCandless:
> Hmm, not good.
> 
> If you are really only adding documents, you should be using
> IndexWriter.addDocument, which won't buffer any deleted terms, so that
> applyDeletes call should be a no-op.  It also makes flushes more efficient,
> since all of your indexing buffer goes to the added documents, not to
> buffered delete terms.  Are you using updateDocument?
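The distinction, as a minimal sketch (the writer and document setup are
assumed; the id value is just the example from this thread):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Given an IndexWriter 'writer' and a Document 'doc' (setup omitted):

// Pure add: buffers no delete terms, so applyDeletes stays a no-op.
writer.addDocument(doc);

// Update: first buffers a delete-by-term on the id, even when the id is new;
// those buffered terms are what applyDeletes later resolves against the index.
writer.updateDocument(new Term("id", "ftoxfordilej:ar.1770.x.x.13.x.x.u1"), doc);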
> 
> Can you reproduce this slowness on a newer release?  There have been
> performance issues fixed in this method in newer releases, e.g.
> https://issues.apache.org/jira/browse/LUCENE-6161
> 
> Have you changed any IndexWriterConfig settings from defaults?
> 
> What are your unique id fields like?  How many bytes in length?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Thu, Jul 28, 2016 at 5:01 AM, Bernd Fehling
> <bernd.fehling@uni-bielefeld.de> wrote:
> 
>> While trying to get higher indexing performance, it turned out that
>> BufferedUpdateStreams breaks indexing performance in this method:
>> public synchronized ApplyDeletesResult applyDeletesAndUpdates(...)
>>
>> At IndexWriterConfig I have setRAMBufferSizeMB=1024 and the Lucene 4.10.4
>> API states:
>> "Determines the amount of RAM that may be used for buffering added
>> documents and deletions before they are flushed to the Directory.
>> Generally for faster indexing performance it's best to flush by RAM
>> usage instead of document count and use as large a RAM buffer as you can."
>>
>> Also setMaxBufferedDocs=-1 and setMaxBufferedDeleteTerms=-1.
>>
>> BD 0 [Wed Jul 27 13:42:03 GMT+01:00 2016; Thread-27890]: applyDeletes: infos=...
>> BD 0 [Wed Jul 27 14:38:55 GMT+01:00 2016; Thread-27890]: applyDeletes took 3411845 msec
>>
>> About 56 minutes of no indexing, only applying deletes.
>> What is it deleting?
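The "BD" lines above are IndexWriter infoStream diagnostics from the
buffered-deletes machinery; a minimal sketch of enabling them in plain
Lucene (in Solr the same is switched on via the <infoStream> element in
solrconfig.xml):

import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.PrintStreamInfoStream;

// Given an IndexWriterConfig 'iwc' (setup omitted): route IndexWriter
// diagnostics, including the "BD ... applyDeletes took" timings, to stdout.
iwc.setInfoStream(new PrintStreamInfoStream(System.out));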
>>
>> As the index gets bigger the time gets longer; currently 2.5 hours of
>> waiting.
>> I'm adding 96 million docs with unique ids, no duplicates, only adds, no
>> deletes.
>>
>> Any suggestions which config is _really_ going for high performance
>> indexing?
>>
>> Best regards,
>> Bernd
>>
> 


