lucene-java-user mailing list archives

From <nnagaraja...@transaxtions.com>
Subject RE: "batch-update"-pattern, NoMergeScheduler?
Date Tue, 23 Dec 2014 16:50:22 GMT
You can try out the TimedSerialMergeScheduler. It lets you defer merges to a set time
in the evening, or until after n merge requests ...

http://rankingalgorithm.1050964.n5.nabble.com/TimedSerialMergerScheduler-java-allows-merges-to-be-deferred-to-a-known-time-like-11pm-or-1am-td5706350.html

It is based on the SerialMergeScheduler, so it will block until the merges complete. As Ian
said, parallel merging may do the trick for you, especially if you can build your index on
very fast I/O such as an SSD.
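
To make the deferral idea concrete, here is a minimal sketch of the decision such a scheduler has to make: hold merge requests back until either a nightly window opens or n requests have piled up. DeferredMergeGate and all of its names are hypothetical and use only the JDK; this is not the TimedSerialMergeScheduler source, and it does not touch Lucene's MergeScheduler API.

```java
import java.time.LocalTime;

/**
 * Hypothetical helper (not part of Lucene or TimedSerialMergeScheduler):
 * decides whether queued merge requests should run now or stay deferred.
 */
final class DeferredMergeGate {
    private final LocalTime windowStart;   // e.g. 23:00
    private final LocalTime windowEnd;     // e.g. 01:00, may wrap past midnight
    private final int maxPendingRequests;  // the "n" after which merges run anyway
    private int pendingRequests = 0;

    DeferredMergeGate(LocalTime windowStart, LocalTime windowEnd, int maxPendingRequests) {
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
        this.maxPendingRequests = maxPendingRequests;
    }

    /** Records one merge request; returns true if the queued merges should run now. */
    synchronized boolean requestMerge(LocalTime now) {
        pendingRequests++;
        if (inWindow(now) || pendingRequests >= maxPendingRequests) {
            pendingRequests = 0;  // the caller is expected to run the merges serially
            return true;
        }
        return false;
    }

    synchronized int pendingRequests() {
        return pendingRequests;
    }

    /** True if now lies in [windowStart, windowEnd), handling windows that cross midnight. */
    private boolean inWindow(LocalTime now) {
        if (windowStart.isAfter(windowEnd)) {
            return !now.isBefore(windowStart) || now.isBefore(windowEnd);
        }
        return !now.isBefore(windowStart) && now.isBefore(windowEnd);
    }

    public static void main(String[] args) {
        DeferredMergeGate gate =
            new DeferredMergeGate(LocalTime.of(23, 0), LocalTime.of(1, 0), 3);
        System.out.println(gate.requestMerge(LocalTime.of(12, 0)));   // daytime: deferred
        System.out.println(gate.requestMerge(LocalTime.of(23, 30)));  // inside the window
    }
}
```

A scheduler built around a gate like this would still run the merges one at a time once the gate opens, which is why the calling thread blocks while they complete.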

Warm Regards

-Nagendra Nagarajayya
http://solr-ra.tgels.org
http://elasticsearch-ra.tgels.org
http://rankingalgorithm.tgels.org

-----Original Message-----
From: Ian Lea [mailto:ian.lea@gmail.com] 
Sent: Tuesday, December 23, 2014 2:54 AM
To: java-user@lucene.apache.org
Subject: Re: "batch-update"-pattern, NoMergeScheduler?

Hi


I can't give an exact answer to your question but my experience has been that it's best to
leave all the merge/buffer/etc settings alone.
If you are doing a bulk update of a large number of docs then it's no surprise that you are
seeing a heavy IO load.  If you can, it's likely to be worth giving Lucene a dedicated disk
or at least making sure there's as little contention as possible - that's just general advice
for any workload.  There is always going to be a limiting factor somewhere.

You could also experiment with multiple threads, or multiple jobs writing to separate indexes
with a standalone merge at the end.  In my experience these have generally been more trouble
than they're worth, but the occasions when I do bulk loads of a large number of docs are
sufficiently rare that I'm not too bothered how long it takes.
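
For what the "separate indexes, one merge at the end" shape looks like, here is a toy sketch using only the JDK: each worker builds its own small term -> docId map for a slice of the input, and the partial results are combined in a single step afterwards. ShardedIndexBuild and its methods are hypothetical and stand in for real index writers; with Lucene itself the final step would be a call to IndexWriter.addIndexes on the per-job directories.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical toy: per-worker partial indexes, combined once at the end. */
final class ShardedIndexBuild {

    /** Builds a term -> sorted docId list for the documents in [from, to). */
    static Map<String, List<Integer>> buildShard(List<String> docs, int from, int to) {
        Map<String, List<Integer>> shard = new TreeMap<>();
        for (int docId = from; docId < to; docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                List<Integer> postings = shard.computeIfAbsent(term, t -> new ArrayList<>());
                // Record each document at most once per term.
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        return shard;
    }

    /** Builds the shards on separate threads, then merges them in shard order. */
    static Map<String, List<Integer>> buildInParallel(List<String> docs, int shards) {
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        try {
            List<Future<Map<String, List<Integer>>>> parts = new ArrayList<>();
            int chunk = (docs.size() + shards - 1) / shards;  // ceiling division
            for (int s = 0; s < shards; s++) {
                final int from = Math.min(docs.size(), s * chunk);
                final int to = Math.min(docs.size(), from + chunk);
                parts.add(pool.submit(() -> buildShard(docs, from, to)));
            }
            // The standalone merge: no worker touches the combined result.
            Map<String, List<Integer>> merged = new TreeMap<>();
            for (Future<Map<String, List<Integer>>> part : parts) {
                for (Map.Entry<String, List<Integer>> e : part.get().entrySet()) {
                    merged.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
                }
            }
            return merged;
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("shard build failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "lucene merge", "merge scheduler", "lucene index", "index merge");
        System.out.println(buildInParallel(docs, 2));
    }
}
```

The extra moving parts (splitting the input, waiting on every worker, the final combine) are the sort of overhead that makes this more trouble than a single writer unless the load is large enough to pay for it.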


--
Ian.





On Mon, Dec 22, 2014 at 9:45 AM, Clemens Wyss DEV <clemensdev@mysign.ch> wrote:
> One of our indexes is completely updated quite frequently ("batch update" or "re-index").
> When this happens, more than 2 million documents are added to or updated in that index. This
> creates an immense IO load on our system. Does it make sense to set the merge scheduler to
> NoMergeScheduler (and/or the MergePolicy to NoMergePolicy)? Or is merging "not relevant"
> because the commit is done only at the very end?
>
> Context information:
> At the moment the writer's config consists only of setRAMBufferSizeMB:
> IndexWriterConfig config = new IndexWriterConfig( 
> IndexManager.CURRENT_LUCENE_VERSION, analyzer ); 
> config.setMergePolicy( NoMergePolicy.NO_COMPOUND_FILES ); 
> //config.setMergeScheduler( NoMergeScheduler.INSTANCE ); 
> config.setRAMBufferSizeMB( 20 );
>
> The update logic is as follows:
> indexWriter.deleteAll();
> ...
> for all elements do {
> ...
> indexWriter.updateDocument( term, doc ); // in order to omit "duplicate entries"
> ...
> }
> indexWriter.commit();
>
> What is the proposed way to perform such a batch update?

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org




