lucene-java-user mailing list archives

From "Umashanker, Srividhya" <srividhya.umashan...@hp.com>
Subject Re: Concurrent Indexing
Date Sat, 21 Jun 2014 05:20:44 GMT
Let me try the NRT reader with a periodic commit, say every 5 minutes, done by a
committer thread only when there is something to commit.
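
A minimal sketch of such a committer, assuming a hypothetical `Committable` interface that stands in for the writer's commit() call (in the real application this would wrap IndexWriter.commit()). The class name `PeriodicCommitter` and the `markDirty()` hook are illustrative, not part of Lucene:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the one writer call this sketch needs. */
interface Committable {
    void commit() throws Exception;
}

/** Commits on a fixed schedule, but only if something was written since the last commit. */
final class PeriodicCommitter implements AutoCloseable {
    private final Committable writer;
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    PeriodicCommitter(Committable writer, long period, TimeUnit unit) {
        this.writer = writer;
        scheduler.scheduleWithFixedDelay(this::commitIfDirty, period, period, unit);
    }

    /** Writer threads call this after each add/updateDocument. */
    void markDirty() {
        dirty.set(true);
    }

    /**
     * Clears the flag before committing, so a write that races with the
     * commit sets it again and is picked up on the next tick.
     */
    synchronized void commitIfDirty() {
        if (dirty.compareAndSet(true, false)) {
            try {
                writer.commit();
            } catch (Exception e) {
                dirty.set(true); // leave the work pending and retry on the next tick
            }
        }
    }

    /** Stops the schedule and performs one last commit for anything still pending. */
    @Override
    public void close() {
        scheduler.shutdown();
        commitIfDirty();
    }
}
```

With a 5 minute period this would be created as `new PeriodicCommitter(target, 5, TimeUnit.MINUTES)`, where `target` is whatever wraps the real writer. The dirty flag is what makes the commit "on need": an idle index is never committed.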

Is there a limit on how long we can go without committing? I think the buffers
get flushed to disk, just not in a crash-proof way, so we should be fine on memory.

I should also verify whether commit() takes longer when more data has piled up since
the last commit. But it should definitely be better than committing from every thread.
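
For that check, a small helper along these lines could collect min/max/avg timings. This is only a sketch: `LatencyStats` is a hypothetical name, and the operation being timed (for example a call to commit()) is supplied by the caller:

```java
import java.util.Locale;
import java.util.LongSummaryStatistics;

/** Collects wall-clock durations in milliseconds; safe to share between threads. */
final class LatencyStats {
    private final LongSummaryStatistics stats = new LongSummaryStatistics();

    /** Runs the operation once and records how long it took. */
    void time(Runnable op) {
        long start = System.nanoTime();
        op.run();
        record((System.nanoTime() - start) / 1_000_000);
    }

    /** Records one already measured duration. */
    synchronized void record(long millis) {
        stats.accept(millis);
    }

    synchronized long count() {
        return stats.getCount();
    }

    /** One-line summary in the same shape as a min/max/avg table row. */
    synchronized String summary() {
        return String.format(Locale.ROOT, "n=%d min=%d max=%d avg=%.2f",
                stats.getCount(), stats.getMin(), stats.getMax(), stats.getAverage());
    }
}
```

Recording the number of documents added since the previous commit next to each timing would show whether commit time grows with the amount of pending data.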

Will post back after tests.

- Vidhya


> On 21-Jun-2014, at 10:28 am, "Vitaly Funstein" <vfunstein@gmail.com> wrote:
> 
> Hmm, I might have actually given you a slightly incorrect explanation wrt
> what happens when internal buffers fill up. There will definitely be a
> flush of the buffer, and segment files will be written to, but it's not
> actually considered a full commit, i.e. an external reader will not see
> these changes (yet). The exact details elude me but there are quite a few
> threads here on what happens during a commit (vs a flush). However, when
> you call IndexWriter.close() a commit will definitely happen.
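
The flush versus commit distinction described above can be illustrated with a toy model. This is NOT Lucene code; `ToyIndex` and its methods are made up purely to show which documents each kind of reader would see:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy model only: documents move from an in-memory buffer, to flushed
 * segments, to the committed set.
 */
final class ToyIndex {
    private final int maxBufferedDocs;
    private final List<String> buffer = new ArrayList<>();    // still in RAM
    private final List<String> flushed = new ArrayList<>();   // written out, not committed
    private final List<String> committed = new ArrayList<>(); // part of the last commit

    ToyIndex(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
    }

    void addDocument(String doc) {
        buffer.add(doc);
        if (buffer.size() >= maxBufferedDocs) {
            flush(); // a full buffer triggers a flush, which is not a commit
        }
    }

    private void flush() {
        flushed.addAll(buffer);
        buffer.clear();
    }

    void commit() {
        flush();
        committed.addAll(flushed);
        flushed.clear();
    }

    /** What a reader opened on the index directory would see. */
    int visibleToCommittedReader() {
        return committed.size();
    }

    /** What a near-real-time reader obtained from the writer would see. */
    int visibleToNrtReader() {
        return committed.size() + flushed.size() + buffer.size();
    }
}
```

In this model, adding 500 documents with maxBufferedDocs set to 100 causes five flushes but leaves the committed reader at zero documents until commit() is called, while the NRT view already covers all 500.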
> 
> But in any event, if you use an NRT reader to search, then it shouldn't
> matter to you when the commit actually takes place. Such readers also
> search uncommitted changes as well as those already on disk. If data
> durability is not a requirement for you, i.e. if you can (and probably do)
> reindex your data from the SOR (system of record) on startup, then not doing commits yourself may
> be the way to go. Or perhaps you could reduce the amount of data you need
> to reindex and still call commit() yourself periodically though not for
> every write transaction, but maybe introduce some watermarking logic
> whereby you detect the highest watermark committed to Lucene. Then reindex
> only the data from the DB from that point onward (meaning only uncommitted
> data is lost and needs to be recovered, but you can figure out exactly
> where that point is).
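
The watermarking idea can be sketched as a toy model. Everything here is hypothetical: `WatermarkedIndex` stands in for the index, the TreeMap stands in for the database, and each row is assumed to carry a monotonically increasing sequence number that is indexed in order. Where the watermark is actually persisted alongside the commit is left open:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of watermark-based recovery; not Lucene code. */
final class WatermarkedIndex {
    private final List<Long> indexed = new ArrayList<>(); // stand-in for index contents
    private int committedCount = 0;      // how many entries the last commit covered
    private long highestIndexed = 0;     // highest sequence number handed to the index
    private long committedWatermark = 0; // highest sequence number known to be durable

    void index(long seq) {
        indexed.add(seq);
        highestIndexed = Math.max(highestIndexed, seq);
    }

    /** Makes everything indexed so far durable and advances the watermark. */
    void commit() {
        committedCount = indexed.size();
        committedWatermark = highestIndexed;
    }

    /** Simulates a crash: everything after the last commit is lost. */
    void crash() {
        indexed.subList(committedCount, indexed.size()).clear();
        highestIndexed = committedWatermark;
    }

    /** Reindexes only the rows newer than the watermark; returns how many were recovered. */
    int recover(TreeMap<Long, String> systemOfRecord) {
        NavigableMap<Long, String> missing = systemOfRecord.tailMap(committedWatermark, false);
        int recovered = missing.size();
        for (long seq : missing.keySet()) {
            index(seq);
        }
        commit();
        return recovered;
    }

    long committedWatermark() {
        return committedWatermark;
    }

    int size() {
        return indexed.size();
    }
}
```

With concurrent writer threads the "indexed in order" assumption does not hold automatically, so a real implementation would need to track the highest sequence number below which everything has been indexed, not just the highest one seen.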
> 
> 
> 
> On Fri, Jun 20, 2014 at 8:02 PM, Umashanker, Srividhya <
> srividhya.umashanker@hp.com> wrote:
> 
>> It is non-transactional. We first write the same data to the database in a
>> transaction and then call the writer's addDocument. If Lucene fails, we still
>> hold the data to recover.
>> 
>> I can avoid the commit if we use NRT reader. We do need this to be
>> searchable immediately.
>> 
>> Another question. I tried removing commit() from each thread and waiting for
>> Lucene to auto-commit, with maxBufferedDocs set to 100 and ramBufferSizeMB
>> set high so that the doc count triggers first. But I did not see the first 100
>> docs in Lucene even after 500 docs.
>> 
>> Is there a way for me to see when Lucene auto-commits?
>>
>> If we tune the auto-commit parameters appropriately, do I still need the
>> committer thread? Its only job is to call commit(); add/updateDocument is
>> already done in my writer threads.
>> 
>> Thanks for your time and your suggestions!
>> 
>> - Vidhya
>> 
>> 
>>> On 21-Jun-2014, at 12:09 am, "Vitaly Funstein" <vfunstein@gmail.com> wrote:
>>> 
>>> You could just avoid calling commit() altogether if your application's
>>> semantics allow this (i.e. it's non-transactional in nature). This way,
>>> Lucene will do commits when appropriate, based on the buffering settings
>>> you chose. It's generally unnecessary and undesirable to call commit at the
>>> end of each write, unless you seek to provide strict durability guarantees
>>> in your system.
>>> 
>>> If you must acknowledge every write after it's been committed, set up a
>>> single committer thread that does this when there are any work tasks in the
>>> queue. Then add to that queue from your writer threads...
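
One way to read this suggestion is as a group commit: writer threads enqueue an acknowledgment request, and a single thread issues one commit per batch of requests. The sketch below assumes a hypothetical `CommitTarget` interface wrapping the writer's commit() call; `GroupCommitter` and `requestCommit()` are illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for the writer's commit call. */
interface CommitTarget {
    void commit() throws Exception;
}

/** Runs on a single thread and issues one commit per batch of queued requests. */
final class GroupCommitter implements Runnable {
    private final CommitTarget writer;
    private final BlockingQueue<CompletableFuture<Void>> pending = new LinkedBlockingQueue<>();

    GroupCommitter(CommitTarget writer) {
        this.writer = writer;
    }

    /**
     * Called by a writer thread after its add/updateDocument has returned.
     * The returned future completes once a commit that covers the write is done.
     */
    CompletableFuture<Void> requestCommit() {
        CompletableFuture<Void> ack = new CompletableFuture<>();
        pending.add(ack);
        return ack;
    }

    @Override
    public void run() {
        List<CompletableFuture<Void>> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(pending.take()); // block until at least one request exists
                pending.drainTo(batch);    // pick up everything else already queued
                try {
                    writer.commit();       // one commit acknowledges the whole batch
                    batch.forEach(ack -> ack.complete(null));
                } catch (Exception e) {
                    batch.forEach(ack -> ack.completeExceptionally(e));
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // interruption is the stop signal
        }
    }
}
```

A writer thread would call `requestCommit().join()` after its write if it needs to block until the data is durable. Requests that arrive while a commit is running are not in the current batch, so they wait for the next commit, which is what keeps the acknowledgment honest.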
>>> 
>>> 
>>> On Fri, Jun 20, 2014 at 8:47 AM, Umashanker, Srividhya <
>>> srividhya.umashanker@hp.com> wrote:
>>> 
>>>> Lucene Experts -
>>>> 
>>>> Recently we upgraded to Lucene 4. We want to make use of the concurrent
>>>> flushing feature of Lucene.
>>>> 
>>>> Indexing for us includes certain DB operations and writing to Lucene,
>>>> ended by a commit. There may be multiple concurrent calls to the Indexer to
>>>> publish single or multiple records.
>>>>
>>>> So far, with the older version of Lucene, we had our indexing synchronized
>>>> (1 thread indexing), which means the waiting time grows with concurrency
>>>> and execution time.
>>>>
>>>> We are moving away from synchronized indexing to cut down the waiting
>>>> period, and are trying to find out whether we have to limit the number of
>>>> threads that add documents and commit.
>>>> 
>>>> Below are the tests - to publish just 1000 records with 3 text fields.
>>>> 
>>>> Java 7 , JVM config :  -XX:MaxPermSize=384M
>>>> -XX:+HeapDumpOnOutOfMemoryError  -Xmx400m -Xms50m -XX:MaxNewSize=100m
>>>> -Xss256k -XX:-UseParallelOldGC -XX:-UseSplitVerifier
>>>> -Djsse.enableSNIExtension=false
>>>> 
>>>> The index configuration is the default. We also tried changing
>>>> maxThreadStates, maxBufferedDocs and ramBufferSizeMB, with no impact.
>>>> 
>>>>
>>>> Operation                       Min time (ms)  Max time (ms)  Avg time (ms)
>>>> 1 thread - commit                          65            267             85
>>>> 1 thread - updateDocument                   0             40              1
>>>> 6 threads - commit                         83           1449         552.42
>>>> 6 threads - updateDocument                  0            175            1.5
>>>> 10 threads - commit                       154           2429            874
>>>> 10 threads - updateDocument                 0            243            1.9
>>>> 20 threads - commit                        76           4351           1622
>>>> 20 threads - updateDocument                 0            326            2.1
>>>>
>>>> The more threads try to write to Lucene, the more updateDocument() and
>>>> commit() become bottlenecks. In the table above, the average commit time
>>>> over the 1000 commits is roughly 0.9 sec with 10 threads and 1.6 sec with
>>>> 20 threads.
>>>> 
>>>> Are there any configuration options or suggestions for tuning the
>>>> performance of these two methods, so that our service performs better
>>>> with more concurrency?
>>>> 
>>>> -vidhya
>>>> 
>>>> 
>>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 
>> 
