lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jibo John <jiboj...@mac.com>
Subject Re: ThreadedIndexWriter vs. IndexWriter
Date Tue, 11 Aug 2009 20:13:41 GMT
Mike,

Yes, it works perfect !

I did observe a dip in the indexing throughput (1855 recs/sec vs. 2200  
recs/sec previously), but, more importantly, no data is lost this time.

Thanks for helping me nail this down.

-Jibo



On Aug 11, 2009, at 11:12 AM, Michael McCandless wrote:

> OK I found the problem!
>
> It was losing docs from the queue, when shutting down the thread pool,
> because we were calling super's addDocument(doc) not addDocument(doc,
> analyzer).  IndexWriter was simply forwarding that call to
> ThreadedIndexWriter's addDocument(doc, analyzer) which in turn would
> do nothing because the thread pool was already told to shut down.
> Larger queues made it much more likely to happen.
>
> Can you try the new version (attached)?
>
> Also, make sure you add 'doc.reuse.fields=false' to your alg (on
> trunk).
>
> Mike
>
> On Tue, Aug 11, 2009 at 12:39 PM, Jibo John<jibojohn@mac.com> wrote:
>> Mike,
>>
>> I wasn't exactly using the lucene core jar from MEAP.
>>
>> I have been building lucene from the source, and running the tests  
>> under
>> lucene/java/trunk/contrib/benchmark/ (checked out 2 weeks ago, I  
>> guess)
>>  and, also under  lucene/java/tags/lucene_2_4_1/contrib/benchmark/.
>> In both cases, copied CreateThreadedIndexTask to
>> org.apache.lucene.benchmark.byTask.tasks and ThreadedIndexWriter to
>> org.apache.lucene.index.
>>
>> I have observed the issue in both the versions of lucene.
>>
>> Indexes were optimized separately using Lucli.
>>
>>
>> PFA the classes and the alg.
>>
>>
>>
>>
>>
>>
>> Thank you for your help with this one.
>>
>> -Jibo
>>
>>
>>
>>
>> On Aug 11, 2009, at 3:13 AM, Michael McCandless wrote:
>>
>>> I'm baffled why you're losing docs w/ ThreadedIndexWriter.
>>>
>>> One question: your Lucene core JAR seems to be newer than the last
>>> MEAP update.  Did you update it manually?
>>>
>>> Also, your indexes were optimized, but your algs don't have an
>>> optimize step -- did you separately run an optimize?
>>>
>>> Could you zip up the whole shebang (ThreadedIndexWriter.java,
>>> CreateThreadedIndexTask.java, the algs) & post?  Please CC me  
>>> directly
>>> so I can grab the zip file... thanks.
>>>
>>> Mike
>>>
>>> On Mon, Aug 3, 2009 at 12:37 PM, Jibo John<jibojohn@mac.com> wrote:
>>>>
>>>> Mike,
>>>>
>>>> Verified that I have the latest source code.
>>>> Here are the alg files and the checkindexer output.
>>>>
>>>>
>>>> ----------------------------------------- indexwriter
>>>> alg----------------------------------------------------------------
>>>>
>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>> directory=FSDirectory
>>>>
>>>> doc.stored =  
>>>> true                                                    #A
>>>> docs.file=wikipedia.lines.txt
>>>> ram.flush.mb=50
>>>> compound=false
>>>> merge.factor=5
>>>> doc.add.log.step=1000
>>>> doc.term.vector=false
>>>> doc.term.vector.positions=false
>>>> doc.term.vector.offsets=false
>>>>
>>>> { "Rounds 
>>>> "                                                           #B
>>>>  ResetSystemErase
>>>>  { "BuildIndex"
>>>>  -CreateIndex()
>>>>  [ { "AddDocs" AddDoc > : 40000 ] : 5
>>>>  #C
>>>>  -CloseIndex()
>>>>  }
>>>>  NewRound
>>>> } : 1
>>>>
>>>> RepSumByPrefRound  
>>>> BuildIndex                                         #D
>>>>
>>>> -----------------------------------------threadedindexwriter alg
>>>> ----------------------------------------------------------------
>>>>
>>>> analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
>>>> doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>>>> directory=FSDirectory
>>>>
>>>> doc.stored =  
>>>> true                                                    #A
>>>> docs.file=wikipedia.lines.txt
>>>> ram.flush.mb=50
>>>> compound=false
>>>> merge.factor=5
>>>> doc.add.log.step=1000
>>>> doc.term.vector=false
>>>> doc.term.vector.positions=false
>>>> doc.term.vector.offsets=false
>>>> writer.num.threads=15
>>>> writer.max.thread.queue.size=75
>>>> work.dir=work_t
>>>>
>>>>
>>>> { "Rounds 
>>>> "                                                           #B
>>>>  ResetSystemErase
>>>>  { "BuildIndex"
>>>>  -CreateThreadedIndex()
>>>>  { "AddDocs" AddDoc > : 200000
>>>>  -CloseIndex()
>>>>  }
>>>>  NewRound
>>>> } : 1
>>>>
>>>> RepSumByPrefRound  
>>>> BuildIndex                                         #D
>>>>
>>>>
>>>> -----------------------------------------------threadedindexwriter
>>>> checkindex  
>>>> ----------------------------------------------------------
>>>>
>>>>
>>>> $ java -classpath
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9- 
>>>> dev.jar
>>>> org.apache.lucene.index.CheckIndex
>>>>
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/ 
>>>> work_t/index
>>>>
>>>> NOTE: testing will be more thorough if you run java with
>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>
>>>> Opening index @
>>>>
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/ 
>>>> work_t/index
>>>>
>>>> Segments file=segments_3 numSegments=1 version=FORMAT_DIAGNOSTICS  
>>>> [Lucene
>>>> 2.9]
>>>>  1 of 1: name=_p docCount=199941
>>>>  compound=true
>>>>  hasProx=true
>>>>  numFiles=3
>>>>  size (MB)=317.1
>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev  
>>>> 779767M -
>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>> mergeDocStores=false, java.vendor=Apple Inc., os.version=10.5.7,
>>>> source=merge, mergeFactor=5}
>>>>  docStoreOffset=0
>>>>  docStoreSegment=_0
>>>>  docStoreIsCompoundFile=false
>>>>  no deletions
>>>>  test: open reader.........OK
>>>>  test: fields, norms.......OK [4 fields]
>>>>  test: terms, freq, prox...OK [1269552 terms; 67887116 terms/docs  
>>>> pairs;
>>>> 133241176 tokens]
>>>>  test: stored fields.......OK [199941 total field count; avg 1  
>>>> fields per
>>>> doc]
>>>>  test: term vectors........OK [0 total vector count; avg 0 term/ 
>>>> freq
>>>> vector
>>>> fields per doc]
>>>>
>>>> No problems were detected with this index.
>>>>
>>>> ------------------------------------------indexwriter checkindex
>>>> ---------------------------------------------------------------
>>>>
>>>> $ java -classpath
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/build/lucene-core-2.9- 
>>>> dev.jar
>>>> org.apache.lucene.index.CheckIndex
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/ 
>>>> work/index
>>>>
>>>> NOTE: testing will be more thorough if you run java with
>>>> '-ea:org.apache.lucene...', so assertions are enabled
>>>>
>>>> Opening index @
>>>> /Users/jibo/Desktop/iwork/lucene/java/trunk/contrib/benchmark/ 
>>>> work/index
>>>>
>>>> Segments file=segments_a numSegments=1 version=FORMAT_DIAGNOSTICS  
>>>> [Lucene
>>>> 2.9]
>>>>  1 of 1: name=_18 docCount=200000
>>>>  compound=true
>>>>  hasProx=true
>>>>  numFiles=1
>>>>  size (MB)=427.445
>>>>  diagnostics = {java.version=1.5.0_19, lucene.version=2.9-dev  
>>>> 779767M -
>>>> 2009-05-28 17:02:17, os=Mac OS X, os.arch=i386, optimize=true,
>>>> mergeDocStores=true, java.vendor=Apple Inc., os.version=10.5.7,
>>>> source=merge, mergeFactor=4}
>>>>  no deletions
>>>>  test: open reader.........OK
>>>>  test: fields, norms.......OK [4 fields]
>>>>  test: terms, freq, prox...OK [3512343 terms; 80020204 terms/docs  
>>>> pairs;
>>>> 163219760 tokens]
>>>>  test: stored fields.......OK [200000 total field count; avg 1  
>>>> fields per
>>>> doc]
>>>>  test: term vectors........OK [0 total vector count; avg 0 term/ 
>>>> freq
>>>> vector
>>>> fields per doc]
>>>>
>>>> No problems were detected with this index.
>>>>
>>>>
>>>> ---------------------------------------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>> -Jibo
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Aug 1, 2009, at 2:08 AM, Michael McCandless wrote:
>>>>
>>>>> (Please note that ThreadedIndexWriter is source code available  
>>>>> with
>>>>> the upcoming revision to Lucene in Action.)
>>>>>
>>>>> Phil, is it possible you are using an older version of the book's
>>>>> source code?  In particular, can you check whether your version of
>>>>> ThreadedIndexWriter.java has this:
>>>>>
>>>>>  public void close(boolean doWait) throws CorruptIndexException,
>>>>> IOException {
>>>>>  finish();
>>>>>  super.close(doWait);
>>>>>  }
>>>>>
>>>>> (I vaguely remember that being missing from earlier releases,  
>>>>> which
>>>>> could explain what you're seeing).  If you are missing that, can  
>>>>> you
>>>>> download the current code from http://www.manning.com/hatcher3  
>>>>> and try
>>>>> again?
>>>>>
>>>>> If that's not the problem... can you post the benchmark alg you  
>>>>> are
>>>>> using in each case?
>>>>>
>>>>> Mike
>>>>>
>>>>> On Fri, Jul 31, 2009 at 8:26 PM, Jibo John<jibojohn@mac.com>  
>>>>> wrote:
>>>>>>
>>>>>> Hi Phil,
>>>>>>
>>>>>> It's 5 threads for IndexWriter.
>>>>>>
>>>>>> For ThreadedIndexWriter, I used:
>>>>>>
>>>>>> writer.num.threads=16
>>>>>> writer.max.thread.queue.size=80
>>>>>>
>>>>>> Thanks,
>>>>>> -Jibo
>>>>>>
>>>>>> On Jul 31, 2009, at 5:01 PM, Phil Whelan wrote:
>>>>>>
>>>>>>> Hi Jibo,
>>>>>>>
>>>>>>> Your mergeFactor is different, and the resulting numFiles  
>>>>>>> (segment
>>>>>>> files) is different. Maybe each thread is responsible for a 

>>>>>>> segment
>>>>>>> file. Just curious - do you have 3 threads?
>>>>>>>
>>>>>>> Phil
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>>> For additional commands, e-mail: java-user- 
>>>>>>> help@lucene.apache.org
>>>>>>>
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message