lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jibo John <jiboj...@mac.com>
Subject Re: ThreadedIndexWriter vs. IndexWriter
Date Fri, 31 Jul 2009 22:31:52 GMT
Tried with a larger set of documents (2,000,000 ) this time.

ThreadedIndexWriter
-------------------------------
Size  - 1.4 G
optimized - yes (as suggested by Phil)
Number of documents - 1,999,924 (Not idea where the 76 documents  
vanished...)
Number of terms - 3,638,801


IndexWriter
---------------
Size - 1.8 G    (Noticed the size difference factor reduced to 23%)
optimized - yes
Number of documents  - 2,000,000  (All of them got in)
Number of terms - 10,624,806

I think it's getting complicated with more unanswered questions..

1. Why didn't those 76 docs get in while using ThreadedIndexWriter ?
2. Why would the number of terms triple for a difference of 76  
documents out of 2 million?
3. And, my original question..why there is still a huge variation in  
size difference b/n the two indexes


Thanks,
-Jibo


On Jul 31, 2009, at 1:44 PM, ohaya@cox.net wrote:

> Hi,
>
> Sorry to jump in, but I've been following this thread with  
> interest :)...
>
> Am I misunderstanding your original observation, that  
> ThreadedIndexWriter produced smaller index?  Did the  
> ThreadedIndexWriter also finish faster (I'm assuming that it should)?
>
> If the index is smaller, and everything else being good and equal,  
> doesn't that mean that using ThreadedIndexWriter is a good thing?
>
> Anyway, aside from checking that the # of documents were the same,  
> have you looked at the index using something like Luke?  Does the  
> contents of the index look the same in both cases, or were they  
> different?  If different, how so (e.g., missing terms, etc.)?
>
> Later,
> Jim
>
>
> On Fri, Jul 31, 2009 at 2:38 PM , Jibo John wrote:
>
>> Number of docs are the same in the index for both the cases  
>> (200,000).
>> I haven't altered the benchmark/ code, but, used a profiler to  
>> verify that  Benchmark main thread is closed only after all other   
>> threads are closed.
>>
>> Thanks,
>> -Jibo
>>
>>
>> On Jul 31, 2009, at 2:34 AM, Michael McCandless wrote:
>>
>>> Hmm... this doesn't sound right.
>>>
>>> That example (ThreadedIndexWriter) is meant to be a drop-in
>>> replacement, wherever you use an IndexWriter, that keeps an
>>> under-the-hood thread pool (using java.util.concurrent.*) to
>>> add/update documents with multiple threads.
>>>
>>> It should not result in a smaller index.
>>>
>>> Can you sanity check the index?  Eg is numDocs() the same for both?
>>> You definitely called close() on the writer, right?  That method  
>>> waits
>>> for all threads to finish their work before actually closing.
>>>
>>> Mike
>>>
>>> On Thu, Jul 30, 2009 at 8:01 PM, Jibo John<jibojohn@mac.com> wrote:
>>>> While trying out a few tuning options using contrib/benchmak as  
>>>> described in
>>>> LIA (2nd edition) book, I had an interesting observation.
>>>>
>>>> If I use a ThreadedIndexWriter (picked the example from lia2e,  
>>>> page 356)
>>>> instead of IndexWriter, the index size got reduced by 40%  
>>>> compared to using
>>>> IndexWriter.
>>>> Index related configuration were the same for both the tests in  
>>>> the alg
>>>> file.
>>>>
>>>> I am curious how come using a threaded index writer will have an  
>>>> impact on
>>>> the index size.
>>>>
>>>> Appreciate your input.
>>>>
>>>> Thanks,
>>>> -Jibo
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message