lucene-dev mailing list archives

From Andi Vajda
Subject Re: possible bug with indexing with term vectors
Date Sat, 29 Sep 2007 17:04:27 GMT

On Sat, 29 Sep 2007, Michael McCandless wrote:

>> The new PyLucene is built with a code generator and all public APIs and
>> classes are made available to Python. SerialMergeScheduler is available.
> Wild!  Does this mean PyLucene will track tightly to Lucene releases
> going forward?

Yes, even more tightly than before since I don't have to patch the Lucene 
sources anymore.

> What happened prior to this first optimize call?  Did you just create
> the writer, switch to SerialMergeScheduler, add N docs, then call
> setInfoStream(...) and writer.optimize()?

Yes, that's almost exactly it. I create a new writer (with create=true), then
close it and its directory. Then I reopen it and add N docs.

> The debug output starts with an optimize() call, which first flushes
> 372 docs to segment _7f; this is the first segment in the index.  Had
> you opened this writer with create=true?

I open the writer with create=true when the app creates its initial repository.
After that, the writer is reopened without create=true and documents are added
to it.

> This optimize() does nothing because the index has only one segment
> (_7f) in compound file format, so it's already optimized.  Then the
> writer is closed.
> Then this is printed:
>  <DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
> Which is odd because 191 != 372.  Can't explain that difference...

That's because an item can have several attributes that get indexed, each 
becoming a Lucene document (an item is a Chandler object).
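
To make the item/document relationship concrete, here is a minimal Python
sketch. The item ids, attribute names, and the documents_for helper are all
made up for illustration; this is not Chandler or PyLucene code. It only shows
how items with several indexed attributes produce more documents than items.

```python
# Hypothetical model: each indexed attribute of an item becomes one document.
# None of these names come from Chandler or PyLucene.

def documents_for(item_id, attributes):
    """Return one (item_id, attribute_name, value) record per indexed attribute."""
    return [(item_id, name, value) for name, value in attributes.items()]

# Made-up items, each with one or more indexed attributes.
items = {
    "item-1": {"title": "Lunch", "body": "Noon at the cafe"},
    "item-2": {"title": "Call Bob"},
    "item-3": {"title": "Trip", "body": "Pack bags", "location": "Paris"},
}

docs = []
for item_id, attributes in items.items():
    docs.extend(documents_for(item_id, attributes))

# The document count exceeds the item count whenever an item
# has more than one indexed attribute.
print(len(items), "items ->", len(docs), "documents")
```

The same effect accounts for the 191 items versus 372 docs in the output
above.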

> Then another index writer is opened, 5 docs are added, then optimize()
> is called, which flushes 5 docs to segment _7g and converts it to
> compound file format.
> Finally we try to merge _7f and _7g for optimize, and we hit the EOF
> exception trying to read the term vector for a doc from one of these
> two segments.

Ok, this could explain why the test is passing. In the test I only do one 
batch of indexing, not several like here. I missed that difference. My 
apologies. I'm going to change my test now and report back...

Thank you for the explanations.

