lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: possible bug with indexing with term vectors
Date Sat, 29 Sep 2007 13:02:02 GMT
There are a couple of JIRA issues related to TVs as well, mostly edge  
cases, but Andi might want to take a look at them to see if they  
describe his situation.

-Grant

On Sep 29, 2007, at 8:35 AM, Michael McCandless wrote:

>
> You are right Grant -- good catch!!!  I have a unit test showing it
> now.  Thank you :)
>
> So, this case is tickled if you have a doc (or docs) that have some
> fields with term vectors enabled, but then later as part of the same
> buffered set of docs you have 1 or more docs that have no fields with
> term vectors enabled.
>
> I'll fix it.
>
> The thing is, from Andi's description I'm not sure this is the case
> he's hitting?  He said all docs have 5 fields, one of them with term
> vectors enabled ... hmmm.
>
> Mike
>
> On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll"  
> <grant.ingersoll@gmail.com> said:
>> Hmmm, not sure, but in looking at DocumentsWriter, it seems like
>> lines around 553 might be at issue:
>> if (tvx != null) {
>>   tvx.writeLong(tvd.getFilePointer());
>>   if (numVectorFields > 0) {
>>     tvd.writeVInt(numVectorFields);
>>     for(int i=0;i<numVectorFields;i++)
>>       tvd.writeVInt(vectorFieldNumbers[i]);
>>     assert 0 == vectorFieldPointers[0];
>>     tvd.writeVLong(tvf.getFilePointer());
>>     long lastPos = vectorFieldPointers[0];
>>     for(int i=1;i<numVectorFields;i++) {
>>       long pos = vectorFieldPointers[i];
>>       tvd.writeVLong(pos-lastPos);
>>       lastPos = pos;
>>     }
>>     tvfLocal.writeTo(tvf);
>>     tvfLocal.reset();
>>   }
>> }
>>
>> Specifically, the exception being thrown seems to be that it is
>> trying to read in a vInt that contains the number of fields that have
>> vectors.  However, in DocumentsWriter, it only writes out this vInt
>> if the numVectorFields is > 0.
>>
>> I think you might try:
>> if (numVectorFields > 0) {
>>   ....
>> } else {
>>   tvd.writeVInt(0);
>> }
>>
>> In the old TermVectorsWriter, it used to be:
>>   private void writeDoc() throws IOException {
>>      if (isFieldOpen())
>>        throw new IllegalStateException("Field is still open while writing document");
>>      //System.out.println("Writing doc pointer: " + currentDocPointer);
>>      // write document index record
>>      tvx.writeLong(currentDocPointer);
>>
>>      // write document data record
>>      final int size = fields.size();
>>
>>      // write the number of fields
>>      tvd.writeVInt(size);
>>
>>      // write field numbers
>>      for (int i = 0; i < size; i++) {
>>        TVField field = (TVField) fields.elementAt(i);
>>        tvd.writeVInt(field.number);
>>      }
>>
>> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
>>
>>
>>
>> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>>
>>>
>>> On Fri, 28 Sep 2007, Andi Vajda wrote:
>>>
>>>> I found a bug with indexing documents that contain fields with
>>>> Term Vectors. The indexing fails with 'reading past EOF' errors in
>>>> what seems to be the index optimizing phase during addIndexes(). (I
>>>> index first into a RAMDirectory, then addIndexes() into an
>>>> FSDirectory.)
>>>>
>>>> I have not filed the bug yet formally as I need to isolate the
>>>> code. If I turn indexing with term vectors off, indexing completes
>>>> fine.
>>>
>>> I tried all morning to isolate the problem but I seem to be unable
>>> to reproduce it in a simple unit test. In my application, I've been
>>> able to get errors by doing even less: just creating an FSDirectory
>>> and adding documents with fields with term vectors fails when
>>> optimizing the index with the error below. I even tried to add the
>>> same documents, in the same order, in the unit test but to no
>>> avail. It just works.
>>>
>>> What is different about my environment ? Well, I'm running
>>> PyLucene, but the new one, the one using Apple's Java VM, the
>>> same VM I'm using to run the unit test. And I'm not doing anything
>>> special like calling back into Python or something, I'm just
>>> calling regular Lucene APIs adding documents into an IndexWriter on
>>> an FSDirectory using a StandardAnalyzer. If I stop using term
>>> vectors, all is working fine.
>>>
>>> I'd like to get to the bottom of this but could use some help. Does
>>> the stacktrace below ring a bell ? Is there a way to run the whole
>>> indexing and optimizing in one single thread ?
>>>
>>> Thanks !
>>>
>>> Andi..
>>>
>>> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
>>> Caused by: java.io.IOException: read past EOF
>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>> java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
>>>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
>>>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
>>> Caused by: java.io.IOException: read past EOF
>>>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>> ------------------------------------------------------
>> Grant Ingersoll
>> http://www.grantingersoll.com/
>> http://lucene.grantingersoll.com
>> http://www.paperoftheweek.com/
>>
>>
>>
>>
>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
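
The mismatch described in the quoted messages (the reader always expects a per-document field-count vInt, while the writer skips it when numVectorFields is 0) can be illustrated with a toy model. This is only a sketch and not Lucene code: the class VectorCountSketch, its method names, and the alwaysWriteCount flag are hypothetical, and records are read sequentially rather than through tvx pointers as the real reader does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

/** Toy model of a per-document "field count + field numbers" record; not Lucene code. */
class VectorCountSketch {

    /** Variable-length int: 7 bits per byte, high bit set means more bytes follow. */
    static void writeVInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte(); // throws EOFException once the stream is exhausted
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /**
     * Writes one record per document. With alwaysWriteCount == false, a document
     * with no vector fields writes nothing at all (the suspected bug); with true,
     * it writes an explicit count of 0 (the suggested fix).
     */
    static byte[] writeDocs(int[][] fieldNumbersPerDoc, boolean alwaysWriteCount) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream tvd = new DataOutputStream(bytes);
        for (int[] fieldNumbers : fieldNumbersPerDoc) {
            if (fieldNumbers.length > 0) {
                writeVInt(tvd, fieldNumbers.length);
                for (int n : fieldNumbers) {
                    writeVInt(tvd, n);
                }
            } else if (alwaysWriteCount) {
                writeVInt(tvd, 0);
            }
        }
        tvd.flush();
        return bytes.toByteArray();
    }

    /** Reads numDocs records and returns the field count found for each document. */
    static int[] readCounts(byte[] data, int numDocs) throws IOException {
        DataInputStream tvd = new DataInputStream(new ByteArrayInputStream(data));
        int[] counts = new int[numDocs];
        for (int d = 0; d < numDocs; d++) {
            counts[d] = readVInt(tvd); // the reader always expects a count here
            for (int i = 0; i < counts[d]; i++) {
                readVInt(tvd); // skip over the field numbers
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // First doc has one vector field (number 3), second doc has none.
        int[][] docs = { {3}, {} };
        System.out.println(Arrays.toString(readCounts(writeDocs(docs, true), 2)));
        try {
            readCounts(writeDocs(docs, false), 2);
        } catch (EOFException e) {
            System.out.println("read past EOF");
        }
    }
}
```

In this model the failure only shows up as an end-of-file error when the document without vector fields is the last one written; in other positions the reader would silently consume the next document's bytes instead.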




