lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: possible bug with indexing with term vectors
Date Sat, 29 Sep 2007 11:59:13 GMT
Hmmm, not sure, but looking at DocumentsWriter, the lines around 553
might be the issue:
if (tvx != null) {
  tvx.writeLong(tvd.getFilePointer());
  if (numVectorFields > 0) {
    tvd.writeVInt(numVectorFields);
    for (int i = 0; i < numVectorFields; i++)
      tvd.writeVInt(vectorFieldNumbers[i]);
    assert 0 == vectorFieldPointers[0];
    tvd.writeVLong(tvf.getFilePointer());
    long lastPos = vectorFieldPointers[0];
    for (int i = 1; i < numVectorFields; i++) {
      long pos = vectorFieldPointers[i];
      tvd.writeVLong(pos - lastPos);
      lastPos = pos;
    }
    tvfLocal.writeTo(tvf);
    tvfLocal.reset();
  }
}

Specifically, the exception seems to be thrown while TermVectorsReader
is trying to read the vInt that holds the number of fields that have
vectors. However, DocumentsWriter only writes this vInt when
numVectorFields is > 0, so for a document without vector fields there
may be nothing there for the reader to read.

I think you might try:
if (numVectorFields > 0) {
  ....
} else {
  tvd.writeVInt(0);
}
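
To make the suspected mismatch concrete, here is a minimal standalone sketch. It does not use Lucene at all: the class VectorCountSketch, the writeDoc/readFieldCount helpers and the alwaysWriteCount flag are made up for illustration, and the vInt encoding is hand-rolled on top of java.io streams (7 bits per byte, high bit meaning "more bytes follow"). It only shows that a reader which always expects a leading count runs off the end of the data when the writer skips the count for a document with zero vector fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class VectorCountSketch {

    // Hand-rolled variable-length int: 7 bits per byte, high bit set
    // when more bytes follow. Not Lucene's own implementation.
    static void writeVInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte(); // throws EOFException if no bytes remain
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Hypothetical stand-in for one per-document record.
    // alwaysWriteCount == false mimics the "only write if > 0" branch;
    // alwaysWriteCount == true mimics the proposed else { writeVInt(0) }.
    static byte[] writeDoc(int[] fieldNumbers, boolean alwaysWriteCount)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream tvd = new DataOutputStream(bytes);
        if (fieldNumbers.length > 0) {
            writeVInt(tvd, fieldNumbers.length);
            for (int n : fieldNumbers) {
                writeVInt(tvd, n);
            }
        } else if (alwaysWriteCount) {
            writeVInt(tvd, 0);
        }
        tvd.flush();
        return bytes.toByteArray();
    }

    // Reader that always expects a leading field count.
    // Returns the count, or -1 if it hit end of data first.
    static int readFieldCount(byte[] record) throws IOException {
        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(record));
        try {
            return readVInt(in);
        } catch (EOFException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        // No vector fields, count skipped: reader hits end of data.
        System.out.println(readFieldCount(writeDoc(new int[0], false)));
        // No vector fields, count written as 0: reader is happy.
        System.out.println(readFieldCount(writeDoc(new int[0], true)));
        // Two vector fields: count is read back in either mode.
        System.out.println(readFieldCount(writeDoc(new int[] {3, 7}, false)));
    }
}
```

In this toy setup each document is its own byte array, so the missing count always shows up as end of data. In a real file holding many documents, I would expect the reader to hit EOF only when the affected document is at the end, and to misread the next record otherwise, but that is a guess.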

In the old TermVectorsWriter, the field count was always written, even
when it was zero:
  private void writeDoc() throws IOException {
     if (isFieldOpen())
       throw new IllegalStateException("Field is still open while writing document");
     //System.out.println("Writing doc pointer: " + currentDocPointer);
     // write document index record
     tvx.writeLong(currentDocPointer);

     // write document data record
     final int size = fields.size();

     // write the number of fields
     tvd.writeVInt(size);

     // write field numbers
     for (int i = 0; i < size; i++) {
       TVField field = (TVField) fields.elementAt(i);
       tvd.writeVInt(field.number);
     }

http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup



On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:

>
> On Fri, 28 Sep 2007, Andi Vajda wrote:
>
>> I found a bug with indexing documents that contain fields with
>> Term Vectors. The indexing fails with 'reading past EOF' errors in
>> what seems to be the index optimizing phase during addIndexes(). (I
>> index first into a RAMDirectory, then addIndexes() into an
>> FSDirectory).
>>
>> I have not filed the bug yet formally as I need to isolate the  
>> code. If I turn indexing with term vectors off, indexing completes  
>> fine.
>
> I tried all morning to isolate the problem but I seem to be unable  
> to reproduce it in a simple unit test. In my application, I've been  
> able to get errors by doing even less: just creating a FSDirectory  
> and adding documents with fields with term vectors fails when  
> optimizing the index with the error below. I even tried to add the  
> same documents, in the same order, in the unit test but to no  
> avail. It just works.
>
> What is different about my environment ? Well, I'm running
> PyLucene, but the new one, the one using Apple's Java VM, the
> same VM I'm using to run the unit test. And I'm not doing anything
> special like calling back into Python or something, I'm just  
> calling regular Lucene APIs adding documents into an IndexWriter on  
> an FSDirectory using a StandardAnalyzer. If I stop using term  
> vectors, all is working fine.
>
> I'd like to get to the bottom of this but could use some help. Does  
> the stacktrace below ring a bell ? Is there a way to run the whole  
> indexing and optimizing in one single thread ?
>
> Thanks !
>
> Andi..
>
> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
> Caused by: java.io.IOException: read past EOF
>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
> java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
>         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
> Caused by: java.io.IOException: read past EOF
>         at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>         at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>         at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>         at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




