lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andi Vajda <va...@osafoundation.org>
Subject Re: possible bug with indexing with term vectors
Date Sat, 29 Sep 2007 01:44:53 GMT

On Fri, 28 Sep 2007, Michael McCandless wrote:

>> I tried all morning to isolate the problem but I seem to be unable
>> to reproduce it in a simple unit test. In my application, I've been
>> able to get errors by doing even less: just creating a FSDirectory
>> and adding documents with fields with term vectors fails when
>> optimizing the index with the error below. I even tried to add the
>> same documents, in the same order, in the unit test but to no
>> avail. It just works.
>
> Are you trying your unit test first in Python (using PyLucene)?

No, I wrote it in Java to begin with. But it's a good idea. I should try it 
from PyLucene too.

>> What is different about my environment ? Well, I'm running PyLucene,
>> but the new one, the one using a Apple's Java VM, the same VM I'm
>> using to run the unit test. And I'm not doing anything special like
>> calling back into Python or something, I'm just calling regular
>> Lucene APIs adding documents into an IndexWriter on an FSDirectory
>> using a StandardAnalyzer. If I stop using term vectors, all is
>> working fine.
>
> Are your documents irregular wrt term vectors?  (Ie some docs have
> none, others store the terms but not positions/offsets, etc?).  Any
> interesting changes to Lucene's defaults (autoCommit=false, etc)?

All default config values, no config changes. All documents follow the same 
pattern of having five fields, one with term vectors.

>> I'd like to get to the bottom of this but could use some help. Does
>> the stacktrace below ring a bell ? Is there a way to run the whole
>> indexing and optimizing in one single thread ?
>
> You can easily turn off the concurrent (background) merges by doing
> this:
>
>  writer.setMergeScheduler(new SerialMergeScheduler())
>
> though that probably isn't punched through to Python in PyLucene.  You
> can also build a Lucene JAR w/ a small change to IndexWriter.java to
> do the same thing.

The new PyLucene is built with a code generator and all public APIs and 
classes are made available to Python. SerialMergeScheduler is available.

> That stacktrace is happening while merging term vectors during an
> optimize.  It's specifically occuring when loading the term vectors
> for a given doc X; we read a position from the index stream (tvx) just
> fine, but then when we try to read the first vInt from the document
> stream (tvd) we hit the EOF exception.  So that position was too large
> or the tvd file was somehow truncated.  Weird.
>
> Can you call "writer.setInfoStream(System.out)" and get the error to
> occur and then post the resulting log?  It may shed some light
> here....

I called writer.setMergeScheduler(SerialMergeScheduler()) just after creating 
the writer and called writer.setInfoStream(System.out) just before calling 
optimize(). Below is what I get:

IW [Thread-0]:   flush: segment=_7f docStoreSegment=_7f docStoreOffset=0 
flushDocs=true flushDeletes=false flushDocStores=true numDocs=372 
numBufDelTerms=0
IW [Thread-0]:   index before flush

closeDocStore: 5 files to flush to segment _7f

flush postings as segment _7f numDocs=372
   oldRAMSize=75040 newFlushedSize=21018 docs/MB=18,558.867 new/old=28.009%
IW [Thread-0]: checkpoint: wrote segments file "segments_9j"
Deleter [Thread-0]: now checkpoint "segments_9j" [1 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.fnm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.frq": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.prx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tis": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tii": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.nrm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvf": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.tvd": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.fdx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7f.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9i"
Deleter [Thread-0]:   DecRef "segments_9i": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9i"
IW [Thread-0]: checkpoint: wrote segments file "segments_9k"
Deleter [Thread-0]: now checkpoint "segments_9k" [1 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9j"
Deleter [Thread-0]:   DecRef "_7f.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fnm"
Deleter [Thread-0]:   DecRef "_7f.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.frq"
Deleter [Thread-0]:   DecRef "_7f.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.prx"
Deleter [Thread-0]:   DecRef "_7f.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tis"
Deleter [Thread-0]:   DecRef "_7f.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tii"
Deleter [Thread-0]:   DecRef "_7f.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.nrm"
Deleter [Thread-0]:   DecRef "_7f.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvx"
Deleter [Thread-0]:   DecRef "_7f.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvf"
Deleter [Thread-0]:   DecRef "_7f.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvd"
Deleter [Thread-0]:   DecRef "_7f.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdx"
Deleter [Thread-0]:   DecRef "_7f.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdt"
Deleter [Thread-0]:   DecRef "segments_9j": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9j"
IW [Thread-0]: optimize: index now _7f:c372
IW [Thread-0]:   flush: segment=null docStoreSegment=null docStoreOffset=0 
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372
IW [Thread-0]: at close: _7f:c372
<DBRepositoryView: Lucene (1)> indexed 191 items in 0:00:00.413600
IW [Thread-0]:   flush: segment=_7g docStoreSegment=_7g docStoreOffset=0 
flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372

closeDocStore: 5 files to flush to segment _7g

flush postings as segment _7g numDocs=5
   oldRAMSize=76960 newFlushedSize=437 docs/MB=11,997.437 new/old=0.568%
IW [Thread-0]: checkpoint: wrote segments file "segments_9l"
Deleter [Thread-0]: now checkpoint "segments_9l" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.fnm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.frq": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.prx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tis": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tii": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.nrm": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvf": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.tvd": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.fdx": pre-incr count is 0
Deleter [Thread-0]:   IncRef "_7g.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9k"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "segments_9k": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9k"
IW [Thread-0]: checkpoint: wrote segments file "segments_9m"
Deleter [Thread-0]: now checkpoint "segments_9m" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9l"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fnm"
Deleter [Thread-0]:   DecRef "_7g.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.frq"
Deleter [Thread-0]:   DecRef "_7g.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.prx"
Deleter [Thread-0]:   DecRef "_7g.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tis"
Deleter [Thread-0]:   DecRef "_7g.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tii"
Deleter [Thread-0]:   DecRef "_7g.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.nrm"
Deleter [Thread-0]:   DecRef "_7g.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvx"
Deleter [Thread-0]:   DecRef "_7g.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvf"
Deleter [Thread-0]:   DecRef "_7g.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvd"
Deleter [Thread-0]:   DecRef "_7g.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdx"
Deleter [Thread-0]:   DecRef "_7g.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdt"
Deleter [Thread-0]:   DecRef "segments_9l": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9l"
IW [Thread-0]: optimize: index now _7f:c372 _7g:c5
IW [Thread-0]: add merge to pendingMerges: _7f:c372 _7g:c5 [optimize] [total 1 
pending]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 1
IW [Thread-0]: now merge
   merge=_7f:c372 _7g:c5 into _7h [optimize]
   index=_7f:c372 _7g:c5
IW [Thread-0]: merging _7f:c372 _7g:c5 into _7h [optimize]
IW [Thread-0]: merge: total 377 docs
IW [Thread-0]: hit exception during merge; now refresh deleter on segment _7h
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fdt"
Deleter [Thread-0]: delete "_7h.fdt"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fdx"
Deleter [Thread-0]: delete "_7h.fdx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.fnm"
Deleter [Thread-0]: delete "_7h.fnm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.frq"
Deleter [Thread-0]: delete "_7h.frq"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.nrm"
Deleter [Thread-0]: delete "_7h.nrm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.prx"
Deleter [Thread-0]: delete "_7h.prx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tii"
Deleter [Thread-0]: delete "_7h.tii"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tis"
Deleter [Thread-0]: delete "_7h.tis"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvd"
Deleter [Thread-0]: delete "_7h.tvd"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvf"
Deleter [Thread-0]: delete "_7h.tvf"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced 
file "_7h.tvx"
Deleter [Thread-0]: delete "_7h.tvx"
IW [Thread-0]: hit exception during merge
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.cfs": pre-decr count is 2
java.io.IOException: read past EOF
         at 
org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
         at 
org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
         at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
         at 
org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
         at 
org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
         at 
org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
         at 
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
         at 
org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
         at 
org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1668)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1604)
         at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)

IW [Thread-0]:   flush: segment=null docStoreSegment=null docStoreOffset=0 
flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 
numBufDelTerms=0
IW [Thread-0]:   index before flush _7f:c372 _7g:c5
IW [Thread-0]: close: wrote segments file "segments_9n"
Deleter [Thread-0]: now checkpoint "segments_9n" [2 segments ; isCommit = 
true]
Deleter [Thread-0]:   IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]:   IncRef "_7g.cfs": pre-incr count is 1
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9m"
Deleter [Thread-0]:   DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "_7g.cfs": pre-decr count is 2
Deleter [Thread-0]:   DecRef "segments_9m": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9m"
IW [Thread-0]: at close: _7f:c372 _7g:c5

Andi..


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message