Date: Fri, 28 Sep 2007 18:44:53 -0700 (PDT)
From: Andi Vajda
To: Michael McCandless
Cc: java-dev@lucene.apache.org
Subject: Re: possible bug with indexing with term vectors

On Fri, 28 Sep 2007, Michael McCandless wrote:

>> I tried all morning to isolate the problem but I seem to be unable
>> to reproduce it in a simple unit test. In my application, I've been
>> able to get errors by doing even less: just creating an FSDirectory
>> and adding documents with fields with term vectors fails when
>> optimizing the index with the error below. I even tried to add the
>> same documents, in the same order, in the unit test, but to no
>> avail. It just works.
>
> Are you trying your unit test first in Python (using PyLucene)?

No, I wrote it in Java to begin with. But it's a good idea; I should try
it from PyLucene too.

>> What is different about my environment? Well, I'm running PyLucene,
>> but the new one, the one using Apple's Java VM, the same VM I'm
>> using to run the unit test. And I'm not doing anything special like
>> calling back into Python or anything; I'm just calling regular
>> Lucene APIs, adding documents into an IndexWriter on an FSDirectory
>> using a StandardAnalyzer. If I stop using term vectors, all is
>> working fine.
>
> Are your documents irregular wrt term vectors? (Ie some docs have
> none, others store the terms but not positions/offsets, etc?) Any
> interesting changes to Lucene's defaults (autoCommit=false, etc)?

All default config values, no config changes. All documents follow the
same pattern of having five fields, one with term vectors.
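For what it's worth, the shape of the indexing code is roughly the sketch
below. The field names and values are invented for the example; only the
structure (five fields, one of them with term vectors, all writer defaults)
matches my application:

    import java.io.IOException;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class TermVectorRepro {
        public static void main(String[] args) throws IOException {
            // IndexWriter on an FSDirectory with a StandardAnalyzer, all defaults.
            FSDirectory dir = FSDirectory.getDirectory("/tmp/tv-test-index");
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

            for (int i = 0; i < 377; i++) {
                Document doc = new Document();
                // field names and values are made up for the example
                doc.add(new Field("uuid", "item-" + i, Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("kind", "note", Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("date", "20070928", Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("title", "title " + i, Field.Store.YES, Field.Index.TOKENIZED));
                // the one field that carries term vectors
                doc.add(new Field("contents", "body text for item " + i, Field.Store.NO,
                                  Field.Index.TOKENIZED, Field.TermVector.YES));
                writer.addDocument(doc);
            }

            // in my application, optimize() is where the "read past EOF" shows up
            writer.optimize();
            writer.close();
        }
    }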
>> I'd like to get to the bottom of this but could use some help. Does
>> the stacktrace below ring a bell? Is there a way to run the whole
>> indexing and optimizing in one single thread?
>
> You can easily turn off the concurrent (background) merges by doing
> this:
>
>   writer.setMergeScheduler(new SerialMergeScheduler())
>
> though that probably isn't punched through to Python in PyLucene. You
> can also build a Lucene JAR w/ a small change to IndexWriter.java to
> do the same thing.

The new PyLucene is built with a code generator, and all public APIs and
classes are made available to Python. SerialMergeScheduler is available.

> That stacktrace is happening while merging term vectors during an
> optimize. It's specifically occurring when loading the term vectors
> for a given doc X; we read a position from the index stream (tvx) just
> fine, but then when we try to read the first vInt from the document
> stream (tvd) we hit the EOF exception. So that position was too large
> or the tvd file was somehow truncated. Weird.
>
> Can you call "writer.setInfoStream(System.out)" and get the error to
> occur and then post the resulting log? It may shed some light
> here....

I called writer.setMergeScheduler(SerialMergeScheduler()) just after
creating the writer and called writer.setInfoStream(System.out) just
before calling optimize().
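In Java terms the call order is essentially the sketch below (in my case
the calls go through PyLucene, so the Python spelling just drops the
'new'); it plugs into the test sketch above:

    // requires org.apache.lucene.index.SerialMergeScheduler
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    writer.setMergeScheduler(new SerialMergeScheduler());  // right after creating the writer

    // ... addDocument() calls ...

    writer.setInfoStream(System.out);  // just before optimizing
    writer.optimize();
    writer.close();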
Below is what I get:

IW [Thread-0]: flush: segment=_7f docStoreSegment=_7f docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=true numDocs=372 numBufDelTerms=0
IW [Thread-0]: index before flush
closeDocStore: 5 files to flush to segment _7f
flush postings as segment _7f numDocs=372
oldRAMSize=75040 newFlushedSize=21018 docs/MB=18,558.867 new/old=28.009%
IW [Thread-0]: checkpoint: wrote segments file "segments_9j"
Deleter [Thread-0]: now checkpoint "segments_9j" [1 segments ; isCommit = true]
Deleter [Thread-0]: IncRef "_7f.fnm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.frq": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.prx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tis": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tii": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.nrm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvf": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.tvd": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.fdx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7f.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9i"
Deleter [Thread-0]: DecRef "segments_9i": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9i"
IW [Thread-0]: checkpoint: wrote segments file "segments_9k"
Deleter [Thread-0]: now checkpoint "segments_9k" [1 segments ; isCommit = true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9j"
Deleter [Thread-0]: DecRef "_7f.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fnm"
Deleter [Thread-0]: DecRef "_7f.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.frq"
Deleter [Thread-0]: DecRef "_7f.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.prx"
Deleter [Thread-0]: DecRef "_7f.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tis"
Deleter [Thread-0]: DecRef "_7f.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tii"
Deleter [Thread-0]: DecRef "_7f.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.nrm"
Deleter [Thread-0]: DecRef "_7f.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvx"
Deleter [Thread-0]: DecRef "_7f.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvf"
Deleter [Thread-0]: DecRef "_7f.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.tvd"
Deleter [Thread-0]: DecRef "_7f.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdx"
Deleter [Thread-0]: DecRef "_7f.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7f.fdt"
Deleter [Thread-0]: DecRef "segments_9j": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9j"
IW [Thread-0]: optimize: index now _7f:c372
IW [Thread-0]: flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372
IW [Thread-0]: at close: _7f:c372
indexed 191 items in 0:00:00.413600
IW [Thread-0]: flush: segment=_7g docStoreSegment=_7g docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372
closeDocStore: 5 files to flush to segment _7g
flush postings as segment _7g numDocs=5
oldRAMSize=76960 newFlushedSize=437 docs/MB=11,997.437 new/old=0.568%
IW [Thread-0]: checkpoint: wrote segments file "segments_9l"
Deleter [Thread-0]: now checkpoint "segments_9l" [2 segments ; isCommit = true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.fnm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.frq": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.prx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tis": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tii": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.nrm": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvf": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.tvd": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.fdx": pre-incr count is 0
Deleter [Thread-0]: IncRef "_7g.fdt": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9k"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "segments_9k": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9k"
IW [Thread-0]: checkpoint: wrote segments file "segments_9m"
Deleter [Thread-0]: now checkpoint "segments_9m" [2 segments ; isCommit = true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 0
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9l"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.fnm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fnm"
Deleter [Thread-0]: DecRef "_7g.frq": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.frq"
Deleter [Thread-0]: DecRef "_7g.prx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.prx"
Deleter [Thread-0]: DecRef "_7g.tis": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tis"
Deleter [Thread-0]: DecRef "_7g.tii": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tii"
Deleter [Thread-0]: DecRef "_7g.nrm": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.nrm"
Deleter [Thread-0]: DecRef "_7g.tvx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvx"
Deleter [Thread-0]: DecRef "_7g.tvf": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvf"
Deleter [Thread-0]: DecRef "_7g.tvd": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.tvd"
Deleter [Thread-0]: DecRef "_7g.fdx": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdx"
Deleter [Thread-0]: DecRef "_7g.fdt": pre-decr count is 1
Deleter [Thread-0]: delete "_7g.fdt"
Deleter [Thread-0]: DecRef "segments_9l": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9l"
IW [Thread-0]: optimize: index now _7f:c372 _7g:c5
IW [Thread-0]: add merge to pendingMerges: _7f:c372 _7g:c5 [optimize] [total 1 pending]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 1
IW [Thread-0]: now merge
  merge=_7f:c372 _7g:c5 into _7h [optimize]
  index=_7f:c372 _7g:c5
IW [Thread-0]: merging _7f:c372 _7g:c5 into _7h [optimize]
IW [Thread-0]: merge: total 377 docs
IW [Thread-0]: hit exception during merge; now refresh deleter on segment _7h
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.fdt"
Deleter [Thread-0]: delete "_7h.fdt"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.fdx"
Deleter [Thread-0]: delete "_7h.fdx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.fnm"
Deleter [Thread-0]: delete "_7h.fnm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.frq"
Deleter [Thread-0]: delete "_7h.frq"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.nrm"
Deleter [Thread-0]: delete "_7h.nrm"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.prx"
Deleter [Thread-0]: delete "_7h.prx"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.tii"
Deleter [Thread-0]: delete "_7h.tii"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.tis"
Deleter [Thread-0]: delete "_7h.tis"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.tvd"
Deleter [Thread-0]: delete "_7h.tvd"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.tvf"
Deleter [Thread-0]: delete "_7h.tvf"
Deleter [Thread-0]: refresh [prefix=_7h]: removing newly created unreferenced file "_7h.tvx"
Deleter [Thread-0]: delete "_7h.tvx"
IW [Thread-0]: hit exception during merge
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.cfs": pre-decr count is 2
java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
    at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
    at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
    at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
    at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
    at org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:37)
    at org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:1668)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1604)
    at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
IW [Thread-0]: flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 numBufDelTerms=0
IW [Thread-0]: index before flush _7f:c372 _7g:c5
IW [Thread-0]: close: wrote segments file "segments_9n"
Deleter [Thread-0]: now checkpoint "segments_9n" [2 segments ; isCommit = true]
Deleter [Thread-0]: IncRef "_7f.cfs": pre-incr count is 1
Deleter [Thread-0]: IncRef "_7g.cfs": pre-incr count is 1
Deleter [Thread-0]: deleteCommits: now remove commit "segments_9m"
Deleter [Thread-0]: DecRef "_7f.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "_7g.cfs": pre-decr count is 2
Deleter [Thread-0]: DecRef "segments_9m": pre-decr count is 1
Deleter [Thread-0]: delete "segments_9m"
IW [Thread-0]: at close: _7f:c372 _7g:c5

Andi..

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org