From: Grant Ingersoll
Subject: Re: possible bug with indexing with term vectors
Date: Sat, 29 Sep 2007 09:02:02 -0400
To: java-dev@lucene.apache.org
Message-Id: <072FB914-C1BA-4C09-9691-B5403DB4332D@gmail.com>
In-Reply-To: <1191069310.14069.1213242389@webmail.messagingengine.com>
References: <1188046473.28037.1207227453@webmail.messagingengine.com> <1191069310.14069.1213242389@webmail.messagingengine.com>

There are a couple of JIRA issues related to TVs as well, mostly edge
cases, but Andi might want to take a look at them to see if they
describe his situation.

-Grant

On Sep 29, 2007, at 8:35 AM, Michael McCandless wrote:

>
> You are right Grant -- good catch!!!  I have a unit test showing it
> now.  Thank you :)
>
> So, this case is tickled if you have a doc (or docs) that have some
> fields with term vectors enabled, but then later as part of the same
> buffered set of docs you have 1 or more docs that have no fields with
> term vectors enabled.
>
> I'll fix it.
>
> The thing is, from Andi's description I'm not sure this is the case
> he's hitting?  He said all docs have 5 fields, one of them with term
> vectors enabled ... hmmm.
>
> Mike
>
> On Sat, 29 Sep 2007 07:59:13 -0400, "Grant Ingersoll" said:
>> Hmmm, not sure, but in looking at DocumentsWriter, it seems like
>> lines around 553 might be at issue:
>> if (tvx != null) {
>>   tvx.writeLong(tvd.getFilePointer());
>>   if (numVectorFields > 0) {
>>     tvd.writeVInt(numVectorFields);
>>     for(int i=0;i<numVectorFields;i++)
>>       tvd.writeVInt(vectorFieldNumbers[i]);
>>     assert 0 == vectorFieldPointers[0];
>>     tvd.writeVLong(tvf.getFilePointer());
>>     long lastPos = vectorFieldPointers[0];
>>     for(int i=1;i<numVectorFields;i++) {
>>       long pos = vectorFieldPointers[i];
>>       tvd.writeVLong(pos-lastPos);
>>       lastPos = pos;
>>     }
>>     tvfLocal.writeTo(tvf);
>>     tvfLocal.reset();
>>   }
>> }
>>
>> Specifically, the exception seems to be thrown while trying to read
>> the vInt that holds the number of fields that have vectors.
>> However, DocumentsWriter only writes out this vInt if
>> numVectorFields is > 0.
>>
>> I think you might try:
>> if (numVectorFields > 0){
>>   ....
>> }
>> else{
>>   tvd.writeVInt(0);
>> }
>>
>> In the old TermVectorsWriter, it used to be:
>> private void writeDoc() throws IOException {
>>   if (isFieldOpen())
>>     throw new IllegalStateException("Field is still open while writing document");
>>   //System.out.println("Writing doc pointer: " + currentDocPointer);
>>   // write document index record
>>   tvx.writeLong(currentDocPointer);
>>
>>   // write document data record
>>   final int size = fields.size();
>>
>>   // write the number of fields
>>   tvd.writeVInt(size);
>>
>>   // write field numbers
>>   for (int i = 0; i < size; i++) {
>>     TVField field = (TVField) fields.elementAt(i);
>>     tvd.writeVInt(field.number);
>>   }
>>
>> http://svn.apache.org/viewvc/lucene/java/tags/lucene_2_2_0/src/java/org/apache/lucene/index/TermVectorsWriter.java?view=markup
>>
>>
>>
>> On Sep 28, 2007, at 4:26 PM, Andi Vajda wrote:
>>
>>>
>>> On Fri, 28 Sep 2007, Andi Vajda wrote:
>>>
>>>> I found a bug with indexing documents that contain fields with
>>>> Term Vectors.
>>>> The indexing fails with 'reading past EOF' errors in
>>>> what seems to be the index optimizing phase during addIndexes(). (I
>>>> index first into a RAMDirectory, then addIndexes() into an
>>>> FSDirectory).
>>>>
>>>> I have not filed the bug yet formally as I need to isolate the
>>>> code. If I turn indexing with term vectors off, indexing completes
>>>> fine.
>>>
>>> I tried all morning to isolate the problem but I seem to be unable
>>> to reproduce it in a simple unit test. In my application, I've been
>>> able to get errors by doing even less: just creating an FSDirectory
>>> and adding documents with fields with term vectors fails when
>>> optimizing the index with the error below. I even tried to add the
>>> same documents, in the same order, in the unit test but to no
>>> avail. It just works.
>>>
>>> What is different about my environment? Well, I'm running
>>> PyLucene, but the new one, the one using Apple's Java VM, the
>>> same VM I'm using to run the unit test. And I'm not doing anything
>>> special like calling back into Python or something, I'm just
>>> calling regular Lucene APIs adding documents into an IndexWriter on
>>> an FSDirectory using a StandardAnalyzer. If I stop using term
>>> vectors, all is working fine.
>>>
>>> I'd like to get to the bottom of this but could use some help. Does
>>> the stacktrace below ring a bell? Is there a way to run the whole
>>> indexing and optimizing in one single thread?
>>>
>>> Thanks!
>>>
>>> Andi..
>>>
>>> Exception in thread "Thread-4" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
>>>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:263)
>>> Caused by: java.io.IOException: read past EOF
>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>     at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>     at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>> java.io.IOException: background merge hit exception: _5u:c372 _5v:c5 into _5w [optimize]
>>>     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1621)
>>>     at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:1571)
>>> Caused by: java.io.IOException: read past EOF
>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:146)
>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
>>>     at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
>>>     at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:207)
>>>     at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:692)
>>>     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:279)
>>>     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>>>     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:2898)
>>>     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:2647)
>>>     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:232)
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>
>> ------------------------------------------------------
>> Grant Ingersoll
>> http://www.grantingersoll.com/
>> http://lucene.grantingersoll.com
>> http://www.paperoftheweek.com/
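
The writer/reader mismatch discussed in this thread can be sketched in isolation. This is only an illustration under simplifying assumptions, not Lucene code: counts and field numbers are single bytes rather than vInts, everything lives in one in-memory stream instead of the tvx/tvd/tvf files, and the names VectorCountSketch, writeDocs and readDocs are hypothetical. The reader always expects a field count per document, so a writer that skips the count for a document without term vector fields makes the reader run off the end of the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical, simplified model of the per-document field count; not a Lucene API.
class VectorCountSketch {

    // Writes one record per document: a field count followed by the field numbers.
    // With alwaysWriteCount == false, a document with no vector fields writes
    // nothing at all, which mirrors the "only write if numVectorFields > 0" branch.
    static byte[] writeDocs(int[][] docs, boolean alwaysWriteCount) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int[] fieldNumbers : docs) {
            if (fieldNumbers.length > 0) {
                out.write(fieldNumbers.length);
                for (int number : fieldNumbers) {
                    out.write(number);
                }
            } else if (alwaysWriteCount) {
                out.write(0); // the suggested fix: record an explicit zero count
            }
        }
        return out.toByteArray();
    }

    // Reads numDocs records, always expecting a count first.
    // Returns the total number of vector fields seen across all documents.
    static int readDocs(byte[] data, int numDocs) {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int total = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            int count = in.read(); // -1 signals end of stream
            if (count < 0) {
                throw new IllegalStateException("read past EOF at doc " + doc);
            }
            for (int i = 0; i < count; i++) {
                if (in.read() < 0) {
                    throw new IllegalStateException("read past EOF at doc " + doc);
                }
            }
            total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] docs = { {3}, {} }; // the second document has no term vector fields
        System.out.println(readDocs(writeDocs(docs, true), docs.length));
        try {
            readDocs(writeDocs(docs, false), docs.length);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this toy model the failure only shows up when the document without vectors comes last. If it sat in the middle of the stream, the reader would silently consume the next document's count instead.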