From: Arvind Srinivasan
Date: Tue, 14 Jun 2005 19:49:07 -0700 (PDT)
Subject: Data Integrity Rules

Hi,

In an earlier article, Doug Cutting described a method to verify a segment's integrity by simply merging the segment into a NullDirectory. We have found several instances where a segment becomes corrupt even though it passes the NullDirectory test. Merging into a NullDirectory only protects us against disk errors.
There are structural errors that make segments corrupt after a few iterations of merges.

I would like to define a simple rule:

"A segment has data integrity if and only if the segment is readable and successively mergeable without any errors."

For example, in the current version you can add an empty string through the DocumentWriter. At first the segment is still readable and mergeable, so nothing looks wrong. But after a few merge iterations, the segment merge fails with a "term out of order" exception in TermInfosWriter. Now you have an inoperable search engine. Granted, the tokenizer is at fault, but a simple issue like that should not bring the search engine down.

Similarly, we have found instances of term postings with zero frequency (not sure how they got into that state) and with document ids greater than the max doc of the segment. See the earlier posting or bug (a).

Therefore I suggest adding a few more checks to DocumentWriter, right after line "283" in DocumentWriter.java:

    // skip terms with empty text
    if (posting.term.text.length() == 0) {
      continue;
    }
    // add an entry to the freq file
    int postingFreq = posting.freq;
    // skip postings with a non-positive frequency
    if (postingFreq <= 0) {
      continue;
    }

Also, please apply the changes to SegmentMerger as suggested in bug 23650.

I also think we should create test cases that keep segments robust, so that they are not derailed by edge cases.

See also:
(a) http://issues.apache.org/bugzilla/show_bug.cgi?id=23650
(b) http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200505.mbox/%3cN0H4L0B1EFP3JVL4EBFRBUMHLSJ3MAAGKCN2OMDT@ziplip.com%3e
(c) http://issues.apache.org/bugzilla/show_bug.cgi?id=35029

Thanks,
Arvind.
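As a starting point for such test cases, here is a minimal standalone sketch of the two checks suggested for DocumentWriter. It does not use Lucene's own classes: SketchPosting and PostingGuard are hypothetical names made up for illustration, and the null check is an extra assumption that is not part of the suggested patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a posting (term text plus frequency).
// This is NOT Lucene's internal Posting class.
final class SketchPosting {
    final String termText;
    final int freq;

    SketchPosting(String termText, int freq) {
        this.termText = termText;
        this.freq = freq;
    }
}

// Hypothetical helper that applies the two guards outside of DocumentWriter.
final class PostingGuard {
    private PostingGuard() {}

    // A posting is writable only if its term text is non-empty and its
    // frequency is positive. The null check is an added assumption.
    static boolean isWritable(SketchPosting p) {
        return p.termText != null && p.termText.length() > 0 && p.freq > 0;
    }

    // Returns the postings that pass both guards, preserving their order.
    static List<SketchPosting> writable(List<SketchPosting> postings) {
        List<SketchPosting> kept = new ArrayList<>();
        for (SketchPosting p : postings) {
            if (!isWritable(p)) {
                continue; // same effect as the "continue" in the suggested checks
            }
            kept.add(p);
        }
        return kept;
    }
}
```

A test built on this would feed in a posting with an empty term and one with zero frequency, and assert that neither survives the filter. A fuller test would also write the segment and merge it repeatedly, which this sketch does not attempt.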