lucene-dev mailing list archives

From Arvind Srinivasan <luc...@ziplip.com>
Subject More Data Integrity Issue: Docs out of Order
Date Tue, 14 Jun 2005 01:42:18 GMT

Hi,
I am seeing an issue similar to the one reported in:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23650
On examining the segments, I found the following inconsistencies:
(a) The segments being merged had doc numbers greater than maxDoc.
I don't know how they got into this state, but it occurs using standard Lucene
code.
(b) Strangely, some documents had terms with zero frequency. When this occurred,
the zero-frequency term had several postings for docid 0.
Example, with postings shown as (docid freq). maxDoc = 7749, no deletions.
Merging msgBody; text=it; sz=2  --- The field name is msgBody, the term is "it",
                                    and two segments have the term.
(0 0)(0 0)(0 0)..........(0 0)(4 6)(5 3)(6 1)(9 1)(10 2)(12 1)......
...(6791 2)(6794 3)(6796 2)(6798 16)(6801 1)(6805 1)(6806 5)
(6808 1)(6810 1)(6815 2)(6816 3)(6817 1)(6818 1)(6821 4)(6822 1)
(6824 3)(6826 4)(6828 1)(6829 3)(12549 2)doc exceeds count
7749(13570 1)doc exceeds count 7749(14896 1)doc exceeds count 7749
(15028 1)doc exceeds count 7749(15357 1)doc exceeds count 7749
(15427 1)doc exceeds count 7749(15534 1)doc exceeds count 7749
(15535 1)doc exceeds count 7749(15653 1)doc exceeds count 7749
(16530 1)doc exceeds count 7749(17108 1).......
(c) Also, the zero frequency was not limited to document 0; there was
another instance.
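
To make inconsistencies (a) through (c) concrete, here is a minimal sketch of an audit over one term's postings. It is not Lucene code: the class PostingAudit, the method audit, and the int[][] representation of (docid freq) pairs are all hypothetical, chosen only to mirror the dump above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene. Each posting is an int[2]
// holding {docid, freq}, mirroring the "(docid freq)" dump format.
final class PostingAudit {

    /** Returns one message per anomaly found in a single term's postings. */
    static List<String> audit(int[][] postings, int maxDoc) {
        List<String> problems = new ArrayList<>();
        for (int[] p : postings) {
            int doc = p[0];
            int freq = p[1];
            String posting = "(" + doc + " " + freq + ")";
            if (freq == 0) {
                // Cases (b) and (c): a posting exists but claims zero occurrences.
                problems.add(posting + " zero frequency");
            }
            if (doc >= maxDoc) {
                // Case (a): doc number lies outside the segment.
                problems.add(posting + " doc exceeds count " + maxDoc);
            }
        }
        return problems;
    }
}
```

Run against a few postings from the example with maxDoc = 7749, this would flag (0 0) for zero frequency and (12549 2) for exceeding the count, and leave (4 6) and (6829 3) alone.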

One workaround that seemed to resolve the issue was to keep maxDoc as a
member variable in SegmentMergeInfo and skip the posting (or throw an
exception) if an inconsistent state is detected.

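****ADD to SegmentMerger just before the "docs out of order" check.
   // Skip postings that claim zero occurrences of the term.
   if (postings.freq() == 0) {
      continue;
   }
   // Skip doc numbers that fall outside the segment.
   if (doc >= smi.maxDoc) {
      //sbLog.append("doc exceeds count \r\n " + smi.maxDoc);
      continue;
   }
****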
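
For anyone who wants to try the idea outside of Lucene, the following is a self-contained sketch of a merge loop with the same two guards. Everything in it is an assumption made for illustration: GuardedMergeSketch, appendPostings, the int[] {docid, freq} pairs and the base offset are stand-ins, not the actual SegmentMerger or SegmentMergeInfo API. It also assumes the maxDoc comparison is made on the segment-local doc number, before the base offset is added.

```java
import java.util.List;

// Simplified stand-in for a posting-merge loop; none of these names are Lucene's.
final class GuardedMergeSketch {

    /**
     * Appends one segment's postings for a term to the merged output.
     * postings: {docid, freq} pairs local to the segment.
     * maxDoc:   number of documents in that segment.
     * base:     doc number offset of the segment in the merged index.
     */
    static List<int[]> appendPostings(int[][] postings, int maxDoc, int base,
                                      List<int[]> out) {
        int lastDoc = out.isEmpty() ? -1 : out.get(out.size() - 1)[0];
        for (int[] p : postings) {
            int doc = p[0];
            int freq = p[1];
            if (freq == 0) {
                continue; // guard 1: drop zero-frequency postings
            }
            if (doc >= maxDoc) {
                continue; // guard 2: drop doc numbers outside the segment
            }
            doc += base; // assumption: remap only after the guards
            if (doc < lastDoc) {
                throw new IllegalStateException("docs out of order");
            }
            out.add(new int[] { doc, freq });
            lastDoc = doc;
        }
        return out;
    }
}
```

Skipping the bad postings keeps the merged output consistent, but it silently discards data, so throwing an exception instead of `continue` may be preferable while the root cause is still unknown.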

At least putting in such a check would avoid corrupting the segments and would
get us closer to the real problem: why freq=0 and why the doc number exceeds
maxDoc. Note that the code already includes the fix for the other segment
corruption issue that I previously reported (namely, using a zero-length file).

Arvind.