lucene-dev mailing list archives

From bugzi...@apache.org
Subject DO NOT REPLY [Bug 27587] - java.io.IOException: read past EOF when searching index
Date Mon, 15 Mar 2004 17:58:07 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=27587>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=27587

java.io.IOException: read past EOF when searching index





------- Additional Comments From cutting@apache.org  2004-03-15 17:58 -------
I have not had a chance to look at this closely yet, but it sounds to me like a
case where numSkipped is incorrectly computed.  This algorithm has proven tricky
to get right.  The first version I checked in was buggy and failed in a similar
way.  I spent a day staring at it, and came up with the current version, which
may not yet be right.

The workaround is to comment out lines 179 to 222.  If things work when you do
that, then this is probably a bug in the commented-out code.

If someone else who is good at debugging fussy algorithms has time, please look
at this.  Your prize will be much admiration from your peers.  Otherwise I'll
try to get to it when I next have time.

The skip data is written by SegmentMerger.java, lines 415 to 445.  I think that
code is correct.  It writes a sequence of
<docNumDelta,freqPointerDelta,proxPointerDelta> tuples.  Each docNumDelta is the
difference between that tuple's docNum and the previous tuple's docNum.  The
docNums contained in this sequence are the docNum *before* every 16th entry in
the TermDocs.  The
freqPointer and proxPointer indicate the position of every 16th entry in the
TermDocs and TermPositions in the .frq and .prx file, respectively.  The
sequence is stored in the .frq file, at the end of the TermDocs for each term
whose frequency is greater than 16.  I hope this makes some sense.  I still need
to add this to the 1.4 file format documentation...
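
To make the layout above concrete, here is a minimal sketch of how such a
tuple sequence could be derived from in-memory arrays.  It is not the code in
SegmentMerger.java.  The class name, method name, and array parameters are
hypothetical, and the choice of starting values for the deltas (0 for docNum,
the term's first pointers for the two file pointers) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the actual SegmentMerger code.
public class SkipDataSketch {
    // Skip interval of 16, as described in the message.
    static final int SKIP_INTERVAL = 16;

    /**
     * Builds <docNumDelta, freqPointerDelta, proxPointerDelta> tuples.
     * docNums[i], freqPointers[i] and proxPointers[i] describe the i-th
     * entry of one term (hypothetical in-memory representation).
     */
    static List<long[]> buildSkipTuples(int[] docNums,
                                        long[] freqPointers,
                                        long[] proxPointers) {
        List<long[]> tuples = new ArrayList<>();
        if (docNums.length == 0) {
            return tuples;
        }
        // Assumed bases: docNum 0 and the pointers of the term's first entry.
        long lastDoc = 0;
        long lastFreq = freqPointers[0];
        long lastProx = proxPointers[0];
        for (int i = SKIP_INTERVAL; i < docNums.length; i += SKIP_INTERVAL) {
            // The docNum *before* every 16th entry ...
            long doc = docNums[i - 1];
            // ... paired with the file positions *of* that 16th entry.
            tuples.add(new long[] {
                doc - lastDoc,
                freqPointers[i] - lastFreq,
                proxPointers[i] - lastProx
            });
            lastDoc = doc;
            lastFreq = freqPointers[i];
            lastProx = proxPointers[i];
        }
        return tuples;
    }
}
```

In this sketch a term with 16 or fewer entries produces no tuples at all,
since the loop only emits a tuple once a 17th entry exists.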

The sequence is read only by TermDocs.skipTo(), to enable skipping ahead by 16
entries at a time, which can accelerate many kinds of queries.  This is the
logic that has proven tricky.
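
As a rough illustration of where the numSkipped bookkeeping comes in, here is
a simplified sketch of a skipTo() over in-memory arrays.  It is not the
TermDocs implementation and is not claimed to reproduce or fix the bug.  The
class, the array-based signature, and the use of an array index as the return
value are all assumptions made for the example.

```java
// Illustrative only: not the actual TermDocs.skipTo() code.
public class SkipToSketch {
    static final int SKIP_INTERVAL = 16;

    /**
     * Returns the index of the first entry whose docNum >= target, or -1.
     * skipTuples[k] = {docNumDelta, freqPointerDelta, proxPointerDelta}
     * (hypothetical decoded form of the skip data).
     */
    static int skipTo(int[] docNums, long[][] skipTuples, int target) {
        long skipDoc = 0;    // docNum recorded by the last skip taken
        int numSkipped = 0;  // entries jumped over so far

        for (long[] tuple : skipTuples) {
            long nextSkipDoc = skipDoc + tuple[0];
            // The recorded docNum belongs to the entry *before* the skip
            // point.  If it is already >= target, the target may sit in
            // the block we would jump over, so stop here.
            if (nextSkipDoc >= target) {
                break;
            }
            skipDoc = nextSkipDoc;
            numSkipped += SKIP_INTERVAL;
            // A file-based reader would also accumulate tuple[1] and
            // tuple[2] here and seek the .frq and .prx streams.
        }

        // Finish with a linear scan from the skip point.
        for (int i = numSkipped; i < docNums.length; i++) {
            if (docNums[i] >= target) {
                return i;
            }
        }
        return -1;
    }
}
```

The sketch always starts from the first entry of the term.  A real reader
that has already consumed some entries would have to reconcile numSkipped
with its current position, which this example does not attempt.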

Any volunteers looking for hacker points?


