lucene-dev mailing list archives

From Bruce Ritchie <br...@jivesoftware.com>
Subject unexpected behavior with reader.terms(term) not following contract
Date Mon, 08 Sep 2003 23:31:46 GMT
All,

I've been investigating a possible improvement to the DateFilter and have
run into an issue I believe is a bug with Lucene 1.3 RC1.

Synopsis:

I'm trying to add a clause to the bits(IndexReader reader) method of the
DateFilter class to eliminate a compareTo() test and improve performance.
This should be allowed whenever the DateFilter has a start date but no end
date, since the contract of reader.terms(term) says that each term in the
enumeration is greater than all that precede it. Thus, if the end date is
equal to DateField.MAX_DATE_STRING(), we should be able to skip the
"while (enum.term().compareTo(stop) <= 0)" test and improve the performance
of the filter on a large document set.

/** Returns an enumeration of all terms after a given term.
     The enumeration is ordered by Term.compareTo().  Each term
     is greater than all that precede it in the enumeration.
    */
public abstract TermEnum terms(Term t) throws IOException;
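
For illustration, the shortcut described in the synopsis can be sketched
without Lucene as below. This is only a rough model, not the attached
patch: the class OpenEndedFilterSketch, the method bits(...) and its
openEnded flag are hypothetical names, a sorted list of strings stands in
for the TermEnum, list positions stand in for document numbers, and the
bounds "0dkaj0000" and "zzzzzzzzz" are made-up stand-ins for the encoded
start date and DateField.MAX_DATE_STRING(). It also assumes the
enumeration only ever contains terms of a single field.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for DateFilter.bits(); not Lucene code.
class OpenEndedFilterSketch {

    // Sets a bit for every term position at or after 'start'. When
    // openEnded is true the upper-bound comparison is skipped entirely,
    // which is the proposed optimization. Assumes sortedTerms is in
    // String.compareTo() order and holds terms of one field only.
    static BitSet bits(List<String> sortedTerms, String start,
                       String stop, boolean openEnded) {
        BitSet bits = new BitSet(sortedTerms.size());
        int i = 0;
        // Seek to the first term >= start, as reader.terms(term) would.
        while (i < sortedTerms.size()
                && sortedTerms.get(i).compareTo(start) < 0) {
            i++;
        }
        for (; i < sortedTerms.size(); i++) {
            // The test the synopsis wants to skip when there is no end date.
            if (!openEnded && sortedTerms.get(i).compareTo(stop) > 0) {
                break;
            }
            bits.set(i);
        }
        return bits;
    }

    public static void main(String[] args) {
        // Term texts taken from the debug output in this message.
        List<String> terms = Arrays.asList(
                "0di0opnjs", "0dkaji5zk", "0dkaji8aw", "0dkbm48fr");
        System.out.println(bits(terms, "0dkaj0000", "zzzzzzzzz", false));
        System.out.println(bits(terms, "0dkaj0000", "zzzzzzzzz", true));
    }
}
```

Under those assumptions both calls select the same positions, so dropping
the comparison changes nothing. The sketch says nothing about what happens
if the real enumeration yields terms the model does not contain.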


Problem:

The above contract does not seem to hold in my testing. The attached
modified DateFilter.bits(..) method seems to show that enum.next() will
indeed return a term that is less than all terms preceding it in the
enumeration.

With my current index I create a DateFilter via

filter = DateFilter.After("creationDate", afterDate);

where afterDate is set to Sept 07 00:00:00 EDT 2003.

The output from my debugging statement is as follows:

setting bit enabled for doc 466305, date Sun Sep 07 00:00:02 EDT 2003, term text was 0dkaji5zk
setting bit enabled for doc 466306, date Sun Sep 07 00:00:05 EDT 2003, term text was 0dkaji8aw
setting bit enabled for doc 466620, date Sun Sep 07 00:00:13 EDT 2003, term text was 0dkajieh4
setting bit enabled for doc 472854, date Sun Sep 07 00:00:15 EDT 2003, term text was 0dkajig0o
setting bit enabled for doc 472855, date Sun Sep 07 00:00:27 EDT 2003, term text was 0dkajipa0
setting bit enabled for doc 467844, date Sun Sep 07 00:00:58 EDT 2003, term text was 0dkajjd74
<snipped for brevity>
setting bit enabled for doc 474111, date Sun Sep 07 17:37:52 EDT 2003, term text was 0dkblajr4
setting bit enabled for doc 474112, date Sun Sep 07 17:38:01 EDT 2003, term text was 0dkblaqp4
setting bit enabled for doc 474044, date Sun Sep 07 17:38:09 EDT 2003, term text was 0dkblawvc
setting bit enabled for doc 474091, date Sun Sep 07 18:00:57 EDT 2003, term text was 0dkbm48fr
setting bit enabled for doc 84, date Wed Dec 31 19:00:00 EST 1969, term text was 10
setting bit enabled for doc 85, date Wed Dec 31 19:00:00 EST 1969, term text was 10
setting bit enabled for doc 86, date Wed Dec 31 19:00:00 EST 1969, term text was 10
<and so on and so forth>


From the above debug logging you can see that after enum.next() the
enumeration is positioned on a term with a text of '10'. While this is
logically greater than or equal to the preceding text according to
String.compareTo(), I'm uncertain as to where the '10' text is coming from.
As an example, document #86 returns the following in another search:

setting bit enabled for doc 86, date Fri Jul 11 17:08:43 EDT 2003, term text was 0di0opnjs
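
The ordering claim above can be checked in isolation with plain Java. The
sketch below is illustrative only: TermTextCheck and its two helper
methods are hypothetical names, and decoding the text as a base-36 number
of milliseconds is an assumption about how DateField encodes dates, not
something confirmed in this message.

```java
// Hypothetical helper for checking the logged term texts; not Lucene code.
class TermTextCheck {

    // True if 'later' sorts strictly after 'earlier' by String.compareTo().
    static boolean sortsAfter(String later, String earlier) {
        return later.compareTo(earlier) > 0;
    }

    // Assumption: date terms are milliseconds since the epoch in base 36.
    static long decodeBase36(String text) {
        return Long.parseLong(text, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        // '1' sorts after '0', so "10" follows every logged date term.
        System.out.println(sortsAfter("10", "0dkbm48fr")); // prints true
        // "10" in base 36 is 36, i.e. 36 ms after the epoch.
        System.out.println(decodeBase36("10"));            // prints 36
        // The logged date terms are 9 characters long; "10" is only 2.
        System.out.println("0dkbm48fr".length() + " vs " + "10".length());
    }
}
```

If the base-36 assumption holds, a value of 36 ms would account for the
"Wed Dec 31 19:00:00 EST 1969" date printed next to the '10' terms, and
the two-character length suggests the text was not written in the same
padded form as the other date terms.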

If someone could either point me in the correct direction and/or isolate
the bug, it would be appreciated.



Regards,

Bruce Ritchie
