lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kroehling, Thomas" <thomas.kroehl...@coremedia.com>
Subject Question about termDocs.read(docs, freqs)
Date Tue, 19 Sep 2006 11:43:20 GMT
Hi,
I am trying to write a WildcardFilter in order to prevent
TooManyBooleanClauses and high memory usage. I wrap a Filter in a
ConstantScoreQuery. I enumerate over the WildcardTerms for a query. This
way I can set a maximum number of terms which i will evaluate. If too
many terms match, I throw an exception. I also have a maximum number of
documents which are allowed to match using BitsSets cardinality. I am
not sure, if that is necessary, but I thought, if only a few terms, but
a few million documents might match, that could also considerably slow
down a search. 
I thought, I could get the TermDocs for each WildcardTerm and use:

int count = termDocs.read(docs, freqs);

In order to have an optimized way to read not more than a maximum number
of documents which match this term. I would then step through docs and
set the bits for these documents. Sometimes this works, but sometimes
this returns a different number search results. 
When I search for "content:test" in my index, I find 66 documents, but
when I search for "t*st" with my WildcardFilter, I only find 23. There
is only one term "test" matching this query and searching for this term
in Luke also returns 66 documents. I found out that the SegmentTermDocs
sets a variable df to "23", which leads to stop searching any further.
Unfortunately I do not quite understand, where this variable really
comes from and what it is for.

I probably could just step through the TermDocs for each WildcardTerm.
Is that should a correct (and not dramatically slow) way to find all
documents? But I would like to understand, the difference in search
results and what the method TermDocs.read(docs, freqs) method does and
if my kind of filter does really make sense. I periodically rebuild my
index and I wonder why my WildcardFilter sometimes returns the correct
search results and sometimes not. What is the difference between steping
through the term docs with termDocs.next() and using the read-method.
Can anybodey explain that?

Thanks in advance,
Thomas

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message