lucene-dev mailing list archives

From Doug Cutting <>
Subject Re: svn commit: r332747 - in /lucene/java/trunk: ./ src/java/org/apache/lucene/search/regex/ src/test/org/apache/lucene/search/regex/
Date Wed, 16 Nov 2005 19:03:29 GMT
Yonik Seeley wrote:
> Totally untested, but here is a hack at what the scorer might look
> like when the number of terms is large.

Looks plausible to me.

You could instead use a byte[maxDoc] and encode/decode floats as you 
store and read them, to use a lot less RAM.
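To illustrate the idea, here is a sketch of a lossy float-to-byte codec with a 5-bit exponent and 3-bit mantissa, in the spirit of Lucene's norm encoding. The class and method names (`ByteFloat`, `floatToByte`, `byteToFloat`) are hypothetical, not the codec Lucene actually ships:

```java
// Hypothetical 8-bit minifloat codec: 5 exponent bits, 3 mantissa bits.
// Values with at most 3 significant mantissa bits round-trip exactly;
// everything else is truncated toward zero.
public class ByteFloat {

  public static byte floatToByte(float f) {
    if (f <= 0.0f) return 0;                       // no sign bit; 0 means 0.0
    int bits = Float.floatToIntBits(f);
    int mantissa = (bits >> 20) & 0x07;            // top 3 mantissa bits
    int exponent = ((bits >> 23) & 0xff) - 120;    // rebias IEEE exponent
    if (exponent > 31) { exponent = 31; mantissa = 7; } // clamp overflow
    if (exponent < 0)  { exponent = 0;  mantissa = 1; } // clamp underflow
    return (byte) ((exponent << 3) | mantissa);
  }

  public static float byteToFloat(byte b) {
    if (b == 0) return 0.0f;
    int mantissa = b & 0x07;
    int exponent = (b >> 3) & 0x1f;
    return Float.intBitsToFloat(((exponent + 120) << 23) | (mantissa << 20));
  }

  public static void main(String[] args) {
    System.out.println(byteToFloat(floatToByte(2.5f)));  // round-trips exactly
    System.out.println(byteToFloat(floatToByte(3.7f)));  // truncated to 3.5
  }
}
```

The decoded score is only an approximation, but for ranking purposes the loss of score precision is usually acceptable in exchange for a 4x reduction in RAM.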

>   // could also use a bitset to keep track of docs in the set...

I think that is probably a very important optimization.
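A minimal sketch of how the two structures could fit together (names like `CachedScores`, `add`, and `nextDoc` are illustrative, not from the patch): a dense `byte[]` of encoded scores indexed by docId, plus a `java.util.BitSet` recording which docs matched at all, so the scorer can skip ahead with `nextSetBit()` instead of scanning every doc:

```java
import java.util.BitSet;

// Hypothetical accumulator for an expanded query: one encoded score per
// docId, and a bit per docId saying whether the doc is in the result set.
public class CachedScores {
  private final byte[] scores;  // encoded score, indexed by docId
  private final BitSet docs;    // set bit <=> doc matched some term

  public CachedScores(int maxDoc) {
    scores = new byte[maxDoc];
    docs = new BitSet(maxDoc);
  }

  public void add(int docId, byte encodedScore) {
    scores[docId] = encodedScore;
    docs.set(docId);
  }

  /** First matching doc at or after target, or -1 when exhausted. */
  public int nextDoc(int target) {
    return docs.nextSetBit(target);
  }

  public byte score(int docId) {
    return scores[docId];
  }

  public static void main(String[] args) {
    CachedScores cs = new CachedScores(100);
    cs.add(3, (byte) 5);
    cs.add(42, (byte) 7);
    System.out.println(cs.nextDoc(0));   // 3
    System.out.println(cs.nextDoc(4));   // 42
    System.out.println(cs.nextDoc(43));  // -1
  }
}
```

The bitset also removes the ambiguity of a zero byte in the score array, since "score of zero" and "doc not in the set" become distinguishable.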

If you implemented both of these suggestions, this would use 9 bits/doc 
(8 for the encoded score plus 1 for the bitset), instead of 33 bits/doc.  
With a 100M doc index, that would be the difference between 
~112MB/query and ~412MB/query.  The classic term expanding approach 
uses perhaps 2k/term.  So, with a 100M document index, the byte array 
approach uses less memory for queries which expand to more than about 
56k terms.  The float-array method uses less memory only for queries 
with more than about 206k terms.
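The break-even arithmetic can be checked directly, assuming 8 bits per encoded score plus 1 bitset bit per doc for the byte-array variant, 32+1 bits for the float-array variant, and roughly 2,000 bytes of heap per expanded term for the classic rewrite (all of these are the assumptions above, not measured figures):

```java
// Back-of-envelope check of the per-query memory figures and the
// term-count break-even points against the classic per-term expansion.
public class BreakEven {
  public static void main(String[] args) {
    long maxDoc = 100_000_000L;       // 100M doc index
    long perTermBytes = 2_000L;       // assumed cost of classic expansion

    long byteArrayBytes  = maxDoc * 9  / 8;   // 9 bits/doc
    long floatArrayBytes = maxDoc * 33 / 8;   // 33 bits/doc

    System.out.println(byteArrayBytes  / 1_000_000 + " MB/query (byte[])");
    System.out.println(floatArrayBytes / 1_000_000 + " MB/query (float[])");
    System.out.println(byteArrayBytes  / perTermBytes + " terms break-even");
    System.out.println(floatArrayBytes / perTermBytes + " terms break-even");
  }
}
```

Below the break-even term count, the classic rewrite is cheaper; above it, the fixed-size array wins regardless of how far the query expands.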

