lucene-java-user mailing list archives

From "Anders Nielsen" <and...@visator.dk>
Subject RE: Memory Usage?
Date Mon, 12 Nov 2001 21:43:08 GMT
Hmm, I seem to be getting a different number of hits when I use the files
you sent out.

-----Original Message-----
From: Doug Cutting [mailto:DCutting@grandcentral.com]
Sent: 12. november 2001 20:47
To: 'Lucene Users List'
Subject: RE: Memory Usage?


> From: Anders Nielsen [mailto:anders@visator.dk]
>
> this was a big boolean query, with several prefixqueries but
> no wildcard
> queries in the or-branches.

Well it looks like those prefixes are expanding to a lot of terms, a total
of over 40,000!  (A prefix query expands into a BooleanQuery with all the
terms matching the prefix.)
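
To illustrate the expansion described above, here is a minimal sketch that does
not use Lucene itself. It assumes the term dictionary can be modelled as a sorted
set of strings; the class name PrefixExpansionSketch and the method expand are
hypothetical and only show why one prefix can turn into many OR'ed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration, not Lucene code: the term dictionary is
// modelled as a sorted set of strings.
final class PrefixExpansionSketch {

    // Returns every term that starts with the prefix. In a sorted
    // dictionary these terms are contiguous, so we seek to the prefix
    // and stop at the first term that no longer matches.
    static List<String> expand(SortedSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>(
                List.of("index", "indexed", "indexer", "inert", "query"));
        // Each returned term would become one clause of the boolean query.
        System.out.println(expand(dictionary, "index"));
    }
}
```

Each matching term becomes its own clause, so the cost of the query grows with
the number of distinct terms under the prefix, not with the length of the prefix.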

If most of these expansions are low-frequency, then a simple fix should
improve things considerably.  I've attached an optimized version of
TermQuery that will hold less memory per low-frequency term.  In particular,
if a term occurs fewer than 128 times then a 1024 byte InputStream buffer is
freed immediately.
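
The idea can be sketched roughly as follows. This is not the attached TermQuery;
it is a hypothetical model in which the class TermPostings stands in for a
per-term reader, and the doc ids passed to its constructor stand in for the
postings on disk. Only the 128-occurrence threshold and the 1024-byte buffer
size come from the description above, and the sketch treats the number of doc
ids as the occurrence count.

```java
// Hypothetical model of the optimization, not the attached TermQuery.
final class LowFrequencyTermSketch {
    static final int LOW_FREQUENCY_LIMIT = 128; // threshold from the message
    static final int BUFFER_BYTES = 1024;       // buffer size from the message

    static final class TermPostings {
        // Stands in for the InputStream buffer held while postings are read.
        private byte[] streamBuffer = new byte[BUFFER_BYTES];
        // Compact copy of the doc ids, kept only for low-frequency terms.
        private final int[] cachedDocs;

        TermPostings(int[] docIds) {
            if (docIds.length < LOW_FREQUENCY_LIMIT) {
                // Few postings: read them all up front, then drop the buffer.
                cachedDocs = docIds.clone();
                streamBuffer = null;
            } else {
                // Many postings: keep the buffer and read lazily (not modelled).
                cachedDocs = null;
            }
        }

        boolean holdsBuffer() {
            return streamBuffer != null;
        }

        // Rough payload size only; ignores object and array headers.
        long retainedBytes() {
            return holdsBuffer() ? BUFFER_BYTES : 4L * cachedDocs.length;
        }
    }

    public static void main(String[] args) {
        TermPostings rare = new TermPostings(new int[] {3, 17, 42});
        TermPostings common = new TermPostings(new int[500]);
        System.out.println(rare.holdsBuffer() + " " + rare.retainedBytes());
        System.out.println(common.holdsBuffer() + " " + common.retainedBytes());
    }
}
```

With tens of thousands of expanded terms, replacing a fixed 1024-byte buffer with
a few ints per rare term is where the saving would come from.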

Tell me how this works.  Please send another heap dump.

Longer term, or if lots of the expanded terms occur more than 128 times,
perhaps BooleanScorer should use a different algorithm when there are
thousands of terms.  In this case it might use less memory to construct an
array of score buckets for all documents.  If (query.termCount() * 1024) >
(12 * getMaxDoc()) then this would use less memory.  In your case, with
500,000 documents and a 40,000 term query, it's currently taking 40MB/query,
and could be done in 6MB/query.  This optimization would not be too
difficult, as it could be mostly isolated to BooleanQuery and BooleanScorer.
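
The arithmetic behind those figures can be checked with a small sketch. The
1024 bytes per term and 12 bytes per document come from the estimate above; the
class and method names are hypothetical, and megabytes are counted as millions
of bytes.

```java
// Hypothetical helper that reproduces the memory estimate from the message.
final class ScorerMemoryEstimate {
    static final long BYTES_PER_TERM = 1024; // one stream buffer per term
    static final long BYTES_PER_DOC = 12;    // one score bucket per document

    // Memory held when every expanded term keeps its own buffer.
    static long perTermBytes(long termCount) {
        return termCount * BYTES_PER_TERM;
    }

    // Memory held by an array of score buckets covering all documents.
    static long bucketArrayBytes(long maxDoc) {
        return maxDoc * BYTES_PER_DOC;
    }

    // The condition from the message: the bucket array wins when
    // termCount * 1024 > 12 * maxDoc.
    static boolean bucketArrayIsSmaller(long termCount, long maxDoc) {
        return perTermBytes(termCount) > bucketArrayBytes(maxDoc);
    }

    public static void main(String[] args) {
        long terms = 40_000;
        long docs = 500_000;
        System.out.println(perTermBytes(terms));     // 40960000, about 40MB
        System.out.println(bucketArrayBytes(docs));  // 6000000, i.e. 6MB
        System.out.println(bucketArrayIsSmaller(terms, docs));
    }
}
```

For a 500,000 document index the break-even point is a little under 5,900 terms
(6,000,000 / 1024), so only very large expansions would benefit.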

Doug





