lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nigel <>
Subject Re: Optimizing unordered queries
Date Mon, 29 Jun 2009 13:33:26 GMT
On Mon, Jun 29, 2009 at 6:28 AM, Michael McCandless <> wrote:

> On Sun, Jun 28, 2009 at 9:08 PM, Nigel<> wrote:
> >> Unfortunately the TermInfos must still be hit to look up the
> >> freq/proxOffset in the postings files.
> >
> > But for that data you only have to hit the TermInfos for the terms you're
> > searching, correct?  So, assuming that there are vastly more terms in the
> > index than you ever actually use in a query, we could rely on the LRU
> cache
> > to keep the queried TermInfos around, rather than loading all of them
> > up-front.  This was a hypothesis based on some tracing through the code
> but
> > not a lot of knowledge of Lucene internals, so please steer me back to
> > reality if necessary.  (-:
> Right, it's only the terms in your query that need to be looked up.
> There's already an LRU cache for the Term -> TermInfos lookup, in
> TermInfosReader (hard wired to size 1024).  It was created as a "short
> term" cache so that queries that look up the same term twice (eg once
> for idf and once to get freq/prox postings position), would result in
> only one real lookup.  You could try privately experimenting w/ that
> value to see if you can get a reasonable it rate?

Exactly, the TermInfosReader cache is what I was thinking of.  Thanks for
the confirmation; I'll give that a try.

Lucene only loads the "indexed" terms (= every 128th) into RAM, by default.

Ah, I was confused by the index divisor being 1 by default: I thought it
meant that all terms were being loaded.  I see now in SegmentTermEnum that
the every-128th behavior is implemented at a lower level.

But I'm even more confused about why we have so many terms in memory.  A
heap dump shows over 270 million TermInfos, so if that's only 128th of the
total then we REALLY have a lot of terms.  (-:  We do have a lot of docs
(about 250 million), and we do have a couple unique per-document values, but
even so I can't see how we could get to 270 million x 128 terms.  (The heap
dump numbers are stable across the index close-and-reopen cycle, so I don't
think we're leaking.)


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message