lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 13:40:46 GMT
On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen
<> wrote:

>> maybe thats because we have one huge monolithic implementation
> Doesn't the DocValues branch solve this?

Hopefully DocValues will replace FieldCache over time; maybe some day
we can deprecate & remove FieldCache.

But we still have work to do there, I believe; eg we don't have
comparators for all types (on the docvalues branch) yet.

> Also, instead of trying to implement clever ways of compressing
> strings in the field cache, which probably won't bear fruit, I'd
> prefer to look at [eventually] MMap'ing (using DV) the field caches to
> avoid the loading and heap costs, which are significant.  I'm not sure
> if we can easily MMap packed ints and the shared byte[], though it
> seems fairly doable?

In fact, the packed ints and the byte[] packing of terms data is very
much amenable/necessary for using MMap, far moreso than the separate
objects we had before.
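To illustrate why the packed layout mmaps so cleanly, here's a minimal sketch (not Lucene's actual PackedInts implementation; the class and field names are hypothetical): values packed at a fixed bits-per-value live in one contiguous region, so any value's byte offset is computable and a lookup is just a read plus shift/mask against a memory-mapped buffer:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: random access into a fixed-width packed-ints array
// backed by a memory-mapped file.  Little-endian byte order is assumed.
public class MMapPackedInts {
  private final MappedByteBuffer buf;
  private final int bitsPerValue;   // must be <= 57 so value + bit offset fit in one 8-byte read

  public MMapPackedInts(Path file, int bitsPerValue) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      this.buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
    this.bitsPerValue = bitsPerValue;
  }

  // Returns the index'th packed value: compute the bit position, read the
  // containing bytes, then shift and mask the value out.
  public long get(long index) {
    long bitPos = index * bitsPerValue;
    int bytePos = (int) (bitPos >>> 3);
    int shift = (int) (bitPos & 7);
    long word = 0;
    for (int i = 0; i < 8 && bytePos + i < buf.limit(); i++) {
      word |= (buf.get(bytePos + i) & 0xFFL) << (i * 8);
    }
    return (word >>> shift) & ((1L << bitsPerValue) - 1);
  }
}
```

The key property for mmap: there are no per-value Java objects to materialize at load time; the OS pages the region in lazily as `get` touches it.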

I agree we should make an mmap option, though I would generally
recommend against apps using mmap for these caches.  We load these
caches so that we'll have fast random access to potentially a great
many documents during collection of one query (eg for sorting).  When
you mmap them you let the OS decide when to swap stuff out, which means
you pick up potentially high query latency waiting for these pages to
swap back in.  Various other data structures in Lucene need this fast
random access (norms, del docs, terms index) and that's why we put
them in RAM.  I do agree that for all else (the laaaarge postings), MMap is
a good fit.

Of course the OS swaps out process RAM anyway, so... it's kinda moot
(unless you've fixed your OS to not do this, which I always do!).

I think a more productive area of exploration (to reduce RAM usage)
would be to make a StringFieldComparator that doesn't need full access
to all terms data, ie, operates per segment yet only does a "few" ord
lookups when merging the results across segments.  If "few" is small
enough we can just use the seek-by-ord from the terms dict to do
them.  This would be a huge RAM reduction because we could then sort
by string fields (eg "title" field) without needing all term bytes
randomly accessible.
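The idea could be sketched like this (a hypothetical illustration, not Lucene's real comparator API; `SegmentHits` and `lookupTerm` are stand-ins, the latter for the terms dict's seek-by-ord): each segment sorts its hits by term ord, which is just an int per doc and needs no term bytes, and only when merging across segments do we resolve the handful of top ords to actual terms:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: merge per-segment results sorted by ord, resolving
// only the top queue entries' term bytes via seek-by-ord at merge time.
public class OrdMergeSketch {
  interface SegmentHits {
    int[] topOrds();              // this segment's top docs' ords, ascending
    String lookupTerm(int ord);   // stand-in for terms dict seek-by-ord
  }

  // Only up to topK terms per segment are looked up, instead of keeping
  // every segment's full terms data randomly accessible in RAM.
  public static List<String> mergeTop(List<SegmentHits> segments, int topK) {
    // max-heap of the current best (smallest) topK terms
    PriorityQueue<String> pq = new PriorityQueue<>(Comparator.reverseOrder());
    for (SegmentHits seg : segments) {
      int[] ords = seg.topOrds();
      for (int i = 0; i < Math.min(topK, ords.length); i++) {
        pq.add(seg.lookupTerm(ords[i]));   // the "few" ord lookups
        if (pq.size() > topK) pq.poll();   // evict the largest
      }
    }
    List<String> out = new ArrayList<>(pq);
    Collections.sort(out);
    return out;
  }
}
```

Per segment the comparator only ever compares ints; the expensive term lookups are bounded by topK times the number of segments, which is why this would be such a big RAM win over loading all term bytes.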

