lucene-dev mailing list archives

From Jason Rutherglen <jason.rutherg...@gmail.com>
Subject Re: FST and FieldCache?
Date Thu, 19 May 2011 14:09:02 GMT
> When
> you mmap them you let the OS decide when to swap stuff out, which means
> you pick up potentially high query latency waiting for these pages to
> swap back in

Right, however if one is using, let's say, SSDs, and query time is
less important, then MMap'ing would be fine.  It also prevents deadly
OOMs in favor of basic 'slowness' of the query.  If there is no
performance degradation, I think MMap'ing is a great option.  A common
use case today is an index that's far too large for a given server: it
simply will not work, whereas with MMap'ed field caches the query
would complete, just extremely slowly.  If the user wishes to improve
performance, it's easy enough to add more hardware.
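As a rough illustration of the trade-off (plain JDK NIO, not Lucene code; the
fixed-width int-per-doc file layout here is a made-up stand-in for an on-disk
field cache), memory-mapping gives random access by document number without a
heap-resident array, and the OS pages data in on demand:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Write a small file of ints to stand in for an on-disk field cache.
        Path p = Files.createTempFile("fieldcache", ".bin");
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(4 * 100);
            for (int i = 0; i < 100; i++) buf.putInt(i * 3);
            buf.flip();
            ch.write(buf);
        }

        // Map it read-only: the OS pages data in on first access and may
        // evict it under memory pressure, instead of the JVM OOM'ing.
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int docId = 42;
            int value = map.getInt(docId * 4); // random access, no heap array
            System.out.println(value);         // -> 126
        }
        Files.delete(p);
    }
}
```

The read at `docId * 4` may fault the page in from disk, which is exactly the
latency concern Mike raises below, but the JVM heap stays small, so an
oversized cache degrades rather than OOMs.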

On Thu, May 19, 2011 at 6:40 AM, Michael McCandless
<lucene@mikemccandless.com> wrote:
> On Thu, May 19, 2011 at 9:22 AM, Jason Rutherglen
> <jason.rutherglen@gmail.com> wrote:
>
>>> maybe thats because we have one huge monolithic implementation
>>
>> Doesn't the DocValues branch solve this?
>
> Hopefully DocValues will replace FieldCache over time; maybe some day
> we can deprecate & remove FieldCache.
>
> But we still have work to do there, I believe; eg we don't have
> comparators for all types (on the docvalues branch) yet.
>
>> Also, instead of trying to implement clever ways of compressing
>> strings in the field cache, which probably won't bear fruit, I'd
>> prefer to look at [eventually] MMap'ing (using DV) the field caches to
>> avoid the loading and heap costs, which are significant.  I'm not sure
>> if we can easily MMap packed ints and the shared byte[], though it
>> seems fairly doable?
>
> In fact, the packed ints and the byte[] packing of terms data are very
> much amenable/necessary for using MMap, far more so than the separate
> objects we had before.
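A sketch of why the packed layout maps well: decoding a b-bit packed value
needs only offset arithmetic over a flat block of longs, and the same
arithmetic could run against a `MappedByteBuffer` via `getLong(block * 8)`.
This is a toy packer/decoder written for illustration, not Lucene's actual
PackedInts implementation:

```java
public class PackedIntsSketch {
    // Pack value v (must fit in bitsPerValue bits) at position index.
    static void set(long[] blocks, int bitsPerValue, int index, long v) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        blocks[block] |= v << offset;
        if (offset + bitsPerValue > 64) {
            blocks[block + 1] |= v >>> (64 - offset); // spill into next long
        }
    }

    // Decode position index: pure arithmetic on a flat array, so the same
    // code could read from a memory-mapped buffer instead of the heap.
    static long get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long value = blocks[block] >>> offset;
        if (offset + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - offset);
        }
        return value & ((1L << bitsPerValue) - 1);
    }

    public static void main(String[] args) {
        int b = 10, n = 100;
        long[] blocks = new long[(n * b + 63) / 64];
        for (int i = 0; i < n; i++) set(blocks, b, i, (i * 37) % 1024);
        System.out.println(get(blocks, b, 42)); // round-trips the packed value
    }
}
```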
>
> I agree we should make an mmap option, though I would generally
> recommend against apps using mmap for these caches.  We load these
> caches so that we'll have fast random access to potentially a great
> many documents during collection of one query (eg for sorting).  When
> you mmap them you let the OS decide when to swap stuff out, which means
> you pick up potentially high query latency waiting for these pages to
> swap back in.  Various other data structures in Lucene need this fast
> random access (norms, del docs, terms index) and that's why we put
> them in RAM.  I do agree that for all else (the laaaarge postings),
> MMap is great.
>
> Of course the OS swaps out process RAM anyway, so... it's kinda moot
> (unless you've fixed your OS to not do this, which I always do!).
>
> I think a more productive area of exploration (to reduce RAM usage)
> would be to make a StringFieldComparator that doesn't need full access
> to all terms data, ie, operates per segment yet only does a "few" ord
> lookups when merging the results across segments.  If "few" is small
> enough we can just use the seek-by-ord from the terms dict to do
> them.  This would be a huge RAM reduction because we could then sort
> by string fields (eg "title" field) without needing all term bytes
> randomly accessible.
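To make the idea concrete, here is a toy simulation (plain Java, not the
Lucene API; `Segment` and `lookupOrd` are hypothetical stand-ins for a
segment's terms dict and its seek-by-ord): within each segment, docs are
sorted purely by ord, and term bytes are resolved only for the handful of top
candidates that must be merged across segments:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class OrdSortSketch {
    // Hypothetical stand-in for a segment: a sorted term dictionary
    // (lookupOrd simulates the terms dict's seek-by-ord) plus per-doc ords.
    static class Segment {
        final String[] sortedTerms;   // ord -> term
        final int[] docOrds;          // docId -> ord
        Segment(String[] t, int[] o) { sortedTerms = t; docOrds = o; }
        String lookupOrd(int ord) { return sortedTerms[ord]; } // the "few" lookups
    }

    public static void main(String[] args) {
        Segment s1 = new Segment(new String[]{"apple", "pear", "zebra"},
                                 new int[]{2, 0, 1});   // docs 0..2
        Segment s2 = new Segment(new String[]{"banana", "kiwi"},
                                 new int[]{1, 0, 1});   // docs 0..2

        List<String> merged = new ArrayList<>();
        for (Segment seg : List.of(s1, s2)) {
            // Within a segment, sort docs by ord alone: no term bytes touched.
            Integer[] docs = {0, 1, 2};
            Arrays.sort(docs, Comparator.comparingInt(d -> seg.docOrds[d]));
            // Resolve term bytes only for the top-2 candidates per segment.
            merged.add(seg.lookupOrd(seg.docOrds[docs[0]]));
            merged.add(seg.lookupOrd(seg.docOrds[docs[1]]));
        }
        // Merge across segments by the few resolved terms.
        Collections.sort(merged);
        System.out.println(merged.subList(0, 2)); // global top-2 terms
    }
}
```

Only four ord lookups happen here (two per segment), however many documents
the segments hold, which is the RAM win the paragraph above describes.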
>
> Mike
>
> http://blog.mikemccandless.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

