lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Java caching of low-level index data?
Date Wed, 22 Jul 2009 16:37:19 GMT
I think it's a neat idea!

But you are in fact fighting the OS so I'm not sure how well this'll
work in practice.

EG the OS will happily swap out pages from your process if it thinks
you're not using them, so it'd easily swap out your cache in favor of
its own IO cache (this is the "swappiness" configuration on Linux),
which would then kill performance (take a page fault when you finally
did need to use your cache).  In C (possibly requiring root) you could
wire the pages, but we can't do that from javaland, so it's already
not a fair fight.

Mike

On Wed, Jul 22, 2009 at 11:56 AM, eks dev<eksdev@yahoo.co.uk> wrote:
> imo, it is too low level to do it better than OSs. I agree, cache unloading
> effect would be prevented with it, but I am not sure if it brings net-net
> benefit, you would get this problem fixed, but probably OS would kill you
> anyhow (you took valuable memory from OS) on queries that miss your internal
> cache...
>
> We could try to do better if we put more focus on higher levels and do the
> caching there... maybe even cache somehow some CPU work, e.g.  keep dense
> Postings in "faster, less compressed" format, load TermDictionary into
> RAMDirectory and keep the rest on disk.. Ideas in that direction have better
> chance to bring us forward. Take for example FuzzyQuery, there you can do
> some LRU caching at Term level and save huge amounts of IO and CPU...
>
>
>
>
> From: Shai Erera <serera@gmail.com>
> To: java-dev@lucene.apache.org
> Sent: Wednesday, 22 July, 2009 17:32:34
> Subject: Re: Java caching of low-level index data?
>
> That's an interesting idea.
>
> I always wonder however how much exactly would we gain, vs. the effort spent
> to develop, debug and maintain it. Just some thoughts that we should
> consider regarding this:
>
> * For very large indices, where we think this will generally be good for, I
> believe it's reasonable to assume that the search index will sit on its own
> machine, or set of CPUs, RAM and HD. Therefore given that very little will run
> on the OS other than the search index, I assume the OS cache will be enough
> (if not better)?
>
> * In other cases, where the search app runs together w/ other apps, I'm not
> sure how much we'll gain. I can assume such apps will use a smaller index,
> or will not need to support high query load? If so, will they really care if
> we cache their data, vs. the OS?
>
> Like I said, these are just thoughts. I don't mean to cancel the idea w/
> them, just to think how much will it improve performance (vs. maybe even
> hurt it?). Often I find that some optimizations that are done will
> benefit very large indices. But these usually get their decent share of
> resources, and the JVM itself is run w/ larger heap etc. So these
> optimizations turn out to not affect such indices much after all. And for
> smaller indices, performance is usually not a problem (well ... they might
> just fit entirely in RAM).
>
> Shai
>
> On Wed, Jul 22, 2009 at 6:21 PM, Nigel <nigelspleen@gmail.com> wrote:
>>
>> In discussions of Lucene search performance, the importance of OS caching
>> of index data is frequently mentioned.  The typical recommendation is to
>> keep plenty of unallocated RAM available (e.g. don't gobble it all up with
>> your JVM heap) and try to avoid large I/O operations that would purge the OS
>> cache.
>>
>> I'm curious if anyone has thought about (or even tried) caching the
>> low-level index data in Java, rather than in the OS.  For example, at the
>> IndexInput level there could be an LRU cache of byte[] blocks, similar to
>> how a RDBMS caches index pages.  (Conveniently, BufferedIndexInput already
>> reads in 1k chunks.) You would reverse the advice above and instead make
>> your JVM heap as large as possible (or at least large enough to achieve a
>> desired speed/space tradeoff).
>>
>> This approach seems like it would have some advantages:
>>
>> - Explicit control over how much you want cached (adjust your JVM heap and
>> cache settings as desired)
>> - Cached index data won't be purged by the OS doing other things
>> - Index warming might be faster, or at least more predictable
>>
>> The obvious disadvantage for some situations is that more RAM would now be
>> tied up by the JVM, rather than managed dynamically by the OS.
>>
>> Any thoughts?  It seems like this would be pretty easy to implement
>> (subclass FSDirectory, return subclass of FSIndexInput that checks the cache
>> before reading, cache keyed on filename + position), but maybe I'm
>> oversimplifying, and for that matter a similar implementation may already
>> exist somewhere for all I know.
>>
>> Thanks,
>> Chris
>
>
>
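
For reference, the block cache proposed at the start of the thread (an LRU cache of byte[] blocks, keyed on filename + position) can be sketched in plain Java. This is only a minimal sketch under stated assumptions, not code from the thread or from Lucene: BlockCache, BlockLoader and getBlock are hypothetical names, the loader stands in for the real disk read, and the wiring into an FSDirectory / IndexInput subclass is left out. The 1k block size mirrors the chunk size mentioned for BufferedIndexInput.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of an LRU cache of fixed-size index blocks (hypothetical, not a Lucene class). */
final class BlockCache {

    /** Hypothetical stand-in for the real disk read; not a Lucene API. */
    interface BlockLoader {
        byte[] load(String fileName, long blockIndex);
    }

    /** Block size in bytes, matching the 1k chunks mentioned in the thread. */
    static final int BLOCK_SIZE = 1024;

    private final Map<String, byte[]> blocks;
    private long hits;
    private long misses;

    BlockCache(final int maxBlocks) {
        // accessOrder=true makes LinkedHashMap iterate from least to most
        // recently used, so the "eldest" entry is the LRU block.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns the block containing the given file position, loading it on a miss. */
    synchronized byte[] getBlock(String fileName, long position, BlockLoader loader) {
        long blockIndex = position / BLOCK_SIZE;
        String key = fileName + '@' + blockIndex; // cache key: filename + block position
        byte[] block = blocks.get(key);
        if (block == null) {
            misses++;
            // The load happens under the lock here for simplicity; a real
            // implementation would likely want finer-grained locking.
            block = loader.load(fileName, blockIndex);
            blocks.put(key, block);
        } else {
            hits++;
        }
        return block;
    }

    synchronized long hits() { return hits; }

    synchronized long misses() { return misses; }

    synchronized int size() { return blocks.size(); }
}
```

Note that the cached blocks live on the JVM heap, so the concerns raised in the replies still apply: the OS may swap those pages out, and the memory is no longer available to the OS IO cache.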
