lucene-dev mailing list archives

From Simon Willnauer <>
Subject Re: IndexReader Cache - a different angle
Date Sat, 11 Sep 2010 19:41:12 GMT
Hi Shai,

On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <> wrote:
> Hi
> Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
> many proposals to attack this problem, w/ no developed solution.

I didn't go through those issues so forgive me if something I bring up
has already been discussed.
I have a couple of questions about your proposal - please find them inline...

> I'd like to explore a different, IMO much simpler, angle to attack this
> problem. Instead of having Lucene manage the Cache itself, we let the
> application manage it, however Lucene will provide the necessary hooks
> in IndexReader to allow it. The hooks I have in mind are:
> (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
> already exists.
> (2) When reopen() is called, Lucene will take care to call a
> Cache.load(IndexReader), so that the application can pull whatever
> information it needs from the passed-in IndexReader.
Would that do anything other than pass the new reader (if reopened)
to the cache's load method? I wonder if this is more than
if (newReader != oldReader)

If so, something like that should be done on a segment reader anyway,
right? From my perspective this isn't more than a callback or visitor
that walks through the sub-readers and is called for each reopened
sub-reader. A cache-warming visitor / callback would then be trivial
and the API would be more general.
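
A minimal sketch of the callback idea above, in plain Java. None of these
types are Lucene classes: SegmentView, SegmentOpenedListener and
CompositeView are hypothetical stand-ins, and segments are identified by
name only. It assumes a reopen shares unchanged segments and notifies the
listener just for the newly opened ones.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-segment reader (not Lucene's SegmentReader).
final class SegmentView {
    final String name;
    SegmentView(String name) { this.name = name; }
}

// Hypothetical callback, invoked once for every newly opened segment.
interface SegmentOpenedListener {
    void onSegmentOpened(SegmentView segment);
}

// Hypothetical composite reader that reuses unchanged segments on reopen().
final class CompositeView {
    private final Map<String, SegmentView> segments = new LinkedHashMap<>();
    private final SegmentOpenedListener listener;

    CompositeView(List<String> segmentNames, SegmentOpenedListener listener) {
        this(segmentNames, listener, null);
    }

    private CompositeView(List<String> names, SegmentOpenedListener listener,
                          CompositeView previous) {
        this.listener = listener;
        for (String name : names) {
            SegmentView old = previous == null ? null : previous.segments.get(name);
            if (old != null) {
                segments.put(name, old);          // unchanged segment: share it, no callback
            } else {
                SegmentView fresh = new SegmentView(name);
                segments.put(name, fresh);
                listener.onSegmentOpened(fresh);  // new segment: let the application warm it
            }
        }
    }

    // Returns this instance if nothing changed, otherwise a new view.
    CompositeView reopen(List<String> currentNames) {
        if (currentNames.equals(new ArrayList<>(segments.keySet()))) {
            return this;
        }
        return new CompositeView(currentNames, listener, this);
    }
}

final class ReopenCallbackSketch {
    // Returns the names of the segments the listener saw during reopen only.
    static List<String> demo() {
        List<String> warmed = new ArrayList<>();
        CompositeView reader =
            new CompositeView(List.of("seg_0", "seg_1"), s -> warmed.add(s.name));
        warmed.clear(); // ignore the callbacks from the initial open
        reader.reopen(List.of("seg_0", "seg_1", "seg_2"));
        return warmed;
    }
}
```

With a hook like this, a cache-warming listener only ever touches segments
that did not exist before the reopen, and the reader itself needs no notion
of a "cache".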

> So to be more concrete on my proposal, I'd like to support caching in
> the following way (and while I've spent some time thinking about it, I'm
> sure there are great suggestions to improve it):
> * Application provides a CacheFactory to, which
> exposes some very simple API, such as createCache, or
> initCache(IndexReader) etc. Something which returns a Cache object,
> which does not have very strict/concrete API.

My first question would be why the reader should know about Cache if
there is no strict / concrete API?
I can follow you with the CacheFactory to create cache objects but why
would the reader have to know / "receive" this object? Maybe this is
answered further down the path but I don't see the reason why the
notion of a "cache" must exist within open/reopen or if that could be
implemented in a more general "cache" - agnostic way.
> * IndexReader, most probably at the SegmentReader level uses
> CacheFactory to create a new Cache instance and calls its
> load(IndexReader) method, so that the Cache would initialize itself.
That is what I was thinking above - yet is that more than a callback
for each reopened or opened segment reader?

> * The application can use CacheFactory to obtain the Cache object per
> IndexReader (for example, during Collector.setNextReader), or we can
> have IndexReader offer a getCache() method.
:)  Until here the cache is only used by the application itself, not by
any Lucene API, right? I can think of a lot of application-specific data
that could be useful to associate with an IR beyond the caching
use case - again, this could be a more general API solving that.
> * One of the Cache API methods would be getCache(TYPE), where TYPE is a String or
> Object, or an interface CacheType w/ no methods, just to be a marker
> one, and the application is free to impl it however it wants. That's a
> loose API, I know, but completely at the application hands, which makes
> Lucene code simpler.
I like the idea, together with the metadata-associating functionality
from above, something like public T IndexReader#get(Type<T> type).
Hmm, that looks quite similar to Attributes, doesn't it?! :) However, this
could be done in many ways, but again in a "cache"-agnostic way.
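
One way such a typed, cache-agnostic lookup could look, as a sketch only.
Key and Attachments are hypothetical names, not existing Lucene classes, and
the holder is assumed to live on (or next to) each reader. The key carries
the value type, so callers get a typed result without casting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical typed key. Keys compare by identity, so two keys of the same
// class are still two distinct slots.
final class Key<T> {
    private final Class<T> type;
    Key(Class<T> type) { this.type = type; }
    T cast(Object value) { return type.cast(value); } // Class.cast(null) returns null
}

// Hypothetical per-reader holder for arbitrary application data.
final class Attachments {
    private final Map<Key<?>, Object> values = new ConcurrentHashMap<>();

    // Returns the attached value, or null if nothing is attached for this key.
    <T> T get(Key<T> key) {
        return key.cast(values.get(key));
    }

    // Attaches a value on first use; later calls return the same instance.
    <T> T getOrCreate(Key<T> key, Supplier<T> factory) {
        return key.cast(values.computeIfAbsent(key, k -> factory.get()));
    }
}
```

An application could then declare, say, a Key<BitSet> for a cached filter
and a Key<String> for a label, and fetch both from the same holder. Whether
the attached object is a cache or any other metadata is invisible to the
holder.
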
> * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> provide the user w/ IndexReader-similar API, only more efficient than
> say TermDocs -- something w/ random access to the docs inside, perhaps
> even an OpenBitSet. Lucene can take advantage of it if, say, we create a
> CachingSegmentReader which makes use of the cache, and checks every time
> termDocs() is called if the required Term is cached or not etc. I admit
> I may be thinking too much ahead.
I see what you are trying to do here. I also see how this could be
useful, but I guess coming up with a stable API which serves lots of
applications would be quite hard. A CachingSegmentReader could be a
very simple decorator which would not require touching the IR
interface. Something like that could be part of Lucene, but I'm not
sure it necessarily belongs in Lucene core.
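
To illustrate the decorator shape, here is a sketch that does not use the
real IndexReader API. PostingsSource and CachingPostingsSource are
hypothetical, and postings are reduced to an int[] of doc ids per term. Only
terms the application opted into are cached; everything else goes straight
to the wrapped source.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, heavily simplified view of "give me the docs for a term".
interface PostingsSource {
    int[] docs(String term);
}

// Decorator: same interface as the wrapped source, plus a per-term cache.
final class CachingPostingsSource implements PostingsSource {
    private final PostingsSource delegate;
    private final Set<String> cachedTerms;
    private final Map<String, int[]> cache = new ConcurrentHashMap<>();

    CachingPostingsSource(PostingsSource delegate, Set<String> cachedTerms) {
        this.delegate = delegate;
        this.cachedTerms = cachedTerms;
    }

    @Override
    public int[] docs(String term) {
        if (!cachedTerms.contains(term)) {
            return delegate.docs(term); // not selected for caching: pass through
        }
        // The cached array is shared between callers and must not be modified.
        return cache.computeIfAbsent(term, delegate::docs);
    }
}

final class DecoratorSketch {
    // Counts how often the underlying source is actually read.
    static int demo() {
        AtomicInteger reads = new AtomicInteger();
        PostingsSource slow = term -> {
            reads.incrementAndGet();
            return new int[] {1, 4, 9};
        };
        PostingsSource reader = new CachingPostingsSource(slow, Set.of("lucene"));
        reader.docs("lucene");
        reader.docs("lucene"); // served from the cache
        reader.docs("solr");
        reader.docs("solr");   // not cached, read again
        return reads.get();
    }
}
```

Because the decorator implements the same interface as what it wraps, the
wrapped type needs no changes, which is the property that would let such a
class live outside core.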

> That's more or less what I've been thinking. I'm sure there are many
> details to iron out, but I hope I've managed to pass the general
> proposal through to you.

Absolutely, this is how it works, isn't it!

> What I'm after first is to allow applications to deal w/ postings caching more
> natively. For example, if you have a posting w/ payloads you'd like to
> read into memory, or if you would like a term's TermDocs to be cached
> (to be used as a Filter) etc. -- instead of writing something that can
> work at the top IndexReader level, you'd be able to take advantage of
> Lucene internals, i.e. refresh the Cache only for the new segments ...

I wonder if a custom codec would be the right place to implement
caching / mem resident structures for Postings with payloads etc. You
could do that on a higher level too but codec seems to be the way to
go here, right?
To utilize per-segment capabilities, a callback for (re)opened segment
readers would be sufficient, or am I missing something?

> I'm sure that after this is in place, we can refactor FieldCache to
> work w/ that API, perhaps as a Cache specific implementation. But I
> leave that for later.
> I'd appreciate your comments. Before I set to implement it, I'd like to
> know if the idea has any chances of making it to a commit :).
> Shai
