lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: IndexReader Cache - a different angle
Date Sun, 12 Sep 2010 04:51:18 GMT
Hey Simon,

You're right that the application can develop a caching mechanism outside
Lucene: when reopen() is called and the reader has changed, iterate over the
sub-readers and init the cache w/ the new ones.
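
Roughly, something like this (just a sketch against the 3.x API; AppCache and
cacheBySegment are made-up application-side names):

  // kept by the app across reopens: segment core key -> that segment's cache
  Map<Object, AppCache> cacheBySegment = new HashMap<Object, AppCache>();

  IndexReader newReader = oldReader.reopen();
  if (newReader != oldReader) {
    for (IndexReader sub : newReader.getSequentialSubReaders()) {
      // only warm segments we haven't seen before; unchanged ones keep theirs
      if (!cacheBySegment.containsKey(sub.getCoreCacheKey())) {
        AppCache cache = new AppCache();
        cache.load(sub);   // pull whatever the app needs from this segment
        cacheBySegment.put(sub.getCoreCacheKey(), cache);
      }
    }
    oldReader.close();
  }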

However, by building something like that inside Lucene, the application will
get more native support, and thus better performance, in some cases. For
example, consider a field fileType with 10 possible values and, for the sake
of simplicity, let's say the index is divided evenly across them. Your users
always add such a term constraint to the query (e.g. they want to get results
of fileType:pdf or fileType:odt, perhaps sometimes both, but never the
others). You have basically two ways of supporting this (a rough sketch of
both options follows the list):
(1) Add such a term to the query, i.e. an extra clause on a BooleanQuery w/ an
AND relation -- the con is that this term's postings are read for every query.

(2) Write a Filter which works at the top IR level and is refreshed whenever
the index is refreshed. This is better than (1), but it has some
disadvantages:

(2.1) As Mike already showed (on some issue whose subject/number I don't
remember at the moment), if we could push the Filter down to the lower-level
components of Lucene's search, so that e.g. it is applied like the deleted
docs OBS, we could get better performance w/ Filters.

(2.2) The Filter is refreshed for the entire IR, and not just the changed
segments. The reason is that, outside Collector, you have no way of telling
IndexSearcher "use Filter F1 for segment S1 and F2 for segment S2".
Loading/refreshing the Filter may be expensive, and definitely won't perform
well w/ NRT, where by definition you'd like to get small changes searchable
very fast.
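
To make the two options concrete, roughly (just an illustration w/ the stock
3.x classes; userQuery and searcher are assumed to exist already):

  // (1) constrain the query itself -- the fileType postings are walked
  // again on every single query
  BooleanQuery q = new BooleanQuery();
  q.add(userQuery, BooleanClause.Occur.MUST);
  q.add(new TermQuery(new Term("fileType", "pdf")), BooleanClause.Occur.MUST);
  TopDocs hits = searcher.search(q, 10);

  // (2) keep a Filter around and pass it alongside the query, refreshing it
  // ourselves whenever the index is reopened -- that refresh is exactly
  // where (2.1) and (2.2) hurt
  Filter pdfOnly =
      new QueryWrapperFilter(new TermQuery(new Term("fileType", "pdf")));
  TopDocs filtered = searcher.search(userQuery, pdfOnly, 10);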

Therefore I think that if we could provide the necessary hooks in Lucene --
let's call it a Cache plug-in for now -- we could incrementally improve the
search process. I don't want to go too far into the design of a generic
plug-in mechanism, but you're right (again :)) -- we could offer a
reopen(PluginProvider) which is not about Cache at all, and Cache would become
one of the plug-ins the PluginProvider provides. I'm just trying to learn
from past experience -- when the discussion is focused, there's a better
chance of getting to a resolution. However, if you think that in this case a
more generic API, such as PluginProvider, would get us to a resolution faster,
I don't mind spending some time to think about it. But for all practical
purposes, we should IMO start w/ a Cache plug-in, called exactly that, and if
it catches on, generify it ...
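
Just to make the shape of that hook concrete, something like the following
(all names here are made up -- nothing like this exists yet):

  public interface Cache {
    // warm this cache from a single (newly opened) segment reader
    void load(IndexReader segmentReader) throws IOException;
  }

  public interface CacheFactory {
    Cache createCache(IndexReader segmentReader) throws IOException;
  }

  // IndexReader.reopen(CacheFactory) would call createCache()/load() only
  // for the segments that actually changed, and carry the existing Cache
  // objects over for the segments that did not.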

Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x),
so I can't comment on how feasible that solution is. I'll take your word for
it that it's doable :). But that doesn't give us a 3x solution ... the
caching framework on trunk could be developed w/ Codecs.

Shai

On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer <
simon.willnauer@googlemail.com> wrote:

> Hi Shai,
>
> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera <serera@gmail.com> wrote:
> > Hi
> >
> > Lucene's Caches have been heavily discussed before (e.g., LUCENE-831,
> > LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
> > many proposals to attack this problem, w/ no developed solution.
>
> I didn't go through those issues so forgive me if something I bring up
> has already been discussed.
> I have a couple of question about your proposal - please find them
> inline...
>
> >
> > I'd like to explore a different, IMO much simpler, angle to attack this
> > problem. Instead of having Lucene manage the Cache itself, we let the
> > application manage it, while Lucene provides the necessary hooks
> > in IndexReader to allow it. The hooks I have in mind are:
> >
> > (1) IndexReader's current API for TermDocs, TermEnum, TermPositions etc. --
> > this already exists.
> >
> > (2) When reopen() is called, Lucene will take care to call a
> > Cache.load(IndexReader), so that the application can pull whatever
> > information
> > it needs from the passed-in IndexReader.
> Would that do anything more than passing the new reader (if reopened)
> to the cache's load method? I wonder if this is more than
>
>   if (newReader != oldReader)
>     cache.load(newReader);
>
> If so, something like that should be done on a segment reader anyway,
> right? From my perspective this isn't more than a callback or visitor
> that should walk through the sub-readers and be called for each reopened
> sub-reader. A cache-warming visitor / callback would then be trivial
> and the API would be more general.
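>
> Roughly (made-up names, just to show the shape of such a callback):
>
>   public interface SubReaderCallback {
>     // called once for each sub-reader that was newly (re)opened
>     void onReopen(IndexReader subReader) throws IOException;
>   }
>
> A reopen(SubReaderCallback) would then walk the sub-readers and invoke the
> callback only for the (re)opened ones.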
>
>
> > So to be more concrete on my proposal, I'd like to support caching in
> > the following way (and while I've spent some time thinking about it, I'm
> > sure there are great suggestions to improve it):
> >
> > * The application provides a CacheFactory to IndexReader.open/reopen, which
> > exposes a very simple API, such as createCache or
> > initCache(IndexReader) etc. -- something which returns a Cache object,
> > which does not have a very strict/concrete API.
>
> My first question would be: why should the reader know about Cache if
> there is no strict / concrete API?
> I can follow you on using a CacheFactory to create cache objects, but why
> would the reader have to know about / "receive" this object? Maybe this is
> answered further down, but I don't see why the notion of a "cache" must
> exist within open/reopen, or whether that could be implemented in a more
> general, cache-agnostic way.
> >
> > * IndexReader, most probably at the SegmentReader level, uses
> > CacheFactory to create a new Cache instance and calls its
> > load(IndexReader) method, so that the Cache can initialize itself.
> That is what I was thinking above - yet is that more than a callback
> for each reopened or opened segment reader?
>
> >
> > * The application can use CacheFactory to obtain the Cache object per
> > IndexReader (for example, during Collector.setNextReader), or we can
> > have IndexReader offer a getCache() method.
> :)  Up to here the cache is only used by the application itself, not by
> any Lucene API, right? I can think of lots of application-specific data
> that could usefully be associated with an IR beyond the caching
> use case - again, this could be a more general API solving that
> problem.
> >
> > * One of the Cache's API methods would be getCache(TYPE), where TYPE is a String or
> > Object, or an interface CacheType w/ no methods, just to be a marker
> > one, and the application is free to impl it however it wants. That's a
> > loose API, I know, but completely at the application hands, which makes
> > Lucene code simpler.
> I like the idea, together with the metadata-associating functionality
> from above -- something like public <T> T IndexReader#get(Type<T> type).
> Hmm, that looks quite similar to Attributes, doesn't it?! :) However this
> could be done in many ways, but again cache-agnostic.
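>
> For instance (made-up names; just the usual typesafe-container trick):
>
>   public final class Type<T> {}   // marker key, one instance per kind of data
>
>   private final Map<Type<?>, Object> data = new HashMap<Type<?>, Object>();
>
>   @SuppressWarnings("unchecked")
>   public <T> T get(Type<T> type) {
>     return (T) data.get(type);   // safe as long as put() is typed the same way
>   }
>
>   public <T> void put(Type<T> type, T value) {
>     data.put(type, value);
>   }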
> >
> > * We can introduce a TermsCache, TermEnumCache and TermVectorCache to
> > provide the user w/ an IndexReader-like API, only more efficient than,
> > say, TermDocs -- something w/ random access to the docs inside, perhaps
> > even an OpenBitSet. Lucene can take advantage of it if, say, we create a
> > CachingSegmentReader which makes use of the cache and checks, every time
> > termDocs() is called, whether the required Term is cached or not etc. I
> > admit I may be thinking too far ahead.
> I see what you are trying to do here. I also see how this could be
> useful, but I guess coming up with a stable API which serves lots of
> applications would be quite hard. A CachingSegmentReader could be a
> very simple decorator which would not require touching the IR
> interface. Something like that could be part of Lucene, but I'm not
> sure it necessarily belongs in Lucene core.
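>
> A minimal shape for such a decorator (FilterIndexReader is in core already;
> TermsCache here is hypothetical):
>
>   public class CachingSegmentReader extends FilterIndexReader {
>     private final TermsCache cache;   // hypothetical per-segment cache
>
>     public CachingSegmentReader(IndexReader in, TermsCache cache) {
>       super(in);
>       this.cache = cache;
>     }
>
>     @Override
>     public TermDocs termDocs(Term term) throws IOException {
>       TermDocs cached = cache.termDocs(term);   // null if term isn't cached
>       return cached != null ? cached : in.termDocs(term);
>     }
>   }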
>
> > That's more or less what I've been thinking. I'm sure there are many
> > details to iron out, but I hope I've managed to pass the general
> > proposal through to you.
>
> Absolutely, this is how it works isn't it!
>
> >
> > What I'm after first is to allow applications to deal w/ postings caching
> > more natively. For example, if you have a posting w/ payloads you'd like to
> > read into memory, or if you would like a term's TermDocs to be cached
> > (to be used as a Filter) etc. -- instead of writing something that can
> > work at the top IndexReader level, you'd be able to take advantage of
> > Lucene internals, i.e. refresh the Cache only for the new segments ...
>
> I wonder if a custom codec would be the right place to implement
> caching / memory-resident structures for postings with payloads etc. You
> could do that on a higher level too, but a codec seems to be the way to
> go here, right?
> To utilize per-segment capabilities, a callback for (re)opened segment
> readers would be sufficient -- or am I missing something?
>
> simon
> >
> > I'm sure that once this is in place, we can refactor FieldCache to
> > work w/ that API, perhaps as a Cache-specific implementation. But I'll
> > leave that for later.
> >
> > I'd appreciate your comments. Before I set out to implement it, I'd like
> > to know if the idea has any chance of making it into a commit :).
> >
> > Shai
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
