lucene-dev mailing list archives

From Lance Norskog <>
Subject Re: IndexReader Cache - a different angle
Date Mon, 13 Sep 2010 17:54:15 GMT
Could there be another implementation of sorting? With very large
indexes and small total result spaces, it would make sense to
maintain a partial list of sorted ids per field. Every search that
finds new ids adds them to the master list. There could even be a
cache eviction policy.
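
A rough sketch of that idea in plain Java, under stated assumptions. This is not Lucene code: PartialSortCache, its loader function and the loads counter are hypothetical names, and the loader merely stands in for whatever would read one document's sort value from the index. Eviction is plain LRU via java.util.LinkedHashMap; any other policy could be swapped in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

/** Hypothetical per-field cache of sort values, filled lazily from search hits. */
class PartialSortCache {
    private final int maxEntries;
    private final IntFunction<String> loader; // assumption: reads one doc's (non-null) sort value
    private final LinkedHashMap<Integer, String> values;
    int loads = 0; // number of loader calls, kept only for illustration

    PartialSortCache(int maxEntries, IntFunction<String> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.values = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > PartialSortCache.this.maxEntries;
            }
        };
    }

    /** Returns the hits ordered by field value, loading only values not yet cached. */
    List<Integer> sort(List<Integer> hits) {
        // Resolve into a local map so eviction during this call cannot affect the comparator.
        Map<Integer, String> resolved = new HashMap<>();
        for (int doc : hits) {
            String value = values.get(doc);
            if (value == null) {
                value = loader.apply(doc);
                loads++;
                values.put(doc, value);
            }
            resolved.put(doc, value);
        }
        List<Integer> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing(resolved::get));
        return sorted;
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        Map<Integer, String> stored = new HashMap<>();
        stored.put(1, "pear");
        stored.put(2, "apple");
        stored.put(3, "mango");
        PartialSortCache cache = new PartialSortCache(2, stored::get);
        System.out.println(cache.sort(Arrays.asList(1, 2))); // prints [2, 1]
    }
}
```

Memory is bounded by maxEntries rather than by index size, at the price of re-reading values for documents that have been evicted.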

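The per-segment callback discussed in the quoted thread below (only new sub-readers are handed to the application after reopen()) could look roughly like the following. This is a sketch with stand-in types, not an existing Lucene API: PerSegmentCache and its loader are hypothetical, and the segment key S is assumed to be something stable per segment, for example the sub-readers returned by getSequentialSubReaders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical application-level cache holding one entry per segment. */
class PerSegmentCache<S, V> {
    private final Function<S, V> loader; // assumption: builds the cached data for one segment
    private Map<S, V> entries = new HashMap<>();

    PerSegmentCache(Function<S, V> loader) {
        this.loader = loader;
    }

    /**
     * Call after every open/reopen with the current list of segments.
     * Returns the segments that were actually loaded, i.e. the new ones.
     */
    List<S> refresh(List<S> currentSegments) {
        Map<S, V> next = new HashMap<>();
        List<S> loaded = new ArrayList<>();
        for (S segment : currentSegments) {
            V cached = entries.get(segment);
            if (cached == null) {
                cached = loader.apply(segment); // only new segments pay the load cost
                loaded.add(segment);
            }
            next.put(segment, cached);
        }
        entries = next; // entries for segments that disappeared are dropped
        return loaded;
    }

    V get(S segment) {
        return entries.get(segment);
    }

    public static void main(String[] args) {
        // Segment names stand in for sub-readers; the "cache" is just the name length.
        PerSegmentCache<String, Integer> cache = new PerSegmentCache<>(String::length);
        System.out.println(cache.refresh(Arrays.asList("_0", "_1")));  // prints [_0, _1]
        System.out.println(cache.refresh(Arrays.asList("_1", "_22"))); // prints [_22]
    }
}
```

With near-realtime reopen this keeps the refresh cost proportional to the new segments only, which is the point made in the thread about CachingWrapperFilter already caching per segment.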

On Mon, Sep 13, 2010 at 8:01 AM, Danil ŢORIN <> wrote:
> And it would be nice to have hooks in Lucene and avoid managing refs
> to IndexReader on reopen() and close() by myself.
> Oh...and to complicate things, my index is near-realtime using
> IndexWriter.getReader(), so it's not just IndexReader we need to
> change, but also IndexWriter should provide a reader that has a proper
> FieldCache implementation.
> And I'm a bit uncomfortable digging that deep :)
> On Mon, Sep 13, 2010 at 17:51, Danil ŢORIN <> wrote:
>> I'd second that....
>> In my use case we need to search, sometimes with sort, on a pretty big index...
>> So in the worst-case scenario we get an OOM while loading FieldCache as it
>> tries to create a huge array.
>> You can increase -Xmx, go to a bigger host, but in the end there WILL be
>> an index big enough to crash you.
>> My idea would be to use something like EhCache with a few elements in
>> memory and overflow to disk, so that if there are few unique terms, it
>> would be almost as fast as an array.
>> Otherwise in Collector/Sort/SortField/FieldComparator I would hit the
>> EhCache on disk (yes it would be a huge performance hit) but I won't
>> get OOMs and the results will STILL be sorted.
>> Right now SegmentReader/FieldCacheImpl are pretty hardcoded on
>> FieldCache.DEFAULT
>> And yes, I'm on 3.x...
>> On Mon, Sep 13, 2010 at 16:05, Tim Smith <> wrote:
>>>  i created some time ago
>>> proposing pretty much what seems to be discussed here
>>>  -- Tim
>>> On 09/12/10 10:18, Simon Willnauer wrote:
>>>> On Sun, Sep 12, 2010 at 11:46 AM, Michael McCandless
>>>> <>  wrote:
>>>>> Having hooks to enable an app to manage its own "external, private
>>>>> stuff associated w/ each segment reader" would be useful and it's been
>>>>> asked for in the past.  However, since we've now opened up
>>>>> SegmentReader, SegmentInfo/s, etc., in recent releases, can't an app
>>>>> already do this w/o core API changes?
>>>> The visitor approach would simply be a little more than syntactic
>>>> sugar where only new SubReader instances are passed to the callback.
>>>> You can do the same with the already existing API like
>>>> gatherSubReaders or getSequentialSubReaders. Every API I was talking
>>>> about would just be simplification anyway and would be possible to
>>>> build without changing the core.
>>>>> I know Earwin has built a whole system like this on top of Lucene --
>>>>> Earwin how did you do that...?  Did you make core changes to
>>>>> Lucene...?
>>>>> A custom Codec should be an excellent way to handle the specific use
>>>>> case (caching certain postings) -- by doing it as a Codec, any time
>>>>> anything in Lucene needs to tap into that posting (query scorers,
>>>>> filters, merging, applying deletes, etc), it hits this cache.  You
>>>>> could model it like PulsingCodec, which wraps any other Codec but
>>>>> handles the low-freq ones itself.  If you do it externally how would
>>>>> core use of postings hit it?  (Or was that not the intention?)
>>>>> I don't understand the filter use-case... the CachingWrapperFilter
>>>>> already caches per-segment, so that reopen is efficient?  How would
>>>>> external cache (built on these hooks) be different?
>>>> Man you are right - never mind :)
>>>> simon
>>>>> For faster filters we have to apply them like we do deleted docs if
>>>>> the filter is "random access" (such as being cached), LUCENE-1536 --
>>>>> flex actually makes this relatively easy now, since the postings API
>>>>> no longer implicitly filters deleted docs (ie you provide your own
>>>>> skipDocs) -- but these hooks won't fix that right?
>>>>> Mike
>>>>> On Sun, Sep 12, 2010 at 3:43 AM, Simon Willnauer
>>>>> <>  wrote:
>>>>>> Hey Shai,
>>>>>> On Sun, Sep 12, 2010 at 6:51 AM, Shai Erera<>
>>>>>>> Hey Simon,
>>>>>>> You're right that the application can develop a Caching mechanism outside
>>>>>>> Lucene, and when reopen() is called, if it changed, iterate on
>>>>>>> sub-readers and init the Cache w/ the new ones.
>>>>>> Alright, then we are on the same track I guess!
>>>>>>> However, by building something like that inside Lucene, the application will
>>>>>>> get more native support, and thus better performance, in some
>>>>>>> For example, consider a field fileType with 10 possible values, and for the sake
>>>>>>> of simplicity, let's say that the index is divided evenly across
>>>>>>> Your users always add such a term constraint to the query (e.g. they want to get
>>>>>>> results of fileType:pdf or fileType:odt, and perhaps sometimes
>>>>>>> but not others). You have basically two ways of supporting this:
>>>>>>> (1) Add such a term to the query / clause to a BooleanQuery w/ an AND
>>>>>>> relation -- cons is that this term / posting is read for every
>>>>>> Oh I wasn't saying that a cache framework would be obsolete and
>>>>>> shouldn't be part of lucene. My intention was rather to generalize
>>>>>> this functionality so that we can make the API change more easily
>>>>>> at the same time bringing the infrastructure you are proposing in
>>>>>> place.
>>>>>> Regarding your example above, filters are a very good example where
>>>>>> something like that could help to improve performance and we should
>>>>>> provide it with lucene core but I would again prefer the least
>>>>>> intrusive way to do so. If we can make that happen without adding
>>>>>> cache agnostic API we should do it. We really should try to sketch
>>>>>> a simple API which gives us access to the opened segReaders and see
>>>>>> that would be sufficient for our usecases. Specialization will always
>>>>>> be possible but I doubt that it is needed.
>>>>>>> (2) Write a Filter which works at the top IR level, that is refreshed
>>>>>>> whenever the index is refreshed. This is better than (1), however
>>>>>>> some disadvantages:
>>>>>>> (2.1) As Mike already proved (on some issue which I don't remember
>>>>>>> subject/number at the moment), if we could get Filter down to the lower
>>>>>>> level components of Lucene's search, so e.g. it is used as the
>>>>>>> docs OBS, we can get better performance w/ Filters.
>>>>>>> (2.2) The Filter is refreshed for the entire IR, and not just
>>>>>>> changed segments. Reason is, outside Collector, you have no way of telling
>>>>>>> IndexSearcher "use Filter F1 for segment S1 and F2 for segment
>>>>>>> Loading/refreshing the Filter may be expensive, and definitely
>>>>>>> perform well w/ NRT, where by definition you'd like to get small changes
>>>>>>> searchable very fast.
>>>>>> No doubt you are right about the above. A
>>>>>> PerSegmentCachingFilterWrapper would be something we can easily do
>>>>>> an application level basis with the infrastructure we are talking
>>>>>> about in place. While I don't exactly know how I feel that this
>>>>>> particular problem should rather be addressed internally and I'm
>>>>>> sure if the high level Cache mechanism is the right way to do it
>>>>>> this is just a gut feeling. But when I think about it twice it might
>>>>>> be way sufficient enough to do it....
>>>>>>> Therefore I think that if we could provide the necessary hooks
>>>>>>> Lucene, let's call it a Cache plug-in for now, we can incrementally improve
>>>>>>> search process. I don't want to go too far into the design of a generic
>>>>>>> plug-ins mechanism, but you're right (again :)) -- we could offer
>>>>>>> reopen(PluginProvider) which is entirely not about Cache, and
>>>>>>> would become one of the Plugins the PluginProvider provides. I just try to learn
>>>>>>> from past experience -- when the discussion is focused, there's
>>>>>>> better chance of getting to a resolution. However if you think that in this case, a
>>>>>>> more generic API, as PluginProvider, would get us to a resolution faster, I
>>>>>>> don't mind spend some time to think about it. But for all practical
>>>>>>> purposes, we should IMO start w/ a Cache plug-in, that is called
>>>>>>> that, and if it catches, generify it ...
>>>>>> I absolutely agree the API might be more generic but our current
>>>>>> use-case / PoC should be a caching. I don't like the name Plugin
>>>>>> that's a personal thing since you are not plugging anything in.
>>>>>> Something like SubreaderCallback or ReaderVisitor might be more
>>>>>> accurate but let's argue about the details later. Why not sketch
>>>>>> something out for the filter problem and follow on from there? The
>>>>>> more iteration the better and back to your question if that would
>>>>>> something which could make it to be committable I would say if it
>>>>>> works stand alone / not too tightly coupled I would absolutely say
>>>>>>> Unfortunately, I haven't had enough experience w/ Codecs yet (still on 3x)
>>>>>>> so I can't comment on how feasible that solution is. I'll take
>>>>>>> word for it that it's doable :). But this doesn't give us a 3x solution ... the
>>>>>>> Caching framework on trunk can be developed w/ Codecs.
>>>>>> I guess nobody really has except for Mike and maybe one or two others
>>>>>> but what I have done so far regarding codecs I would say that is
>>>>>> place to solve this particular problem. Maybe even lower than that
>>>>>> a Directory level. Anyhow, let's focus on application level caches
>>>>>> now. We are not aiming to provide a whole full-fledged Cache API
>>>>>> the infrastructure to make it easier to build those on an app basis
>>>>>> which would be a valuable improvement. We should also look at Solr's
>>>>>> cache implementations and how they could benefit from these efforts
>>>>>> since Solr uses app-level caching we can learn from API design wise.
>>>>>> simon
>>>>>>> Shai
>>>>>>> On Sat, Sep 11, 2010 at 10:41 PM, Simon Willnauer
>>>>>>> <>  wrote:
>>>>>>>> Hi Shai,
>>>>>>>> On Sat, Sep 11, 2010 at 8:08 PM, Shai Erera<>
>>>>>>>>> Hi
>>>>>>>>> Lucene's Caches have been heavily discussed before (e.g.,
>>>>>>>>> LUCENE-2133 and LUCENE-2394) and from what I can tell, there have been
>>>>>>>>> many proposals to attack this problem, w/ no developed
>>>>>>>> I didn't go through those issues so forgive me if something I bring up
>>>>>>>> has already been discussed.
>>>>>>>> I have a couple of questions about your proposal - please find them
>>>>>>>> inline...
>>>>>>>>> I'd like to explore a different, IMO much simpler, angle to attack this
>>>>>>>>> problem. Instead of having Lucene manage the Cache itself, we let the
>>>>>>>>> application manage it, however Lucene will provide the
>>>>>>>>> hooks in IndexReader to allow it. The hooks I have in mind
>>>>>>>>> (1) IndexReader current API for TermDocs, TermEnum, TermPositions etc. --
>>>>>>>>> already exists.
>>>>>>>>> (2) When reopen() is called, Lucene will take care to call a
>>>>>>>>> Cache.load(IndexReader), so that the application can pull whatever information
>>>>>>>>> it needs from the passed-in IndexReader.
>>>>>>>> Would that do anything else than passing the new reader (if
>>>>>>>> to the cache's load method? I wonder if this is more than
>>>>>>>> if (newReader != oldReader)
>>>>>>>>   Cache.load(newReader);
>>>>>>>> If so something like that should be done on a segment reader
>>>>>>>> right? From my perspective this isn't more than a callback or visitor
>>>>>>>> that should walk through the subreaders and be called for each
>>>>>>>> sub-reader. A cache-warming visitor / callback would then be trivial
>>>>>>>> and the API would be more general.
>>>>>>>>> So to be more concrete on my proposal, I'd like to support caching in
>>>>>>>>> the following way (and while I've spent some time thinking about it, I'm
>>>>>>>>> sure there are great suggestions to improve it):
>>>>>>>>> * Application provides a CacheFactory to,
>>>>>>>>> which exposes some very simple API, such as createCache, or
>>>>>>>>> initCache(IndexReader) etc. Something which returns a Cache object,
>>>>>>>>> which does not have very strict/concrete API.
>>>>>>>> My first question would be why the reader should know about Cache if
>>>>>>>> there is no strict / concrete API?
>>>>>>>> I can follow you with the CacheFactory to create cache objects but why
>>>>>>>> would the reader have to know / "receive" this object? Maybe this is
>>>>>>>> answered further down the path but I don't see the reason why the
>>>>>>>> notion of a "cache" must exist within open/reopen or if that could be
>>>>>>>> implemented in a more general "cache" - agnostic way.
>>>>>>>>> * IndexReader, most probably at the SegmentReader level
>>>>>>>>> CacheFactory to create a new Cache instance and calls
>>>>>>>>> load(IndexReader) method, so that the Cache would initialize
>>>>>>>> That is what I was thinking above - yet is that more than a callback
>>>>>>>> for each reopened or opened segment reader?
>>>>>>>>> * The application can use CacheFactory to obtain the Cache object per
>>>>>>>>> IndexReader (for example, during Collector.setNextReader), or we can
>>>>>>>>> have IndexReader offer a getCache() method.
>>>>>>>> :)  until here the cache is only used by the application itself not by
>>>>>>>> any Lucene API, right? I can think of many application specific
>>>>>>>> that could be useful to be associated with an IR beyond the
>>>>>>>> use case - again this could be a more general API solving
>>>>>>>> problem.
>>>>>>>>> * One of Cache API would be getCache(TYPE), where TYPE is a String or
>>>>>>>>> Object, or an interface CacheType w/ no methods, just to be a marker
>>>>>>>>> one, and the application is free to impl it however it wants. That's a
>>>>>>>>> loose API, I know, but completely at the application hands, which makes
>>>>>>>>> Lucene code simpler.
>>>>>>>> I like the idea together with the metadata associating functionality
>>>>>>>> from above something like public T IndexReader#get(Type<T>
>>>>>>>> Hmm that looks quite similar to Attributes, doesn't it?! :) However this
>>>>>>>> could be done in many ways but again "cache" - agnostic
>>>>>>>>> * We can introduce a TermsCache, TermEnumCache and TermVectorCache
>>>>>>>>> provide the user w/ IndexReader-similar API, only more efficient than
>>>>>>>>> say TermDocs -- something w/ random access to the docs
>>>>>>>>> perhaps even an OpenBitSet. Lucene can take advantage of it if, say, we create a
>>>>>>>>> CachingSegmentReader which makes use of the cache, and checks every time
>>>>>>>>> termDocs() is called if the required Term is cached or not etc. I admit
>>>>>>>>> I may be thinking too much ahead.
>>>>>>>> I see what you are trying to do here. I also see how this could be
>>>>>>>> useful but I guess coming up with a stable API which serves lots of
>>>>>>>> applications would be quite hard. A CachingSegmentReader could be a
>>>>>>>> very simple decorator which would not require to touch the
>>>>>>>> interface. Something like that could be part of lucene but I'm not
>>>>>>>> sure if necessarily part of lucene core.
>>>>>>>>> That's more or less what I've been thinking. I'm sure there are many
>>>>>>>>> details to iron out, but I hope I've managed to pass the general
>>>>>>>>> proposal through to you.
>>>>>>>> Absolutely, this is how it works isn't it!
>>>>>>>>> What I'm after first, is to allow applications deal w/
>>>>>>>>> caching more natively. For example, if you have a posting w/ payloads
>>>>>>>>> you'd like to read into memory, or if you would like a term's TermDocs
>>>>>>>>> to be cached (to be used as a Filter) etc. -- instead of writing something
>>>>>>>>> can work at the top IndexReader level, you'd be able to take advantage of
>>>>>>>>> Lucene internals, i.e. refresh the Cache only for the new segments ...
>>>>>>>> I wonder if a custom codec would be the right place to implement
>>>>>>>> caching / mem resident structures for Postings with payloads etc. You
>>>>>>>> could do that on a higher level too but codec seems to be the way to
>>>>>>>> go here, right?
>>>>>>>> To utilize per segment capabilities a callback for (re)opened
>>>>>>>> readers would be sufficient or do I miss something?
>>>>>>>> simon
>>>>>>>>> I'm sure that after this will be in place, we can refactor
>>>>>>>>> to work w/ that API, perhaps as a Cache specific implementation. But I
>>>>>>>>> leave that for later.
>>>>>>>>> I'd appreciate your comments. Before I set to implement it, I'd like to
>>>>>>>>> know if the idea has any chances of making it to a commit
>>>>>>>>> Shai
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail:
>>>>>>>> For additional commands, e-mail:

Lance Norskog
