lucene-dev mailing list archives

From Grant Ingersoll <>
Subject Re: Payload Loading and Reloading
Date Thu, 29 Nov 2007 23:01:41 GMT
The use case I have is for LUCENE-1001, so the caching is going to
happen somewhere in Lucene, not necessarily in the application.  I think
caching it in SegmentTermPositions is the simplest, but I will have to
look at the alternatives.  It is particularly problematic in the Near
Spans case (ordered and unordered), but maybe I can address it there.

As for the cost of the seeks, why can't we just document that a repeated
call has to seek back and copy the payload again, and discourage people
from doing it?  However, if they really feel they need to call it again,
why not let them?  After all, it's still cheaper than going back to the
beginning and starting over.  Just because you can call something twice
doesn't mean you must.
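
For reference, the decorator idea from Michael's quoted message below can be sketched roughly as follows. This is only an illustration under assumptions: PayloadSource is a hypothetical stand-in for the payload-related methods of TermPositions (nextPosition(), getPayloadLength(), getPayload(byte[], int)), simplified to leave out IOException, and OneShotPayloadSource is a made-up in-memory source that mimics the load-once rule. Neither is actual Lucene code.

```java
import java.util.Arrays;

// Hypothetical stand-in for the payload-related part of TermPositions.
// The real interface declares IOException; omitted here for brevity.
interface PayloadSource {
    int nextPosition();
    int getPayloadLength();
    byte[] getPayload(byte[] data, int offset);
}

// Decorator that caches the payload of the current position, so that
// getPayload() can be called more than once without touching the
// wrapped source again.
class CachingPayloadSource implements PayloadSource {
    private final PayloadSource in;
    private byte[] cached; // null until the payload at this position is loaded

    CachingPayloadSource(PayloadSource in) { this.in = in; }

    @Override public int nextPosition() {
        cached = null; // a new position invalidates the cached payload
        return in.nextPosition();
    }

    @Override public int getPayloadLength() { return in.getPayloadLength(); }

    @Override public byte[] getPayload(byte[] data, int offset) {
        if (cached == null) {
            // The only call that reaches the wrapped, load-once source.
            cached = Arrays.copyOf(in.getPayload(null, 0), in.getPayloadLength());
        }
        if (data == null || data.length - offset < cached.length) {
            return Arrays.copyOf(cached, cached.length); // fresh array, payload at index 0
        }
        System.arraycopy(cached, 0, data, offset, cached.length);
        return data;
    }
}

// Made-up in-memory source that refuses a second load per position.
class OneShotPayloadSource implements PayloadSource {
    private final byte[][] payloads;
    private int pos = -1;
    private boolean loaded;

    OneShotPayloadSource(byte[][] payloads) { this.payloads = payloads; }

    @Override public int nextPosition() { loaded = false; return ++pos; }

    @Override public int getPayloadLength() { return payloads[pos].length; }

    @Override public byte[] getPayload(byte[] data, int offset) {
        if (loaded) throw new IllegalStateException("Can't load payload more than once");
        loaded = true;
        byte[] p = payloads[pos];
        if (data == null || data.length - offset < p.length) {
            data = new byte[p.length];
            offset = 0;
        }
        System.arraycopy(p, 0, data, offset, p.length);
        return data;
    }
}
```

Wrapping a source as new CachingPayloadSource(source) lets a caller ask for the payload at the current position repeatedly. Every call still pays for an array copy, and the sketch allocates a new cache array per position, so it does not settle the grow/shrink policy question raised in the quoted message.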


On Nov 29, 2007, at 5:34 PM, Michael Busch wrote:

> I designed the API with this limitation intentionally, to prevent users
> from thinking that they can call TermPositions.getPayload() more than
> once at no cost.
>
> If we allow it to be called more than once, then we have to seek back in
> the posting stream. Even if this is just a seek in the underlying
> IndexInput buffer, we still have to perform an arraycopy from that
> buffer to the array that getPayload() returns. If the beginning of the
> payload is already outside the current buffer, then a seek on the HD
> will happen in addition, which is even more expensive.
>
> So I'd like to keep the API as is. An application should always be able
> to buffer a payload byte[] array if it needs to access it more than
> once. For convenience, a user could also create a very simple
> TermPositions decorator that caches the most recently loaded payload and
> allows calling getPayload() more than once.
>
> However, I hesitate to add such payload caching to
> SegmentTermPositions, because the size of the payloads is
> application-specific, and so should be the policy that grows/shrinks a
> caching byte[] array.
>
> -Michael
> Grant Ingersoll wrote:
>> In working on LUCENE-1001, things are getting a bit complicated with
>> loading payloads in overlapping spans (which causes the dreaded "Can't
>> load payload more than once" error).
>>
>> This got me thinking about why we need the rule that payloads can only
>> be loaded once.  I forget the reasoning behind this.  Can we just store
>> the current position before we load the payload and then seek back to
>> that point if we need to load the payload again?  I suppose in the
>> case of really large payloads the seek on the IndexInput could be
>> expensive, but in reality, most payloads aren't likely to be more than
>> a few bytes, right?  There also seem to be some interactions with the
>> lazy skipping that I haven't quite pinned down yet.  What else am I
>> forgetting?
>>
>> The other alternative I can think of is that I could cache the
>> payloads, but that seems unwieldy too.
>>
>> -Grant

Grant Ingersoll
