lucene-java-user mailing list archives

From Michael McCandless
Subject Re: Flex & Docs/AndPositionsEnum
Date Wed, 10 Feb 2010 11:58:01 GMT
On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey wrote:

>> Interesting... and segment merging just does its own private
>> concatenation/mapping-around-deletes of the doc/positions?
> I think the answer is yes, but I'm not sure I understand the
> question completely since I'm not sure why you'd ask that in this
> context.

Segment merging is one place that "legitimately" needs to append
docs/positions enum of multiple sub readers... but obviously it can
just do this itself (and it must, since it renumbers the docIDs).
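
To make the merge case concrete, here is a minimal sketch of that
private concatenation, in plain Java rather than against the real
merger.  The names (MergeRemapSketch, buildDocMap, mergePostings) are
hypothetical, and it assumes each segment hands over one term's
postings as an int[] of segment-local docIDs plus a boolean[] of
deletions.  Deleted docs are dropped and the survivors renumbered
densely:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; this is not Lucene's merge code. */
class MergeRemapSketch {

    /**
     * Builds an old-docID to new-docID map for one segment.
     * A deleted doc maps to -1; live docs are numbered from newDocBase.
     */
    static int[] buildDocMap(boolean[] deleted, int newDocBase) {
        int[] map = new int[deleted.length];
        int next = newDocBase;
        for (int i = 0; i < deleted.length; i++) {
            map[i] = deleted[i] ? -1 : next++;
        }
        return map;
    }

    /** Concatenates one term's postings across segments, skipping deleted docs. */
    static List<Integer> mergePostings(int[][] segDocs, boolean[][] segDeleted) {
        List<Integer> merged = new ArrayList<>();
        int newDocBase = 0;
        for (int s = 0; s < segDocs.length; s++) {
            int[] map = buildDocMap(segDeleted[s], newDocBase);
            for (int doc : segDocs[s]) {
                if (map[doc] != -1) {
                    merged.add(map[doc]);
                }
            }
            // The next segment starts after this segment's live docs only.
            for (boolean d : segDeleted[s]) {
                if (!d) {
                    newDocBase++;
                }
            }
        }
        return merged;
    }
}
```

In this sketch the base advances by the live doc count, not maxDoc(),
so the arithmetic differs from what a read-time Multi*Enum would do.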

>> what's a "flat positions space"?
> It's something Google once used.  Instead of positions starting with
> 0 at each document, they just keep going.
>  doc 1:  "Three Blind Mice"           - positions 0, 1, 2
>  doc 2:  "Peter Peter Pumpkin Eater"  - positions 3, 4, 5, 6
>> And we don't return "objects or aggregates" with Multi*Enum now...
> Yeah, this is different.  In KS right now, we use a generic
> PostingList, which conveys different information depending on what
> class of Posting it contains.


>> In flex right now the codec is unaware that it's being "consumed" by
>> a Multi*Enum.
> Right, but in KinoSearch's case PostingList had to be aware of that
> because the Posting object could be consumed at either the segment
> level or the index level -- so it needed a setDocBase(offset) method
> which adjusted the doc num in the Posting.  It was messy.
> The change I made was to eliminate PolyPostingList and
> PolyPostingListReader, which made it possible to remove the
> setDocBase() method from SegPostingList.

But why didn't you have the Multi*Enums layer add the offset (so that
the codec need not know who's consuming it)?  Performance?
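
Roughly what I have in mind, as a sketch only: the per-segment enum
returns segment-local docIDs and never sees an offset, and the multi
layer adds the doc base on the way out.  The type names here
(SegDocsEnumSketch, ArraySegDocsEnum, MultiDocsEnumSketch) are made up
for illustration and are not the flex API:

```java
/** Hypothetical per-segment enum; it only knows segment-local docIDs. */
interface SegDocsEnumSketch {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Returns the next segment-local docID, or NO_MORE_DOCS when exhausted. */
    int nextDoc();
}

/** Toy "codec" enum backed by a sorted array of local docIDs. */
class ArraySegDocsEnum implements SegDocsEnumSketch {
    private final int[] docs;
    private int upto = 0;

    ArraySegDocsEnum(int... docs) {
        this.docs = docs;
    }

    public int nextDoc() {
        return upto < docs.length ? docs[upto++] : NO_MORE_DOCS;
    }
}

/** Hypothetical multi layer: walks the sub enums in order and adds each doc base. */
class MultiDocsEnumSketch {
    private final SegDocsEnumSketch[] subs;
    private final int[] docBases;
    private int current = 0;

    MultiDocsEnumSketch(SegDocsEnumSketch[] subs, int[] maxDocs) {
        this.subs = subs;
        this.docBases = new int[subs.length];
        int base = 0;
        for (int i = 0; i < subs.length; i++) {
            docBases[i] = base;   // sum of maxDoc() of all earlier segments
            base += maxDocs[i];
        }
    }

    int nextDoc() {
        while (current < subs.length) {
            int doc = subs[current].nextDoc();
            if (doc != SegDocsEnumSketch.NO_MORE_DOCS) {
                return docBases[current] + doc;
            }
            current++;            // this segment is exhausted, move to the next
        }
        return SegDocsEnumSketch.NO_MORE_DOCS;
    }
}
```

The cost in this sketch is one extra add per doc in the multi layer,
which is the performance side of the question.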

>> It still returns primitives.  If instead we returned an int[] for
>> positions (hmm -- may be a good reason to make positions be an
>> Attribute, Uwe), I think it would still be OK?
> In the flat positions space example, it would be necessary to add an
> offset to each of the positions in that array.  Each segment would
> have a "positions max" analogous to maxDoc(); these would be summed
> to obtain the positions offset the same way we add up maxDoc() now
> to obtain the doc id offset.

OK, but [so far] we don't have that problem with the flex APIs -- the
codec is not aware that there's a multi enum layer consuming it.
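
For reference, the offset arithmetic Marvin describes could be
sketched as below.  This is only an illustration of the flat positions
idea under assumed names (FlatPositionsSketch, positionBases, toFlat);
neither Lucene nor KS has a per-segment "positions max" API that I'm
quoting here:

```java
/** Hypothetical sketch of offsetting positions into a flat, index-wide space. */
class FlatPositionsSketch {

    /**
     * Given each segment's "positions max" (the number of positions it holds),
     * returns the starting offset of each segment in the flat space.
     * This mirrors how maxDoc() values are summed to get doc id offsets.
     */
    static int[] positionBases(int[] positionsMax) {
        int[] bases = new int[positionsMax.length];
        int sum = 0;
        for (int i = 0; i < positionsMax.length; i++) {
            bases[i] = sum;
            sum += positionsMax[i];
        }
        return bases;
    }

    /** Shifts a segment-local positions array by that segment's base. */
    static int[] toFlat(int[] localPositions, int base) {
        int[] flat = new int[localPositions.length];
        for (int i = 0; i < localPositions.length; i++) {
            flat[i] = localPositions[i] + base;
        }
        return flat;
    }
}
```

With an int[] return, that add has to touch every element of the
array, where the docID case needs only one add per call.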

> That example may not be a deal breaker for you, but I'm not willing
> to guarantee that Lucy will always return primitives from these
> enums, now and forever, one per method call.

But it'd be a major API change down the road to change this, for
Lucy/KS?  I.e., this example seems not to apply to Lucene, and even for
KS/Lucy seems contrived -- neither Lucene nor KS/Lucy would/could up
and make such a major API change to the enums, once "committed".

Also, this is why we're adding Attribute* to all the postings enums,
with flex -- any codec & consumer can use their own private
attributes.  The attrs pass through Multi*Enum.

>> Still torn... I think it's convenience vs performance.
> But convenience for the posting format plugin developer matters too,
> right?

Right, but the existence of Multi*Enums isn't affecting the codec dev
(so far, I think).

> Are you confident that a generic aggregator can support all possible
> codecs, or will plugin developers be forced to ensure that
> aggregation works because you've guaranteed to users like Renaud
> that it will?

Well... pretty confident.  So far, at least?  We have an existence
proof :) The codec API really should not (and, should not have to)
bake in details of who's consuming it.

