lucene-java-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flex & Docs/AndPositionsEnum
Date Tue, 09 Feb 2010 21:44:16 GMT
On Tue, Feb 09, 2010 at 03:47:19PM -0500, Michael McCandless wrote:

> Interesting... and segment merging just does its own private
> concatenation/mapping-around-deletes of the doc/positions?

I think the answer is yes, but I may not fully understand the question, since
I don't see why it comes up in this context.

> what's a "flat positions space"?  

It's something Google once used.  Instead of restarting at 0 for each
document, positions keep counting up across document boundaries.

  doc 1:  "Three Blind Mice"           - positions 0, 1, 2
  doc 2:  "Peter Peter Pumpkin Eater"  - positions 3, 4, 5, 6

> And we don't return "objects or aggregates" with Multi*Enum now...

Yeah, this is different.  In KS right now, we use a generic PostingList, which
conveys different information depending on what class of Posting it contains.

> In flex right now the codec is unaware that it's being "consumed" by a
> Multi*Enum.

Right, but in KinoSearch's case PostingList had to be aware of that because
the Posting object could be consumed at either the segment level or the index
level -- so it needed a setDocBase(offset) method which adjusted the doc num in
the Posting.  It was messy.

The change I made was to eliminate PolyPostingList and PolyPostingListReader,
which made it possible to remove the setDocBase() method from SegPostingList.
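
To make the messiness concrete, here is a minimal sketch in Java of a
segment-level posting list that has to carry a doc base.  The class name
SketchSegPostingList and its methods are hypothetical and only illustrate the
idea; this is not KinoSearch's actual code or API.

```java
/** Hypothetical stand-in for a segment-level posting list (illustration only). */
final class SketchSegPostingList {
    private final int[] segmentDocNums; // doc nums relative to this segment
    private int docBase = 0;            // extra mutable state needed only for index-level consumers
    private int cursor = -1;

    SketchSegPostingList(int[] segmentDocNums) {
        this.segmentDocNums = segmentDocNums.clone();
    }

    /** Must be called by an index-level consumer; segment-level consumers leave it at 0. */
    void setDocBase(int docBase) {
        this.docBase = docBase;
    }

    /** Advances to the next posting; returns false when the list is exhausted. */
    boolean next() {
        return ++cursor < segmentDocNums.length;
    }

    /** Doc num of the current posting, shifted by whatever doc base was set. */
    int docNum() {
        return docBase + segmentDocNums[cursor];
    }
}
```

The same object reports different doc nums depending on whether someone
called setDocBase() first, which is the coupling described above.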

> It still returns primitives.  If instead we returned an int[] for positions
> (hmm -- may be a good reason to make positions be an Attribute, Uwe), I
> think it would still be OK?

In the flat positions space example, it would be necessary to add an offset to
each of the positions in that array.  Each segment would have a "positions
max" analogous to maxDoc(); these would be summed to obtain the positions
offset the same way we add up maxDoc() now to obtain the doc id offset.
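
A minimal sketch of that arithmetic in Java, assuming each segment exposes a
"positions max" as an int.  The class FlatPositionOffsets and its method
names are hypothetical; nothing like this exists in Lucene or Lucy as far as
this message is concerned.

```java
/** Hypothetical helper for a flat positions space (illustration only). */
final class FlatPositionOffsets {
    private FlatPositionOffsets() {}

    /**
     * Returns the starting positions offset of each segment: the running sum
     * of the "positions max" of all preceding segments, the same way doc id
     * offsets are built from maxDoc().
     */
    static int[] segmentOffsets(int[] positionsMaxPerSegment) {
        int[] offsets = new int[positionsMaxPerSegment.length];
        int running = 0;
        for (int i = 0; i < positionsMaxPerSegment.length; i++) {
            offsets[i] = running;
            running += positionsMaxPerSegment[i];
        }
        return offsets;
    }

    /** Shifts segment-level positions into the index-level flat space. */
    static int[] toIndexLevel(int[] segmentPositions, int offset) {
        int[] shifted = new int[segmentPositions.length];
        for (int i = 0; i < segmentPositions.length; i++) {
            shifted[i] = segmentPositions[i] + offset;
        }
        return shifted;
    }
}
```

If the two nursery-rhyme documents lived in separate segments, the first
segment's positions max would be 3, so the second segment's local positions
0, 1, 2, 3 would be shifted to 3, 4, 5, 6.  The point is that an aggregator
has to touch every element of the array, not just pass a primitive through.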

That example may not be a deal breaker for you, but I'm not willing to
guarantee that Lucy will always return primitives from these enums, now and
forever, one per method call.

> Still torn... I think it's convenience vs performance.  

But convenience for the posting format plugin developer matters too, right?
Are you confident that a generic aggregator can support all possible codecs,
or will plugin developers be forced to ensure that aggregation works because
you've guaranteed to users like Renaud that it will?

Marvin Humphrey

