Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <20100210194227.GA18842@rectangular.com>
References: <4B716C5F.6020907@deri.org> <4B718EEB.3070605@deri.org>
	 <9ac0c6aa1002090851v1183e96fybbcd61f1a277b00a@mail.gmail.com>
	 <20100209181235.GA15349@rectangular.com>
	 <9ac0c6aa1002091247j36033b9bt65c787d8703d371e@mail.gmail.com>
	 <20100209214416.GA17405@rectangular.com>
	 <9ac0c6aa1002100358v5e51c743q65054b81e9da5067@mail.gmail.com>
	 <20100210132743.GA13836@rectangular.com>
	 <9ac0c6aa1002100933g2f97801fy6c3e7b0064fe4ddf@mail.gmail.com>
	 <20100210194227.GA18842@rectangular.com>
Date: Thu, 11 Feb 2010 08:30:14 -0500
Message-ID: <9ac0c6aa1002110530k756d0f57rdd5aad89430ca716@mail.gmail.com>
Subject: Re: Flex & Docs/AndPositionsEnum
From: Michael McCandless <lucene@mikemccandless.com>
To: java-user@lucene.apache.org
Content-Type: text/plain; charset=ISO-8859-1

On Wed, Feb 10, 2010 at 2:42 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Wed, Feb 10, 2010 at 12:33:27PM -0500, Michael McCandless wrote:
>
>> In Lucene, skipping is done through the aggregator,
>
> I had a look at MultiDocsEnum in the flex blanch.  It doesn't know when
> sub-enum is reading skip data.

I'm confused -- the MultiDocsEnum's advance method impl is the only
place where we invoke advance on the sub readers.  Oh you're saying we
don't know if the underlying enum actually skipped vs just scanned?

Isn't the skip data also based on deltas?  So even if real skipping
happened, Lucy/KS would not "lose" the offset that the aggregator had
previously added?  Or maybe I'm lost on what the issue is here...

>> > I suppose another possibility would have been to have the aggregator
>> > keep its own Posting and copy all data over from the
>> > SegPostingList's Posting on each iteration then add its offset.
>>
>> I think this is what Lucene does (?).  EG the aggregator holds its own
>> "int doc" which it must copy to (adding the offset) from the
>> underlying sub enum.
>
> That's fine for a *primitive* type.  Modifying an int returned by a sub-enum
> doesn't affect the sub-enum.  :)
>
> The problem arises when there's an opaque *object* conveying data to the
> consumer.  The aggregator knows everything there is to know about an int, but
> it doesn't know what it needs to do to prepare an opaque object owned by the
> sub-enum for consumption at the aggregate level.

OK.

>> > However, that would have been a lot less efficient, and it still
>> > wouldn't have worked for the "flat positions space" example because
>> > the generic aggregator would not have known about the needs of the
>> > specific codec.
>>
>> But aggregator could also add the positions offset on each
>> nextPosition() call, in Lucene.  Like that use case could be made to
>> work, if Lucene had used a flat position space.
>
> A generic aggregator wouldn't know that it needed to do that.  The postings
> codec developer would be forced to write aggregation code in addition to
> segment-level code.

Right, if position were not primitive but contained within an opaque
(to the aggregator) object.  And, you were doing the flat positions
space.

I guess... this restriction still seems academic... ie, not a real
issue in Lucene.  We use primitives in Lucene for doc/position, which
we can remap as needed.  We then require that opaque stuff (using
attributes) "survive", unchanged, when passed through the aggregator.
Either that, or, you enum segment by segment in the code.  I don't [yet]
see this as an issue for Lucene...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org