lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Flex & Docs/AndPositionsEnum
Date Tue, 09 Feb 2010 20:47:19 GMT
On Tue, Feb 9, 2010 at 1:12 PM, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Tue, Feb 09, 2010 at 11:51:31AM -0500, Michael McCandless wrote:
>
>> You should (when possible/reasonable) instead use
>> ReaderUtil.gatherSubReaders, then iterate through those sub readers
>> asking each for its flex fields.
>
>> But if this is only for testing purposes, and Multi*Enum is more
>> convenient (and, once attrs work correctly), then Multi*Enum is
>> perfectly fine.
>
> Mike, FWIW, I've removed the ability to iterate over posting data at
> anything other than the segment level from KS.  There's still a
> priority-queue-based aggregator for iterating over all terms in a
> multi-segment index, but not for anything lower.

Interesting... and segment merging just does its own private
concatenation/mapping-around-deletes of the doc/positions?
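
For reference, the kind of priority-queue-based aggregation over
per-segment term lists that Marvin describes can be sketched roughly
as below.  This is only an illustration under assumptions: the class
and method names (MergedTermsSketch, Cursor, mergeTerms) are
hypothetical, each segment is modeled as an already-sorted list of
strings, and none of it is KinoSearch or Lucene code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not KinoSearch or Lucene code.
final class MergedTermsSketch {

    // One queue entry: a segment's current term plus the rest of its terms.
    private static final class Cursor {
        String term;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.term = it.next();
        }
    }

    // Merge per-segment term lists (each assumed sorted) into one sorted
    // list in which a term shared by several segments appears once.
    static List<String> mergeTerms(List<List<String>> segments) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> a.term.compareTo(b.term));
        for (List<String> segment : segments) {
            Iterator<String> it = segment.iterator();
            if (it.hasNext()) {
                queue.add(new Cursor(it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            // Equal terms come out of the queue back to back, so comparing
            // against the last emitted term is enough to de-duplicate.
            if (merged.isEmpty()
                    || !merged.get(merged.size() - 1).equals(top.term)) {
                merged.add(top.term);
            }
            if (top.rest.hasNext()) {
                top.term = top.rest.next();
                queue.add(top);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("apple", "cherry"),
            Arrays.asList("banana", "cherry", "date"));
        System.out.println(mergeTerms(segments));
    }
}
```

Every step pays for a queue operation, which is the per-term overhead
a per-segment caller avoids.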

I'm torn on Multi*Enum... it's easy to get one "by accident"
(because you're interacting with a multi reader) and as a result take a
silent performance hit.  And often the caller can easily change to
operate per segment instead.
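
As a rough sketch of what "operate per segment" means for a caller:
walk the segments in order, read each segment's local doc IDs for the
term, and add a running doc base to get index-wide IDs.  The Segment
class and matchingDocs method below are hypothetical stand-ins chosen
for the example, not the flex API, and deletions are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-segment iteration; not the Lucene flex API.
final class PerSegmentSketch {

    // Stand-in for one segment: its doc count and, per term, the sorted
    // segment-local doc IDs containing that term.
    static final class Segment {
        final int maxDoc;
        final Map<String, int[]> postings;

        Segment(int maxDoc, Map<String, int[]> postings) {
            this.maxDoc = maxDoc;
            this.postings = postings;
        }
    }

    // Visit segments one at a time; no merging structure is needed because
    // each segment's docs all sort after the previous segment's docs.
    static List<Integer> matchingDocs(List<Segment> segments, String term) {
        List<Integer> result = new ArrayList<>();
        int docBase = 0;
        for (Segment segment : segments) {
            int[] localDocs = segment.postings.get(term);
            if (localDocs != null) {
                for (int localDoc : localDocs) {
                    result.add(docBase + localDoc);
                }
            }
            docBase += segment.maxDoc;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
            new Segment(3, Collections.singletonMap("lucene", new int[] {0, 2})),
            new Segment(2, Collections.singletonMap("lucene", new int[] {1})));
        System.out.println(matchingDocs(segments, "lucene"));
    }
}
```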

But, then, it's very convenient when you need it and don't care about
performance.  EG in Renaud's usage, a test case that is trying to
assert that all indexed docs look right, why should you be forced to
operate per segment?  He shouldn't have to bother with the details of
which field/term/doc was indexed into which segment.

Or, I guess we could argue that this test really should create a
TermQuery and walk the matching docs instead of using the low-level
flex enum APIs, because the search implementation already knows how to
step through the segments.

Anyway, my current patch on LUCENE-2111 reflects my torn-ness: it
makes it just a bit harder to get Multi*Enum on a multi-reader.  If
you call MultiReader.fields(), it throws
UnsupportedOperationException, and you must instead use
MultiFields.getXXXEnum to explicitly create the enum.
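
The shape of that trade-off, as a minimal self-contained sketch: the
composite refuses the implicit merged view, and the merged view is
only reachable through a separately named helper.  All names here
(TermSource, SegmentSource, CompositeSource, MergedView) are
hypothetical and chosen for the example; this is not the patch itself
and does not reproduce the MultiReader or MultiFields signatures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical names throughout; models the opt-in idea only and is not
// the LUCENE-2111 patch or any real Lucene class.
final class ExplicitOptInSketch {

    interface TermSource {
        List<String> terms();
    }

    // A single segment can answer directly.
    static final class SegmentSource implements TermSource {
        private final List<String> sortedTerms;

        SegmentSource(String... sortedTerms) {
            this.sortedTerms = Arrays.asList(sortedTerms);
        }

        @Override
        public List<String> terms() {
            return sortedTerms;
        }
    }

    // A composite refuses to hand out a merged view implicitly ...
    static final class CompositeSource implements TermSource {
        final List<SegmentSource> subs;

        CompositeSource(SegmentSource... subs) {
            this.subs = Arrays.asList(subs);
        }

        @Override
        public List<String> terms() {
            throw new UnsupportedOperationException(
                "iterate subs, or call MergedView.terms(composite) explicitly");
        }
    }

    // ... so the caller has to ask for the slower merged view by name.
    static final class MergedView {
        static List<String> terms(CompositeSource composite) {
            TreeSet<String> merged = new TreeSet<>();
            for (SegmentSource sub : composite.subs) {
                merged.addAll(sub.terms());
            }
            return new ArrayList<>(merged);
        }
    }

    public static void main(String[] args) {
        CompositeSource composite = new CompositeSource(
            new SegmentSource("a", "c"), new SegmentSource("b", "c"));
        try {
            composite.terms();
        } catch (UnsupportedOperationException e) {
            System.out.println("implicit merged view refused");
        }
        System.out.println(MergedView.terms(composite));
    }
}
```

The convenience is still one call away, but the call site now shows
that a merged view was requested.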

> Forcing pluggable index formats to support the extra level of indirection
> necessary for iterating postings from a high level both introduces
> inefficiency and constrains their development.  Consider what would happen if
> we tried indexed terms within a flat positions space and returned an array of
> positions instead of one position at a time.  The instant you return objects
> or aggregates rather than primitives, you force support for offsets down into
> the low-level decoder.

I don't understand this example -- can you give more detail?  Eg,
what's a "flat positions space"?  And what does "force support for
offsets" mean?  And we don't return "objects or aggregates" with
Multi*Enum now...

In flex right now the codec is unaware that it's being "consumed" by a
Multi*Enum.  It still returns primitives.  If instead we returned an
int[] for positions (hmm -- may be a good reason to make positions be
an Attribute, Uwe), I think it would still be OK?

> It's not really necessary to iterate aggregated postings across multiple
> segments, so IMO it's best to shunt users like Renaud towards the segment
> level.

Still torn... I think it's convenience vs performance.  But I
want convenience to be an explicit choice.  We shouldn't default our
APIs to a silent perf hit...

Mike
