lucene-java-user mailing list archives

From "Peter Keegan" <peterlkee...@gmail.com>
Subject Re: Payloads and PhraseQuery
Date Fri, 27 Jul 2007 18:54:36 GMT
I have a question about the way fields are analyzed and inverted by the
index writer. Currently, if a field has multiple occurrences in a document,
each occurrence is analyzed separately (see DocumentsWriter.processField).
Is it safe to assume that this behavior won't change in the future? The
reason I ask is that my custom analyzer's 'tokenStream' method creates a
custom filter which produces a payload based on the existence of each field
occurrence. However, if DocumentsWriter were changed to combine all the
occurrences before inversion, my scheme wouldn't work. Since payloads are
created by filters/tokenizers, it helps to keep things flexible.

Thanks,
Peter
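
For readers following the thread, the dependency described above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use Lucene, and OccurrenceAnalyzer, TaggedToken, invertSeparately and invertCombined are hypothetical names standing in for the custom analyzer, its tokens, and the two possible indexing behaviours. The payload here is simply the ordinal of the field occurrence, which may differ from what the real filter computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OccurrencePayloadSketch {

    /** A term plus a one-byte payload. A stand-in for a Lucene token, not the real class. */
    public static final class TaggedToken {
        public final String term;
        public final byte payload;

        TaggedToken(String term, byte payload) {
            this.term = term;
            this.payload = payload;
        }
    }

    /** Hypothetical analyzer that tags each token with the ordinal of its field occurrence. */
    public static final class OccurrenceAnalyzer {
        private final Map<String, Integer> occurrences = new HashMap<>();

        /** Clears the per-field counters so ordinals restart at 0 for each document. */
        public void startDocument() {
            occurrences.clear();
        }

        /** Assumed to be called once per field occurrence by the indexer. */
        public List<TaggedToken> tokenStream(String fieldName, String text) {
            int ordinal = occurrences.merge(fieldName, 1, Integer::sum) - 1;
            List<TaggedToken> out = new ArrayList<>();
            for (String term : text.trim().split("\\s+")) {
                if (!term.isEmpty()) {
                    out.add(new TaggedToken(term, (byte) ordinal));
                }
            }
            return out;
        }
    }

    /** Models per-occurrence analysis: one tokenStream call per field value. */
    public static List<TaggedToken> invertSeparately(
            OccurrenceAnalyzer analyzer, String field, List<String> values) {
        analyzer.startDocument();
        List<TaggedToken> all = new ArrayList<>();
        for (String value : values) {
            all.addAll(analyzer.tokenStream(field, value));
        }
        return all;
    }

    /** Models the feared change: all values are joined before a single analysis pass. */
    public static List<TaggedToken> invertCombined(
            OccurrenceAnalyzer analyzer, String field, List<String> values) {
        analyzer.startDocument();
        return analyzer.tokenStream(field, String.join(" ", values));
    }

    private static String render(List<TaggedToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (TaggedToken t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.term).append('/').append(t.payload);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OccurrenceAnalyzer analyzer = new OccurrenceAnalyzer();
        List<String> authors = Arrays.asList("jane doe", "john smith");
        // prints: jane/0 doe/0 john/1 smith/1
        System.out.println(render(invertSeparately(analyzer, "author", authors)));
        // prints: jane/0 doe/0 john/0 smith/0
        System.out.println(render(invertCombined(analyzer, "author", authors)));
    }
}
```

In this model the occurrence ordinal survives only when each value is analyzed on its own; once the values are combined before analysis, every token receives payload 0 and the per-occurrence information is lost.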


On 7/12/07, Grant Ingersoll <gsingers@apache.org> wrote:
>
>
> On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote:
>
>
> >
> > Hmm... okay so the issue is that in order to get the payload data, you
> > have to have a TermPositions instance.
> >
> > instead of adding getPayload methods to the Spans class (which as Paul
> > points out, can have nesting issues) perhaps more general solutions
> > would be:
> >
> > a) a more high level getPayload API that lets you get a payload
> > arbitrarily for a doc/position (perhaps as part of the TermDocs
> > API?) ... then for Spans you could use this new API with
> > Spans.start() and Spans.end(). (and all the positions in between)
>
> Not sure I follow this.  I don't see the fit w/ TermDocs.
> >
> > b) add a variation of the TermPositions class to allow people to
> > iterate through the terms of a TermDoc in position order
> > (TermPositions first iterates over the Terms and then over the
> > positions) ... then you could seek(span.start()) to get the Payload
> > data
> >
> > c) add methods to the Spans API to get the subspans (if any) ... this
> > would be the Spans corollary to getTerms() and would always return
> > TermSpans which would have TermPositions for getting payload data.
>
>
> This could be a good alternative.
>
> When we first talked about payloads we wondered if we could just make
> all Queries into SpanQueries by passing TermPositions instead of term
> docs, but in the end decided not to do it because of performance
> issues (some of which are lessened by lazy loading of TermPositions).
>
> The thing is, I think, that the Spans is already moving you along in
> the term positions, so it just seems like a natural fit to have it
> there, even if there is nesting.  It doesn't seem like it would be
> that hard to then return back the nesting stuff b/c you are just
> collating the results from the underlying SpanTermQuery.  Having said
> that, I haven't looked into the actual code, so take that w/ a grain
> of salt.
>
> I will try to do some more investigation, as others are welcome to
> do.  Perhaps we should move this to dev?
>
> Cheers,
> Grant
