lucene-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7526) Improvements to UnifiedHighlighter OffsetStrategies
Date Sat, 29 Oct 2016 15:46:58 GMT


ASF GitHub Bot commented on LUCENE-7526:

Github user Timothy055 commented on the issue:
    I don't think there's a way to avoid keeping the position state, unfortunately.  The reason is that we can move one of the postings enums to the next position, but then realize the next position for that term is behind the position for a different term (and postings enum) that also matches the wildcard.  Then we'll update the top and switch to the next postings enum (by offset now), but once it's exhausted, or once interleaving switches us back to the previous one, the position is lost.  :/  An alternative would be to change PostingsEnum to allow fetching the current position; then nearly all the housekeeping would go away.
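To illustrate the interleaving the comment describes, here is a minimal, self-contained sketch (not Lucene's real PostingsEnum API; `TermPositions` and its fields are hypothetical stand-ins): several term "enums" matched by one wildcard are merged in position order, and each must cache its own cursor so that switching away to another enum and back does not lose its place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of merging positions across several enums matched by a wildcard.
// Each TermPositions keeps a cached cursor (idx) that survives interleaving.
public class PositionMerge {
    static final class TermPositions {
        final String term;
        final int[] positions;   // sorted positions of this term in the doc
        int idx = 0;             // cached position state per enum
        TermPositions(String term, int[] positions) {
            this.term = term;
            this.positions = positions;
        }
        boolean exhausted() { return idx >= positions.length; }
        int current() { return positions[idx]; }
    }

    // Emit "term@position" in ascending position order across all enums.
    public static List<String> merge(TermPositions... enums) {
        PriorityQueue<TermPositions> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.current(), b.current()));
        for (TermPositions tp : enums) {
            if (!tp.exhausted()) pq.add(tp);
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            TermPositions top = pq.poll();
            out.add(top.term + "@" + top.current());
            top.idx++;                          // advance only the top enum
            if (!top.exhausted()) pq.add(top);  // re-enter with cached cursor
        }
        return out;
    }

    public static void main(String[] args) {
        // "foo*" matches both "foo" and "food"; their positions interleave.
        TermPositions foo  = new TermPositions("foo",  new int[]{1, 5});
        TermPositions food = new TermPositions("food", new int[]{2, 3});
        System.out.println(merge(foo, food)); // [foo@1, food@2, food@3, foo@5]
    }
}
```

If the cursor lived only in the merge loop rather than on each enum, switching from `foo` to `food` and back would have to re-seek `foo` from the start, which is exactly the state-keeping the comment says cannot currently be avoided.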

> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>                 Key: LUCENE-7526
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 6.4
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a MemoryIndex for producing offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a MemoryIndex.  Can only be used if the query distills down to terms and automata.
> * TokenStream removal
>   ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill the memory index and then, once consumed, a new one was generated by uninverting the MemoryIndex back into a TokenStream if automata (wildcard/mtq queries) were involved.  Now this is avoided, which should save memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - refactored in a similar way to avoid generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for wildcard/mtq queries.  This should improve relevancy by providing unified metrics for a wildcard across all its term matches.
> * Added a HighlightFlag for enabling the newly separated TokenStreamOffsetStrategy, since it can adversely affect passage relevancy
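The "unified metrics" point for CompositePostingsEnum can be sketched simply (a hypothetical illustration, not Lucene's implementation): instead of scoring each expanded term of a wildcard as a separate rare term, the composite view reports one combined frequency for the whole wildcard.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: combine per-term frequencies of all terms matched
// by one wildcard, so passage scoring sees a single pseudo-term.
public class CompositeFreq {
    // termFreqs: frequency of each wildcard-matched term within one passage
    public static int compositeFreq(Map<String, Integer> termFreqs) {
        int total = 0;
        for (int f : termFreqs.values()) {
            total += f;  // unified metric across all matched terms
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> matches = new LinkedHashMap<>();
        matches.put("foo", 2);   // matched by foo*
        matches.put("food", 3);  // matched by foo*
        System.out.println(compositeFreq(matches)); // prints 5
    }
}
```

Scoring the combined count of 5 rather than two independent counts of 2 and 3 is what "unified metrics for a wildcard across all its term matches" refers to.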

This message was sent by Atlassian JIRA

