lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8848) UnifiedHighlighter should highlight all Query types that implement Weight.matches
Date Mon, 10 Jun 2019 18:05:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16860211#comment-16860211
] 

David Smiley commented on LUCENE-8848:
--------------------------------------

As an aside, I think it's a shame that we visit the query soooo many times:
 * UH.extractTerms calls rewrite which if you think about it is it's own form of visiting.
Also; extractTerms _does not need to do this_; I removed the EMPTY_INDEXSEARCHER.rewrite and
the tests passed!
 * UH.extractTerms using QueryVisitor.
 * UH.hasUnrecognizedQuery (new in this patch)
 * MultiTermHighlighting.extractAutomata
 * PhraseHelper, if used

This weekend I was imagining a significant refactor involving one visitor subsuming all this.
One visitor collecting terms, automata, detecting unrecognized stuff, and the PhraseHelper
being recast as a subclass of this visitor which adds on it's activities (including SpanQuery
conversion).

It's also true this is all done per field... maybe something could be done about that.  If
this looks interesting, I could take a stab at this and see how it looks. Separate issue of
course.

> UnifiedHighlighter should highlight all Query types that implement Weight.matches
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-8848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8848
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8848.patch
>
>
> The UnifiedHighlighter internally extracts terms and automata from the query.  Usually
this works perfectly but it's possible a Query might be of a type it doesn't know -- a leaf
query that is perhaps in effect similar to a MultiTermQuery yet it might not even be a subclass
of this or it does but the UH doesn't know how to extract an automata from it.  The UH is
oblivious to this and probably won't highlight this query.  If re-analysis of the text is
necessary, the UH will pre-filter all terms to only those it _thinks_ are pertinent.  Or if
offsets are in the postings then the UH could perform very poorly by unleashing this query
on the index for each highlighted document without recognizing re-analysis is a more appropriate
path.
> I think to solve this, the UnifiedHighlighter.getFieldHighlighter needs to inspect the
query (using a QueryVisitor) to see if it can find a leaf query that is not one it knows how
to pull automata from, and is otherwise not in a special list (like MatchAllDocsQuery).  If
we find one, we avoid choosing OffsetSource.POSTINGS or OffsetSource.NONE_NEEDED since we
might in effect have an MTQ like query.  If a MemoryIndex is needed then we don't pre-filter
the terms since we can't assume we know precisely which terms are pertinent.
> We needn't bother extracting terms & automata in this case either; it's wasted effort
which can involve building a CharacterRunAutomaton (see MultiTermHighlighting.binaryToCharRunAutomaton).
 Speaking of which, it'd be nice to avoid that in other cases as well, like for WEIGHT_MATCHES
when we aren't using MemoryIndex (thus no term pre-filtering).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message