lucene-dev mailing list archives

From "Timothy M. Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7438) UnifiedHighlighter
Date Wed, 07 Sep 2016 14:59:21 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470848#comment-15470848
] 

Timothy M. Rodriguez commented on LUCENE-7438:
----------------------------------------------

Some additional information:

h2. Missing features & possible future improvements:
Despite this highlighter’s offset source flexibility and accuracy options, the other highlighters
still have some unique features.  The following features are in the standard Highlighter (and
possibly FastVectorHighlighter) but are not in the UnifiedHighlighter (and thus not in the
PostingsHighlighter either, since the UH is derived from the PH):
* Being able to disable “requireFieldMatch”, so that query terms are highlighted regardless of
which fields are mentioned in the query.
* Using boosts in the query to weight passages.
* Regex-based passage delineation, though I’m unsure if anyone cares given the existing
BreakIterator options available.
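
For reference, regex-based passage delineation (the last bullet above) could look roughly like the following. This is a minimal JDK-only sketch and not code from any of the Lucene highlighters; the class name RegexPassages and the delimiter pattern are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: splits text into passages
// wherever the delimiter regex matches.
final class RegexPassages {
    static List<String> split(String text, Pattern delimiter) {
        List<String> passages = new ArrayList<>();
        Matcher m = delimiter.matcher(text);
        int start = 0;
        while (m.find()) {
            // Emit the text between the previous delimiter and this one.
            if (m.start() > start) {
                passages.add(text.substring(start, m.start()));
            }
            start = m.end();
        }
        // Emit whatever follows the last delimiter.
        if (start < text.length()) {
            passages.add(text.substring(start));
        }
        return passages;
    }

    public static void main(String[] args) {
        // Assumed delimiter: whitespace that follows sentence punctuation.
        Pattern sentenceGap = Pattern.compile("(?<=[.!?])\\s+");
        System.out.println(split("a b. c d! e", sentenceGap));
    }
}
```

A real highlighter would track character offsets rather than substrings, and a sentence BreakIterator already covers most of what such a pattern would be used for.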
Aside from addressing the feature gaps listed above, there are a couple of known things that
would be nice to add:
* The phrase highlighting (implemented by PhraseHelper) could be made more accurate, and probably
faster too, by using techniques in Alan’s Luwak system that uses the Lucene SpanCollector
API introduced in Lucene 5.3. It wasn’t done this way to begin with because this highlighter
was developed originally for Lucene 4.10.
* Wildcard queries usually use TokenStreamFromTermVector, which uninverts the terms out of
a Terms index.  Instead, we now think it would be better to create a PostingsEnum
for each matching term. This would bring about some simplifications and efficiencies, and
can lead to better passage relevancy. A bonus would be aggregating terms matching the same
automaton into a merged PostingsEnum that has a freq() based on the sum of the underlying matching
terms.
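
To make the merged-PostingsEnum idea in the last bullet concrete, here is a plain-Java model of it. It deliberately does not extend or call Lucene’s PostingsEnum; the class MergedPositions and its methods are hypothetical and only mirror the idea of one position stream whose freq() is the sum over the matching terms.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model, NOT Lucene's PostingsEnum: merges the positions of
// several matching terms within one document into one sorted stream.
final class MergedPositions {
    private final int[] positions;
    private int next = 0;

    MergedPositions(List<int[]> perTermPositions) {
        int total = 0;
        for (int[] p : perTermPositions) {
            total += p.length;
        }
        positions = new int[total];
        int at = 0;
        for (int[] p : perTermPositions) {
            System.arraycopy(p, 0, positions, at, p.length);
            at += p.length;
        }
        // Each input is sorted; sorting the concatenation keeps the sketch short.
        Arrays.sort(positions);
    }

    // Sum of the frequencies of the underlying terms.
    int freq() {
        return positions.length;
    }

    // Next position in document order, across all merged terms.
    int nextPosition() {
        return positions[next++];
    }
}
```

For example, if a wildcard matched one term at positions 3 and 40 and another at position 17, the merged view would report a freq() of 3 and yield 3, 17, 40 in order.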

h2. Changes from the PostingsHighlighter 
* The UH is more stateful
** Holds the IndexSearcher instead of asking most methods to pass it through.
** Options now have simple setters, and the per-field getters return these. This means the
common case of a setting being non-specific to a field doesn’t require subclassing.
* Multi-valued field handling is improved to ensure that a passage will never span across
values, plus it honors the positionIncrementGap for an analyzed offset source. See MultiValueTokenStream
and SplittingBreakIterator.
* The PH caches all content to be highlighted for all docs and then highlights it all.  The
UH has a limit on this, which led to a batching approach.  But if all fields use an Analyzer
or if more than one uses term vectors, then highlighting instead happens one doc at a time,
since the up-front content caching is not helpful.
* No longer tries to re-use PostingsEnums (or TermsEnum or LeafReader) from one doc to the
next. Dropping the re-use really simplified some code; the re-use didn’t seem worth the complexity.
* MultiTermHighlighting’s fake PostingsEnum was made Closeable and we close it to guard
against ramifications of exceptions being thrown during highlighting (e.g. a BreakIterator
bug or TokenStream bug). Nasty to debug!
* (from standard Highlighter) TokenStreamFromTermVector: optimizations to uninvert filtered
(thus sparse) Terms.
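
The multi-valued guarantee above (a passage never spans across values) can be sketched with the JDK’s BreakIterator alone. This is not the SplittingBreakIterator from the patch; the class SplitPassages, the joined-with-a-separator representation, and the choice of a sentence iterator are assumptions, and positionIncrementGap is not modeled.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Lucene's SplittingBreakIterator: finds passage
// boundaries inside each value separately so no passage crosses a separator.
final class SplitPassages {
    /** Returns [start, end) offsets into the joined text. */
    static List<int[]> passages(String joined, char separator) {
        List<int[]> spans = new ArrayList<>();
        int valueStart = 0;
        while (valueStart <= joined.length()) {
            int valueEnd = joined.indexOf(separator, valueStart);
            if (valueEnd < 0) {
                valueEnd = joined.length();
            }
            // Run the sentence iterator over this value only.
            BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
            sentences.setText(joined.substring(valueStart, valueEnd));
            int start = sentences.first();
            for (int end = sentences.next(); end != BreakIterator.DONE;
                 start = end, end = sentences.next()) {
                // Shift value-local boundaries back into joined-text offsets.
                spans.add(new int[] {valueStart + start, valueStart + end});
            }
            valueStart = valueEnd + 1; // skip the separator itself
        }
        return spans;
    }
}
```

Because boundaries are computed per value, the separator character never falls inside a returned span.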

h2. Non-Core Dependencies
* MemoryIndex: For Analyzer based highlighting when phrases need to be highlighted accurately.
* Standard Highlighter things:
** TokenStreamFromTermVector: For most multi-term queries. The UH actually has its own derived
copy that has been optimized to handle filtered (thus sparse) Terms. With further work, we
could switch to a different approach and remove it (as indicated earlier).  For as long as
it stays, it’s also possible to replace the existing one with this if we want to do that.
** WeightedSpanTermExtractor: For highlighting phrases accurately, to re-use its SpanQuery
conversion and rewrite-detecting abilities.  Perhaps these parts of WSTE could move to general
SpanQuery utilities.
** TermVectorLeafReader: When highlighting offsets from term vectors.
* PostingsHighlighter things:
** Technically nothing; however, the UH has its own copies of several classes that have not been modified:
Passage, PassageScorer, PassageFormatter, DefaultPassageFormatter.
** Note: Utility BreakIterators are of use to the PH, UH, and even the FVH: WholeBreakIterator,
CustomSeparatorBreakIterator.  Maybe they should move to a utils package that isn’t in any
of these highlighters?


> UnifiedHighlighter
> ------------------
>
>                 Key: LUCENE-7438
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7438
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 6.2
>            Reporter: Timothy M. Rodriguez
>
> The UnifiedHighlighter is an evolution of the PostingsHighlighter that is able to highlight
using offsets in either postings, term vectors, or from analysis (a TokenStream). Lucene’s
existing highlighters are mostly demarcated along offset source lines, whereas here it is
unified -- hence this proposed name. In this highlighter, the offset source strategy is separated
from the core highlighting functionality. The UnifiedHighlighter further improves on the PostingsHighlighter’s
design by supporting accurate phrase highlighting using an approach similar to the standard
highlighter’s WeightedSpanTermExtractor. The next major improvement is a hybrid offset source
strategy that utilizes postings and “light” term vectors (i.e. just the terms) for highlighting
multi-term queries (wildcards) without resorting to analysis. Phrase highlighting and wildcard
highlighting can both be disabled if you’d rather highlight a little faster albeit not as
accurately reflecting the query.
> We’ve benchmarked an earlier version of this highlighter comparing it to the other
highlighters and the results were exciting! It’s tempting to share those results but it’s
definitely due for another benchmark, so we’ll work on that. Performance was the main motivator
for creating the UnifiedHighlighter, as the standard Highlighter (the only one meeting Bloomberg
Law’s accuracy requirements) wasn’t fast enough, even with term vectors along with several
improvements we contributed back, and even after we forked it to highlight in multiple threads.
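
The separation of offset source from highlighting logic described in the issue can be pictured as a small selection step. The sketch below is hypothetical: the enum, the method, and the exact fallback rules are assumptions inferred from the description above (in particular, that wildcard highlighting over postings alone falls back to analysis), not code taken from the patch.

```java
// Hypothetical sketch; names and rules are illustrative, not from the patch.
final class OffsetSourceChooser {
    enum OffsetSource { POSTINGS, POSTINGS_WITH_TERM_VECTORS, TERM_VECTORS, ANALYSIS }

    static OffsetSource choose(boolean offsetsInPostings,
                               boolean hasTermVectors,
                               boolean termVectorsHaveOffsets,
                               boolean highlightsWildcards) {
        if (offsetsInPostings) {
            if (!highlightsWildcards) {
                return OffsetSource.POSTINGS;
            }
            // Hybrid: "light" term vectors supply the document's terms,
            // postings supply the offsets. Without term vectors, assume
            // wildcard matching has to fall back to re-analysis.
            return hasTermVectors ? OffsetSource.POSTINGS_WITH_TERM_VECTORS
                                  : OffsetSource.ANALYSIS;
        }
        if (hasTermVectors && termVectorsHaveOffsets) {
            return OffsetSource.TERM_VECTORS;
        }
        // Nothing indexed carries offsets: re-analyze the stored text.
        return OffsetSource.ANALYSIS;
    }
}
```

Keeping this choice in one place is what lets the rest of the highlighting code stay the same regardless of where the offsets come from.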



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


