lucene-java-user mailing list archives

From David Smiley <>
Subject Re: Highlighting and delineating Passages (fragmenting)
Date Tue, 30 May 2017 14:49:05 GMT
On Tue, May 30, 2017 at 9:25 AM Dawid Weiss <> wrote:

> > #2 & #3 is the same requirement; you elaborate on #2 with more detail in
> > #3.  The UH can't currently do this; but with the OH (original Highlighter)
> > you can but it appears somewhat awkward.  See SimpleSpanFragmenter.  I had
> > said it was easy but I was mistaken; I'm getting rustier on the OH.
> Well, the requirement here is that we do want the context of a hit and
> "broaden" it to roughly X characters (total). I do see that OH does
> have something like this with regexp fragmenter (slop factor), but I
> hoped this should be somewhat easier. I just spent an hour or so
> trying to tune it, but without much success.

What you don't see in regex fragmenter but is critical and found in
SimpleSpanFragmenter is access to the queryScorer to access the
WeightedSpanTerm which yields the positions of the actual matched words.

> > With the original Highlighter, you get TextFragment instances which
> > contain the textStartPos & textEndPos.  You can use that info to
> > conditionally add ellipsis.
> Yup, I realize that. I just wanted something that'd do it out of the
> box in Solr because I didn't want to add custom code to the
> distribution/core. Sigh.

I think it'd be a nice option for the UH's DefaultPassageFormatter to add
ellipsis at the boundaries.  You could file a patch or just a feature
request in JIRA.
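
To make the conditional-ellipsis idea concrete, here is a minimal sketch in
plain Java. It assumes you already have a fragment's start and end offsets
into the stored text (for example from textStartPos / textEndPos). The class
and method names are hypothetical, not part of Lucene, and this is not how
DefaultPassageFormatter currently behaves.

```java
final class EllipsisFormatter {
  /**
   * Returns the fragment text[startPos, endPos), adding an ellipsis on each
   * side where the fragment does not reach the edge of the full text.
   * Hypothetical helper; offsets are assumed to be valid for fullText.
   */
  static String withEllipsis(String fullText, int startPos, int endPos) {
    StringBuilder sb = new StringBuilder();
    if (startPos > 0) {
      sb.append("... "); // text was cut off before the fragment
    }
    sb.append(fullText, startPos, endPos);
    if (endPos < fullText.length()) {
      sb.append(" ..."); // text continues after the fragment
    }
    return sb.toString();
  }
}
```

A fragment covering the whole field value comes back unchanged, so short
fields never get spurious ellipses.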

> > With the original Highlighter, you can easily do this by providing the
> > stored text.  When you create the QueryScorer, use "null" for field name
> > to highlight all query fields.  The UH can do this as well by highlighting
> > the fields that are stored, and call setFieldMatcher to provide a
> > Predicate that always returns true.
> Wouldn't this be equivalent to requireFieldMatch=false?


> It's not
> exactly what I had in mind -- I don't want to highlight across all
> fields, I want to highlight those that actually contributed to the
> document being selected. Imagine the following:
> { a: "foo bar",
>   b: "foo baz",
>   c: "foo bat" }
> Let's say "a" and "b" are copied to the sink field (default search
> field), but "c" is not. The highlighter is asked to highlight all
> fields. For a query: "foo" it should return a highlight on "a" and
> "b", but not "c". On the other hand, a query "c:foo" should only
> highlight "c". In other words -- the user should clearly see which
> fields actually contributed to the document being part of the search
> result. requireFieldMatch=false is a really crude cannon to solve
> this.

Sure I understand.  With the OH this may not be possible.  With the UH, you
could have a more selective predicate.  On the Solr side you'd need to
devise a hook to make this configurable.  See & SOLR-1105 for a rather
different approach.
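
As a rough illustration of what a "more selective predicate" could look like,
here is a sketch in plain Java. It assumes you know which source fields are
copied into which sink field; the copySources map and the class name are
hypothetical, and wiring this into the UH (per-field matcher) or Solr is left
out.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class ContributingFieldMatcher {
  /**
   * Builds a matcher for one highlighted field. A query term's field is
   * accepted if it is the highlighted field itself, or a sink field that the
   * highlighted field is copied into.
   *
   * @param highlightedField the stored field being highlighted
   * @param copySources sink field -> source fields copied into it
   *                    (hypothetical config, e.g. derived from copyField rules)
   */
  static Predicate<String> forField(
      String highlightedField, Map<String, Set<String>> copySources) {
    return queryField ->
        queryField.equals(highlightedField)
            || copySources
                .getOrDefault(queryField, Collections.emptySet())
                .contains(highlightedField);
  }
}
```

With "a" and "b" copied into a sink field "text", a term on "text" would be
accepted when highlighting "a" or "b" but not "c", while a term on "c" would
only be accepted when highlighting "c".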

> > Yeah I already explained why your snippet-centering requirement simply
> > can't be met with the UH.
> Thanks, I thought so. We actually have a custom highlighter (unrelated
> to Solr) in our commercial product that works on a slightly different
> basis than what can be found in Lucene (I think). The pipeline there
> is as follows:
> 1) determine "highlight" offset ranges (from, to, type). Highlight
> "types" can be different so that, for example, one can highlight two
> queries at once (and they can overlap in all kinds of ways).
> 2) process highlight ranges so that they're hierarchically nested
> (split non-tree-like overlaps into hierarchical descents). This
> permits emitting easier html markup later on.
> 3) expand each highlight range to fit certain criteria (typically the
> desired length of the snippet), this expansion here uses a break
> iterator (on words) and respects certain hard limits (like value
> boundaries for multivalue fields);
> 4) score each such expanded range; the scoring formula checks if there
> are any other highlights that fall within the same window; if so, they
> receive a higher score. This results in multi-term matches typically
> ending up at the top of the scoring list.
> 5) emit the best scoring ranges, marking highlights properly.
> We actually use UnifiedHighlighter for the first step above, the rest
> is custom. It can be used to pretty much highlight anything since the
> inputs are the text itself and the ranges to highlight (offsets +
> type). Note it doesn't solve the problem of the default field
> highlighting -- this is something that'd have to be addressed
> separately, but it's been working for us fairly well in practice.
> I'd be glad to contribute this code back to Lucene, but it's kind of
> detached from the infrastructure and it'd require some work to
> integrate. :(
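
For readers following along, steps 3 and 4 of the pipeline quoted above might
look roughly like the sketch below. This is not the code from that product;
it is a simplified illustration in plain Java using java.text.BreakIterator.
The Range class, method names, and the scoring rule (count of highlights
inside the window) are assumptions, and hard limits such as multivalue
boundaries are not handled.

```java
import java.text.BreakIterator;
import java.util.List;
import java.util.Locale;

final class SnippetWindows {
  /** Half-open character range [from, to). Hypothetical helper type. */
  static final class Range {
    final int from;
    final int to;

    Range(int from, int to) {
      this.from = from;
      this.to = to;
    }
  }

  /** Step 3: grow a hit to roughly targetLen chars, centered on the hit. */
  static Range expand(String text, Range hit, int targetLen) {
    int extra = Math.max(0, targetLen - (hit.to - hit.from));
    int from = Math.max(0, hit.from - extra / 2);
    int to = Math.min(text.length(), hit.to + (extra - extra / 2));

    BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
    words.setText(text);
    // Shrink to the nearest word boundaries so no word is cut in half,
    // but never shrink past the hit itself.
    if (!words.isBoundary(from)) {
      from = Math.min(words.following(from), hit.from);
    }
    if (!words.isBoundary(to)) {
      to = Math.max(words.preceding(to), hit.to);
    }
    return new Range(from, to);
  }

  /** Step 4: score a window by how many highlights lie entirely inside it. */
  static int score(Range window, List<Range> highlights) {
    int count = 0;
    for (Range h : highlights) {
      if (h.from >= window.from && h.to <= window.to) {
        count++;
      }
    }
    return count;
  }
}
```

Emitting the best windows (step 5) would then be a matter of sorting the
expanded ranges by score and formatting the top few.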

Interesting.  Your strategy is based on the notion of highlight offset
ranges that might overlap.  Are the offset ranges for span query ranges
(including simple phrases)?  Presently the UH's PhraseHelper is designed in
such a way that it filters individual terms' positions, and thus the
resulting OffsetsEnum instances yield offset windows that are only for each
term's offsets instead of being for entire query spans.  Tim and I
discussed the idea of future improvements to redo this so that it'd be span
based, which is kinda interrelated with another TODO on more accurate
phrase highlighting since there are some rare but possible holes in the
current approach:

Or are the overlaps coming from passage offset ranges from separate queries
to the same content?  That I could understand better based on everything
you said.  I'm not sure how your code could be contributed in a way that
fits in with everything else; it seems like a very specialized case.

Since you are already using the UH, and it's not impossible to do the
centered passage thing, you could go that route.  You need to return your
own FieldHighlighter impls so that you can override highlightOffsetsEnums.
At the Solr layer you can subclass SolrExtendedUnifiedHighlighter.  It's
deliberate that these things are extensible; I've seen users need to do
this and it's why we have a TestUnifiedHighlighterExtensibility test to
help us keep this extensible.

~ David
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
