lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Highlighting and delineating Passages (fragmenting)
Date Tue, 30 May 2017 13:25:21 GMT
> #2 & #3 is the same requirement; you elaborate on #2 with more detail in #3.
> The UH can't currently do this; but with the OH (original Highlighter) you
> can but it appears somewhat awkward.  See SimpleSpanFragmenter.  I had said
> it was easy but I was mistaken; I'm getting rustier on the OH.

Well, the requirement here is that we want to take a hit and "broaden"
its context to roughly X characters (total). I do see that the OH has
something like this with the regexp fragmenter (slop factor), but I
hoped this would be somewhat easier. I spent an hour or so trying to
tune it, without much success.

> With the original Highlighter, you get TextFragment instances which contain
> the textStartPos & textEndPos.  You can use that info to conditionally add
> ellipsis.

Yup, I realize that. I just wanted something that'd do it out of the
box in Solr because I didn't want to add custom code to the
distribution/core. Sigh.
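
For reference, the conditional-ellipsis logic itself is tiny. Below is a
minimal sketch that does not depend on Lucene: the start/end parameters
stand in for the fragment offsets mentioned above (textStartPos and
textEndPos), and the class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not part of Lucene or Solr.
final class EllipsisSketch {
    /**
     * Returns the fragment text[start, end), prefixed and/or suffixed with
     * "..." when the fragment does not reach the start or end of the value.
     */
    static String withEllipsis(String text, int start, int end) {
        if (start < 0 || end > text.length() || start > end) {
            throw new IllegalArgumentException("bad fragment offsets");
        }
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("... ");          // fragment starts mid-value
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append(" ...");          // fragment ends mid-value
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(withEllipsis(text, 4, 19));
        System.out.println(withEllipsis(text, 0, 9));
    }
}
```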

> With the original Highlighter, you can easily do this by providing the
> stored text.  When you create the QueryScorer, use "null" for field name to
> highlight all query fields.  The UH can do this as well by highlighting the
> fields that are stored, and call setFieldMatcher to provide a Predicate that
> always return true.

Wouldn't this be equivalent to requireFieldMatch=false? It's not
exactly what I had in mind -- I don't want to highlight across all
fields, I want to highlight those that actually contributed to the
document being selected. Imagine the following:

{ a: "foo bar",
  b: "foo baz",
  c: "foo bat" }

Let's say "a" and "b" are copied to the sink field (default search
field), but "c" is not. The highlighter is asked to highlight all
fields. For a query: "foo" it should return a highlight on "a" and
"b", but not "c". On the other hand, a query "c:foo" should only
highlight "c". In other words -- the user should clearly see which
fields actually contributed to the document being part of the search
result. requireFieldMatch=false is far too crude a tool to solve
this.
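
To make the expected behavior concrete, here is a toy sketch of the
field selection I have in mind. It is not how Solr or either
highlighter works today; it assumes a single-clause query and that the
set of fields copied into the sink field is known. All names are
hypothetical.

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the desired semantics; not a Solr API.
final class ContributingFieldsSketch {
    /**
     * @param queriedField field named in the query, or null for the default field
     * @param sinkField    name of the default search (sink) field
     * @param copiedToSink stored fields whose content is copied into the sink
     * @return the stored fields that should be considered for highlighting
     */
    static Set<String> fieldsToHighlight(String queriedField,
                                         String sinkField,
                                         Set<String> copiedToSink) {
        if (queriedField == null || queriedField.equals(sinkField)) {
            // A match on the sink can only come from one of its sources.
            return new TreeSet<>(copiedToSink);
        }
        // An explicit field query highlights that field only.
        return Collections.singleton(queriedField);
    }
}
```

With "a" and "b" copied to the sink, a query "foo" yields {a, b} and
"c:foo" yields {c}; the highlighter would then still have to check
which of those fields actually contain a match.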

> Yeah I already explained why your snippet-centering requirement simply can't
> be met with the UH.

Thanks, I thought so. We actually have a custom highlighter (unrelated
to Solr) in our commercial product that works on a slightly different
basis than what can be found in Lucene (I think). The pipeline there
is as follows:

1) determine "highlight" offset ranges (from, to, type). Highlight
"types" can be different so that, for example, one can highlight two
queries at once (and they can overlap in all kinds of ways).
2) process highlight ranges so that they're hierarchically nested
(split non-tree-like overlaps into hierarchical descents). This
makes it easier to emit HTML markup later on.
3) expand each highlight range to fit certain criteria (typically the
desired length of the snippet). The expansion uses a break iterator
(on words) and respects certain hard limits (like value boundaries
for multivalue fields).
4) score each such expanded range; the scoring formula checks if there
are any other highlights that fall within the same window; if so, the
range receives a higher score. This results in multi-term matches
typically ending up at the top of the scoring list.
5) emit the best scoring ranges, marking highlights properly.
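
Steps 3 to 5 above can be sketched roughly as follows. This is not our
production code, just a minimal illustration using java.text.BreakIterator.
It assumes the highlights passed in do not overlap (step 2 would deal
with that), uses the whole text as the hard limit, scores a window by
simply counting the highlights inside it, and marks hits with <b> tags.
All class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; none of these names come from Lucene or Solr.
final class SnippetPipelineSketch {
    /** A highlight range [from, to) plus a caller-defined type (step 1 output). */
    static final class Range {
        final int from, to;
        final String type;
        Range(int from, int to, String type) { this.from = from; this.to = to; this.type = type; }
    }

    /** Step 3: grow r on word boundaries to roughly targetLen, staying inside [lo, hi). */
    static Range expand(String text, Range r, int targetLen, int lo, int hi) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int from = r.from, to = r.to;
        boolean grew = true;
        while (to - from < targetLen && grew) {
            grew = false;
            int next = words.following(to);        // next boundary to the right
            if (next != BreakIterator.DONE && next <= hi) { to = next; grew = true; }
            int prev = words.preceding(from);      // next boundary to the left
            if (to - from < targetLen && prev != BreakIterator.DONE && prev >= lo) {
                from = prev;
                grew = true;
            }
        }
        return new Range(from, to, r.type);
    }

    /** Step 4: a window scores one point per highlight that lies fully inside it. */
    static int score(Range window, List<Range> highlights) {
        int n = 0;
        for (Range h : highlights) {
            if (h.from >= window.from && h.to <= window.to) n++;
        }
        return n;
    }

    /** Step 5 (markup): wrap every highlight inside the window in <b>...</b>. */
    static String render(String text, Range window, List<Range> highlights) {
        List<Range> inside = new ArrayList<>();
        for (Range h : highlights) {
            if (h.from >= window.from && h.to <= window.to) inside.add(h);
        }
        inside.sort((Range a, Range b) -> Integer.compare(a.from, b.from));
        StringBuilder sb = new StringBuilder();
        int pos = window.from;
        for (Range h : inside) {
            if (h.from < pos) continue;            // overlapping hit; skipped in this sketch
            sb.append(text, pos, h.from).append("<b>").append(text, h.from, h.to).append("</b>");
            pos = h.to;
        }
        return sb.append(text, pos, window.to).toString();
    }

    /** Steps 3-5 combined: expand every hit, rank the windows, emit the best ones. */
    static List<String> bestSnippets(String text, List<Range> highlights, int targetLen, int max) {
        List<Range> windows = new ArrayList<>();
        for (Range h : highlights) {
            windows.add(expand(text, h, targetLen, 0, text.length()));
        }
        // Highest score first; List.sort is stable, so ties keep input order.
        windows.sort((Range a, Range b) -> Integer.compare(score(b, highlights), score(a, highlights)));
        List<String> out = new ArrayList<>();
        for (Range w : windows.subList(0, Math.min(max, windows.size()))) {
            out.add(render(text, w, highlights));
        }
        return out;
    }
}
```

A window that happens to cover two hits outranks a window covering one,
which is the effect described in step 4. A real implementation would
also want to deduplicate windows that end up covering the same hits.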

We actually use UnifiedHighlighter for the first step above; the rest
is custom. The custom part can highlight pretty much anything, since
its only inputs are the text itself and the ranges to highlight
(offsets + type). Note that it doesn't solve the problem of default
field highlighting -- that would have to be addressed separately --
but it's been working fairly well for us in practice.
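
Step 2 of the pipeline (turning arbitrary overlaps into properly nested
ranges) is the least obvious part, so here is one possible way to do
it. This is a sketch of the general idea rather than our actual code:
ranges are plain int[]{from, to, type} triples, and a range that
straddles the end of an enclosing range is cut at that point. The names
are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; ranges are int[]{from, to, type} with 'to' exclusive.
final class RangeNestingSketch {
    /**
     * Splits partially overlapping ranges so that any two output ranges are
     * either disjoint or one fully contains the other. Output is ordered by
     * start offset, outer ranges before the ranges nested inside them.
     */
    static List<int[]> nest(List<int[]> ranges) {
        // Earlier start first; for equal starts, the longer (outer) range first.
        PriorityQueue<int[]> queue = new PriorityQueue<>((int[] a, int[] b) ->
                a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(b[1], a[1]));
        queue.addAll(ranges);

        Deque<int[]> open = new ArrayDeque<>();   // currently enclosing ranges
        List<int[]> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            int[] r = queue.poll();
            // Close every range that ends at or before the start of r.
            while (!open.isEmpty() && open.peek()[1] <= r[0]) {
                open.pop();
            }
            if (!open.isEmpty() && r[1] > open.peek()[1]) {
                // r sticks out of its parent: keep the inner part, requeue the rest.
                int cut = open.peek()[1];
                queue.add(new int[] {cut, r[1], r[2]});
                r = new int[] {r[0], cut, r[2]};
            }
            open.push(r);
            out.add(r);
        }
        return out;
    }
}
```

For example, [0,10) of type 0 and [5,15) of type 1 come out as [0,10),
[5,10) and [10,15), where the last two keep type 1. Each output range
can then be emitted as a single open/close tag pair.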

I'd be glad to contribute this code back to Lucene, but it's kind of
detached from the infrastructure and it'd require some work to
integrate. :(

Dawid
