lucene-java-user mailing list archives

From David Smiley <david.w.smi...@gmail.com>
Subject Re: Highlighting and delineating Passages (fragmenting)
Date Tue, 30 May 2017 12:54:26 GMT
Looks like you should use the original Highlighter (OH) until requirements
#2 and #3 can be met by the UnifiedHighlighter (UH).  Apart from #2 and #3,
the UH can handle all of these requirements, and the OH can handle all of
them.

On Sat, May 27, 2017 at 6:08 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Thanks for your explanation, David.
>
> I actually found working with all Lucene highlighters pretty
> difficult. I have a few requirements which seemed deceptively simple:
>
> 1) highlight query hit regions (phrase, fuzzy, terms);
>

They all do this (not counting the now-removed PostingsHighlighter).


> 2) try to organise the resulting snippets to visually "center" the hit
> regions so that the context of the hit is visible,
> 3) keep the snippet limited to ~x characters (this means breaking on
> word boundaries, typically, but keeping the overall length of the
> snippet close to x).
>

#2 and #3 are the same requirement; #3 just elaborates on #2 in more detail.
The UH can't currently do this.  The OH (original Highlighter) can, though
it appears somewhat awkward; see SimpleSpanFragmenter.  I had said it was
easy, but I was mistaken; I'm getting rusty on the OH.
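
To make #2/#3 concrete, here is a rough sketch of the windowing logic on
its own, independent of either highlighter.  It is not Lucene API:
CenteredSnippet and window() are made-up names, and it assumes you already
have the character offsets of one hit.  It uses only java.text.BreakIterator
to snap to word boundaries.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical helper, NOT part of Lucene: picks a window around one hit. */
final class CenteredSnippet {
    private CenteredSnippet() {}

    /**
     * Returns {start, end} offsets of a window of at most maxLen chars
     * (or the hit itself, if the hit is longer) around [hitStart, hitEnd),
     * shrunk inward so that no word is cut in half.
     */
    static int[] window(String text, int hitStart, int hitEnd, int maxLen) {
        int hitLen = hitEnd - hitStart;
        int slack = Math.max(0, maxLen - hitLen);
        // Split the spare room evenly before and after the hit.
        int start = Math.max(0, hitStart - slack / 2);
        int end = Math.min(text.length(), start + Math.max(maxLen, hitLen));
        // If the window ran into the end of the text, extend it to the left.
        start = Math.max(0, Math.min(start, end - maxLen));

        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        // Move each edge inward to the nearest word boundary, never past the hit.
        if (start > 0 && !words.isBoundary(start)) {
            start = Math.min(words.following(start), hitStart);
        }
        if (end < text.length() && !words.isBoundary(end)) {
            end = Math.max(words.preceding(end), hitEnd);
        }
        // Drop whitespace left dangling at either edge.
        while (start < hitStart && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > hitEnd && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }
}
```

Because the edges only ever move inward, the result stays within maxLen
even when the text has very long runs without a sentence break; the price
is that the snippet may come out noticeably shorter than maxLen.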


> 4) add visual cues whether the snippet is part of a larger text
> (ellipsis). This should be done intelligently -- if a snippet is
> actually the whole field or start/ends on the field boundary no
> ellipsis should be added.
>

With the original Highlighter, you get TextFragment instances which contain
the textStartPos & textEndPos.  You can use that info to add an ellipsis
only on the side(s) where the fragment stops short of the field boundary.
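
A minimal sketch of that conditional logic, with the offsets passed in as
plain ints.  Ellipsis and decorate() are made-up names, not Lucene API, and
how you read textStartPos/textEndPos off a TextFragment depends on your
Lucene version, so that part is deliberately left out.

```java
/** Hypothetical helper, NOT part of Lucene. */
final class Ellipsis {
    private Ellipsis() {}

    /**
     * Wraps a snippet in "..." markers only where it does not touch the
     * start or end of the field it was cut from.
     */
    static String decorate(String snippet, int textStartPos, int textEndPos,
                           int fieldLength) {
        StringBuilder sb = new StringBuilder();
        if (textStartPos > 0) {
            sb.append("... ");      // text was cut off before the snippet
        }
        sb.append(snippet);
        if (textEndPos < fieldLength) {
            sb.append(" ...");      // text continues after the snippet
        }
        return sb.toString();
    }
}
```

A snippet that spans the whole field (start 0, end equal to the field
length) comes back unchanged.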


> 5) For performance reasons we typically have a single copy-to field
> that is used as the default field for the query parser. But for the
> user interface needs we'd have to go back and try to highlight the
> original fields that formed this content. This is probably the most
> difficult and I didn't expect it to be solved with existing
> highlighters, but it'd be a great thing to have eventually.
>

With the original Highlighter, you can easily do this by providing the
stored text.  When you create the QueryScorer, use "null" for the field name
to highlight all query fields.  The UH can do this as well: highlight the
fields that are stored, and call setFieldMatcher with a Predicate that
always returns true.


> Some of the above are possible with existing highlighters, some are
> not. Having a limited snippet length and keeping word boundary breaks
> turned out to be most confusing to me with unified highlighter, for
> example. I can't use the sentence break iterator because the text in
> question occasionally has super-long word sequences that result in
> snippets that are enormous.
>

Yeah, I already explained why your snippet-centering requirement simply
can't be met with the UH.  Perhaps it would help if the UH documented its
overall algorithm a bit more, even if all the documentation in the world
won't enable it to meet your requirement.  At least it would help you know
sooner whether it can or not :-)

~ David
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
