lucene-java-user mailing list archives

From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: Highlighting and delineating Passages (fragmenting)
Date Sat, 27 May 2017 10:08:09 GMT
Thanks for your explanation, David.

I actually found working with all Lucene highlighters pretty
difficult. I have a few requirements which seemed deceptively simple:

1) highlight query hit regions (phrase, fuzzy, terms);
2) try to organise the resulting snippets to visually "center" the hit
regions so that the context of the hit is visible;
3) keep the snippet limited to ~x characters (this typically means
breaking on word boundaries while keeping the overall length of the
snippet close to x);
4) add visual cues showing whether the snippet is part of a larger text
(ellipsis). This should be done intelligently -- if a snippet is
actually the whole field, or starts/ends on a field boundary, no
ellipsis should be added;
5) For performance reasons we typically have a single copy-to field
that is used as the default field for the query parser. But for the
user interface needs we'd have to go back and try to highlight the
original fields that formed this content. This is probably the most
difficult and I didn't expect it to be solved with existing
highlighters, but it'd be a great thing to have eventually.
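
To make requirements 2-4 concrete, here is a minimal sketch that uses only
java.text.BreakIterator and no Lucene highlighter API. The class and method
names (SnippetWindow, around) are hypothetical, and the hit offsets are
assumed to be already known from somewhere else:

```java
import java.text.BreakIterator;
import java.util.Locale;

class SnippetWindow {
    /**
     * Returns a snippet of at most about maxLen characters that is roughly
     * centered on the hit [hitStart, hitEnd) and cut on word boundaries.
     * An ellipsis is added only on a side where text was actually dropped.
     */
    static String around(String text, int hitStart, int hitEnd, int maxLen) {
        if (text.length() <= maxLen) {
            return text; // the snippet is the whole field: no ellipsis
        }
        // Split the characters left over after the hit evenly on both sides.
        int slack = Math.max(0, maxLen - (hitEnd - hitStart));
        int start = Math.max(0, hitStart - slack / 2);
        int end = Math.min(text.length(), start + maxLen);
        start = Math.max(0, end - maxLen); // re-balance near the end of the field

        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        // Snap inwards to word boundaries so the window does not grow.
        if (start > 0 && !words.isBoundary(start)) {
            start = words.following(start);
        }
        if (end < text.length() && !words.isBoundary(end)) {
            end = words.preceding(end);
        }
        // Never cut into the hit itself, even if that exceeds maxLen.
        start = Math.min(start, hitStart);
        end = Math.max(end, hitEnd);

        String body = text.substring(start, end).trim();
        return (start > 0 ? "... " : "") + body
            + (end < text.length() ? " ..." : "");
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog near the river bank";
        int hit = text.indexOf("jumps");
        // prints "... fox jumps over ..."
        System.out.println(around(text, hit, hit + "jumps".length(), 21));
    }
}
```

Snapping inwards keeps the snippet at or below the limit; snapping outwards
would show more context at the cost of occasionally overshooting x.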

Some of the above are possible with existing highlighters, some are
not. Keeping the snippet length limited while breaking on word
boundaries turned out to be the most confusing part for me with the
unified highlighter, for example. I can't use the sentence break
iterator because the text in question occasionally has super-long word
sequences that result in enormous snippets.
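
The sentence-length problem can be illustrated outside of Lucene. The sketch
below is not how the unified highlighter fragments text; it only shows one
possible fallback, under the assumption that a hard cap per passage is
acceptable: split on sentences first, then cut any over-long "sentence" at
the last word boundary that still fits. CappedPassages and split are
hypothetical names:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CappedPassages {
    /**
     * Splits text into sentence passages. Any passage longer than maxLen
     * is cut further at word boundaries so no piece exceeds maxLen.
     */
    static List<String> split(String text, int maxLen) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        sentences.setText(text);
        words.setText(text);

        List<String> passages = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; end = sentences.next()) {
            while (end - start > maxLen) {
                // Last word boundary at or before start + maxLen.
                int cut = words.preceding(start + maxLen + 1);
                if (cut <= start) {
                    cut = start + maxLen; // one token longer than maxLen: hard cut
                }
                passages.add(text.substring(start, cut));
                start = cut;
            }
            if (end > start) {
                passages.add(text.substring(start, end));
            }
            start = end;
        }
        return passages;
    }

    public static void main(String[] args) {
        // No sentence terminator, so the sentence iterator alone yields one passage.
        String text = "alpha beta gamma delta epsilon zeta eta theta";
        for (String p : split(text, 12)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

Like any BreakIterator-based approach, this only looks at the content; it
has no idea where the highlighted words are, so it does not center anything.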

I'll keep thinking.

Dawid

On Fri, May 26, 2017 at 3:57 PM, David Smiley <david.w.smiley@gmail.com> wrote:
> I was recently asked if/how the UnifiedHighlighter can return a Passage
> centered around the highlighted words.  I'm responding to a wider audience
> (java-user list, ...).
>
> Each highlighter implementation fragments the content into passages (with
> highlights) using a different algorithm.
>
> The UnifiedHighlighter (and the now defunct PostingsHighlighter from which it
> derives) fragments the content to create passages entirely based on a
> java.text.BreakIterator.  BreakIterator only sees/knows about the content
> (it's initialized with it via setText(String)); it doesn't know where
> highlighted words are.  This is why the default UH BreakIterator impl is a
> sentence based one and most people probably will let it be.  Given how the
> UH actually uses the BreakIterator, you can create a custom one that is only
> designed to work with this highlighter that makes some assumptions of how
> it's used, resulting in some fragmentation that isn't so rigidly based on
> the content.  The LengthGoalBreakIterator is such a BreakIterator.  But it
> can only "see" the first highlighted word of a passage and make
> fragmentation decisions based on that alone.
>
> The other two highlighters (the original Highlighter and I think the
> FastVectorHighlighter) are more flexible in this regard; they have their own
> abstraction that allows for Passages to be formed sensitive to where exactly
> the highlighted words are.  Thus you could fairly easily achieve a goal of
> say, 10 words before the first highlighted word, and highlight more words
> within 10 words of each other until the next is too far away, then 10 more
> trailing words with the original Highlighter.  I suspect
> FastVectorHighlighter can do this too but its API confuses me.  The
> FastVectorHighlighter also uses a BreakIterator in
> BreakIteratorBoundaryScanner but its use is entirely different from how the
> UnifiedHighlighter uses one.
>
> Perhaps the UnifiedHighlighter should be enhanced to make more flexible
> fragmentation algorithms possible.  Today you'd need to override
> FieldHighlighter.highlightOffsetsEnums which is a lot to ask of anyone; even
> doing that is annoying and then re-implementing that method is onerous since
> it's so complex -- it's really the heart of the UH.  The UH could add an
> entirely new abstraction apart from BreakIterators (with a BI based impl
> available), or perhaps an optional marker interface for UH-aware
> BreakIterators.  The former (a new abstraction) would be cleaner, and might
> also remove a wart in the API due to the statefulness of BreakIterators.
> It's also kinda hard to write a BI correctly. I've implemented a few already
> and I know.  It's an old API.
>
> ~ David
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com
