lucene-java-user mailing list archives

From Evert Wagenaar <evert.wagen...@gmail.com>
Subject Re: Highlighting and delineating Passages (fragmenting)
Date Sat, 27 May 2017 20:05:12 GMT
I always assumed this was the default behaviour of the
Lucene TermHighlighter, but I may be thinking of an older version.
I did find that there are major differences between Lucene and Solr,
though, and I have similar problems with Solr.

Best regards,

Evert Wagenaar

http://www.evertwagenaar.com/

On Sat, 27 May 2017 at 12:08, Dawid Weiss <dawid.weiss@gmail.com> wrote:

> Thanks for your explanation, David.
>
> I actually found working with all Lucene highlighters pretty
> difficult. I have a few requirements which seemed deceptively simple:
>
> 1) highlight query hit regions (phrase, fuzzy, terms);
> 2) try to organise the resulting snippets to visually "center" the hit
> regions so that the context of the hit is visible;
> 3) keep the snippet limited to ~x characters (this means breaking on
> word boundaries, typically, but keeping the overall length of the
> snippet close to x);
> 4) add visual cues indicating whether the snippet is part of a larger
> text (ellipsis). This should be done intelligently -- if a snippet is
> actually the whole field, or starts/ends on the field boundary, no
> ellipsis should be added on that side;
> 5) For performance reasons we typically have a single copy-to field
> that is used as the default field for the query parser. But for the
> user interface needs we'd have to go back and try to highlight the
> original fields that formed this content. This is probably the most
> difficult and I didn't expect it to be solved with existing
> highlighters, but it'd be a great thing to have eventually.
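
Requirements 2 to 4 above can be sketched without any Lucene classes. The following is a minimal illustration only, assuming the hit offsets are already known; the class name SnippetSketch, the method center and the whitespace-based word-boundary rule are all hypothetical and are not part of any Lucene highlighter.

```java
public final class SnippetSketch {
    private static final String ELLIPSIS = "...";

    /**
     * Returns a snippet of at most roughly maxChars characters centered on
     * the hit [hitStart, hitEnd). Hypothetical helper, not a Lucene API.
     */
    public static String center(String text, int hitStart, int hitEnd, int maxChars) {
        if (text.length() <= maxChars) {
            return text; // the snippet is the whole field: no ellipsis
        }
        // Split the remaining character budget evenly around the hit.
        int context = Math.max(0, (maxChars - (hitEnd - hitStart)) / 2);
        int start = Math.max(0, hitStart - context);
        int end = Math.min(text.length(), hitEnd + context);

        // Shrink inward to whitespace so no word is cut, but never into the hit.
        while (start > 0 && start < hitStart
                && !Character.isWhitespace(text.charAt(start - 1))) {
            start++;
        }
        while (end < text.length() && end > hitEnd
                && !Character.isWhitespace(text.charAt(end))) {
            end--;
        }

        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append(ELLIPSIS); // text was dropped before the snippet
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append(ELLIPSIS); // text was dropped after the snippet
        }
        return sb.toString();
    }
}
```

For the text "The quick brown fox jumps over the lazy dog near the river bank", a hit on "fox" and maxChars = 21, this returns "...brown fox jumps...". A hit on the leading "The" gets only a trailing ellipsis, because the snippet starts on the field boundary.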
>
> Some of the above are possible with existing highlighters, some are
> not. Keeping the snippet length limited while still breaking on word
> boundaries turned out to be the most confusing part for me with the
> unified highlighter, for example. I can't use the sentence break
> iterator because the text in question occasionally has super-long word
> sequences that result in enormous snippets.
>
> I'll keep thinking.
>
> Dawid
>
> On Fri, May 26, 2017 at 3:57 PM, David Smiley <david.w.smiley@gmail.com>
> wrote:
> > I was recently asked if/how the UnifiedHighlighter can return a Passage
> > centered around the highlighted words.  I'm responding to a wider audience
> > (java-user list, ...).
> >
> > Each highlighter implementation fragments the content into passages (with
> > highlights) using a different algorithm.
> >
> > The UnifiedHighlighter (and the now-defunct PostingsHighlighter from which
> > it derives) fragments the content to create passages entirely based on a
> > java.text.BreakIterator.  BreakIterator only sees/knows about the content
> > (it's initialized with it via setText(string)); it doesn't know where the
> > highlighted words are.  This is why the default UH BreakIterator impl is a
> > sentence-based one, and most people will probably let it be.  Given how the
> > UH actually uses the BreakIterator, you can create a custom one that is
> > designed to work only with this highlighter and makes some assumptions
> > about how it's used, resulting in fragmentation that isn't so rigidly based
> > on the content.  The LengthGoalBreakIterator is such a BreakIterator.  But
> > it can only "see" the first highlighted word of a passage and make
> > fragmentation decisions based on that alone.
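
To make the point about BreakIterator concrete, here is a small sketch using only the JDK. It shows a caller asking a sentence BreakIterator for the passage that contains a given offset; the iterator is handed nothing but the text. The class SentencePassageSketch and the method passageAround are made up for illustration, and this is not the UnifiedHighlighter's actual code.

```java
import java.text.BreakIterator;
import java.util.Locale;

public final class SentencePassageSketch {

    /**
     * Returns the sentence containing the given character offset.
     * Assumes 0 <= offset < content.length(). Hypothetical helper.
     */
    public static String passageAround(String content, int offset) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(content);            // the iterator only ever sees the content
        int end = bi.following(offset); // first boundary strictly after offset
        int start = bi.previous();      // the boundary just before that one
        // The sentence instance keeps trailing whitespace; trim it for display.
        return content.substring(start, end).trim();
    }
}
```

Note that the passage bounds depend only on the text and the offset passed in. Nothing tells the iterator where any other highlighted word sits, so a long sentence yields a long passage regardless of where the hits are.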
> >
> > The other two highlighters (the original Highlighter and, I think, the
> > FastVectorHighlighter) are more flexible in this regard; they have their
> > own abstraction that allows Passages to be formed sensitive to exactly
> > where the highlighted words are.  Thus, with the original Highlighter, you
> > could fairly easily achieve a goal of, say, 10 words before the first
> > highlighted word, then more highlighted words as long as each is within 10
> > words of the previous one, then 10 more trailing words.  I suspect the
> > FastVectorHighlighter can do this too, but its API confuses me.  The
> > FastVectorHighlighter also uses a BreakIterator in
> > BreakIteratorBoundaryScanner, but its use is entirely different from how
> > the UnifiedHighlighter uses one.
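
The "10 words before, merge hits within 10 words, 10 words after" goal is mostly arithmetic on word positions. The sketch below shows just that arithmetic, assuming the hit positions are already available as sorted word indices. HitWindowSketch and windows are hypothetical names; this does not use or imitate the original Highlighter's own fragmenting API.

```java
import java.util.ArrayList;
import java.util.List;

public final class HitWindowSketch {

    /**
     * Groups hits into passages given as [fromWord, toWord) index ranges.
     * hitWords must be sorted ascending; margin is the number of context
     * words (10 in the example from the text). Hypothetical helper.
     */
    public static List<int[]> windows(int[] hitWords, int totalWords, int margin) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < hitWords.length) {
            int first = hitWords[i];
            int last = first;
            // Absorb following hits while each is within `margin` words
            // of the previous hit in the same passage.
            while (i + 1 < hitWords.length && hitWords[i + 1] - last <= margin) {
                last = hitWords[++i];
            }
            int from = Math.max(0, first - margin);           // leading context
            int to = Math.min(totalWords, last + 1 + margin); // trailing context
            out.add(new int[] {from, to});
            i++;
        }
        return out;
    }
}
```

With hits at word positions 12, 15 and 40 in a 45-word field and a margin of 10, this yields two passages: words [2, 26) and words [30, 45).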
> >
> > Perhaps the UnifiedHighlighter should be enhanced to make more flexible
> > fragmentation algorithms possible.  Today you'd need to override
> > FieldHighlighter.highlightOffsetsEnums, which is a lot to ask of anyone;
> > even doing that is annoying, and then re-implementing that method is
> > onerous since it's so complex -- it's really the heart of the UH.  The UH
> > could add an entirely new abstraction apart from BreakIterators (with a
> > BI-based impl available), or perhaps an optional marker interface for
> > UH-aware BreakIterators.  The former (a new abstraction) would be cleaner,
> > and might also remove a wart in the API due to the statefulness of
> > BreakIterators.  It's also kinda hard to write a BI correctly.  I've
> > implemented a few already and I know.  It's an old API.
> >
> > ~ David
> >
> > --
> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> > http://www.solrenterprisesearchserver.com
>
