lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-1822) FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive
Date Wed, 23 Jan 2013 09:34:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13560512#comment-13560512 ]

Simon Willnauer commented on LUCENE-1822:
-----------------------------------------

Hey Koji, 

I just traced a change in behavior back to this issue. I think this is a major change in
runtime behavior / backwards compatibility, but I only see it listed as a bugfix with almost
no information attached in CHANGES.txt. I think we should really document this change in
CHANGES.txt, since a lot of users might be affected. Don't get me wrong: I think this is a
very good change that makes the behavior more intuitive, but I spent a long time figuring
out why my tests failed.
                
> FastVectorHighlighter: SimpleFragListBuilder hard-coded 6 char margin is too naive
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-1822
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1822
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 2.9
>         Environment: any
>            Reporter: Alex Vigdor
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 4.1, 5.0
>
>         Attachments: LUCENE-1822.patch, LUCENE-1822.patch, LUCENE-1822.patch, LUCENE-1822-tests.patch
>
>
> The new FastVectorHighlighter performs extremely well, however I've found in testing
> that the window of text chosen per fragment is often very poor, as it is hard coded in
> SimpleFragListBuilder to always select starting 6 characters to the left of the first
> phrase match in a fragment. When selecting long fragments, this often means that there
> is barely any context before the highlighted word, and lots after; even worse, when
> highlighting a phrase at the end of a short text the beginning is cut off, even though
> the entire phrase would fit in the specified fragCharSize. For example, highlighting
> "Punishment" in "Crime and Punishment" returns "e and <b>Punishment</b>" no matter what
> fragCharSize is specified. I am going to attach a patch that improves the text window
> selection by recalculating the starting margin once all phrases in the fragment have
> been identified - this way if a single word is matched in a fragment, it will appear in
> the middle of the highlight, instead of 6 characters from the beginning. This way one
> can also guarantee that the entirety of short texts are represented in a fragment by
> specifying a large enough fragCharSize.
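
To make the difference described above concrete, here is a minimal sketch of the two
windowing strategies. It is not the attached patch and does not use the Lucene API: the
class FragmentWindowSketch and the methods fixedMarginStart, centeredStart and fragment are
hypothetical names chosen for illustration, and the centering rule (split the spare
characters evenly, then clamp to the text bounds) is an assumption about how a recalculated
margin might work.

```java
// Illustrative sketch only: hypothetical names, not SimpleFragListBuilder itself.
public final class FragmentWindowSketch {

    /** Behavior described in the issue: start a fixed margin left of the match. */
    public static int fixedMarginStart(int matchStart, int margin) {
        return Math.max(0, matchStart - margin);
    }

    /**
     * Assumed alternative: center the matched span in the fragment by splitting
     * the spare characters evenly, then clamp the window to the text bounds.
     */
    public static int centeredStart(int matchStart, int matchEnd,
                                    int fragCharSize, int textLength) {
        int matchLength = matchEnd - matchStart;
        int spare = Math.max(0, fragCharSize - matchLength);
        int start = matchStart - spare / 2;
        // If the window would run past the end of the text, shift it left so
        // the unused space becomes leading context instead.
        if (start + fragCharSize > textLength) {
            start = textLength - fragCharSize;
        }
        return Math.max(0, start);
    }

    /** Cuts the fragment of at most fragCharSize characters beginning at start. */
    public static String fragment(String text, int start, int fragCharSize) {
        int end = Math.min(text.length(), start + fragCharSize);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Crime and Punishment";
        String term = "Punishment";
        int matchStart = text.indexOf(term);
        int matchEnd = matchStart + term.length();
        int fragCharSize = 100;

        // Fixed 6-character margin: the beginning of the short text is cut off.
        System.out.println(fragment(text, fixedMarginStart(matchStart, 6), fragCharSize));
        // Centered and clamped: the whole short text fits in the fragment.
        System.out.println(fragment(text,
                centeredStart(matchStart, matchEnd, fragCharSize, text.length()),
                fragCharSize));
    }
}
```

With the fixed margin the first call yields "e and Punishment", as in the report; with the
centered, clamped start the whole of "Crime and Punishment" is returned. Fragment
boundaries move for every match under such a scheme, which is why existing tests that
assert exact fragment text can start failing.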


