lucene-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <>
Subject [jira] [Updated] (LUCENE-1824) FastVectorHighlighter truncates words at beginning and end of fragments
Date Tue, 06 Sep 2011 03:13:10 GMT


Koji Sekiguchi updated LUCENE-1824:

    Attachment: LUCENE-1824.patch

First draft. I introduced a BoundaryScanner interface and two implementations of it,
SimpleBoundaryScanner and BreakIteratorBoundaryScanner.

SimpleBoundaryScanner uses the following default boundary chars:

public static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?', '(', '[', '{',
'\t', '\n'};

SimpleBoundaryScanner uses these characters to find word and sentence boundaries.
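
To illustrate the idea (this is a hypothetical sketch, not the code in the attached patch), a
scanner of this kind can walk outward from a fragment offset until it meets one of the boundary
chars. The class name, the method names, and the MAX_SCAN limit below are assumptions made for
the example; the message does not say whether the patch limits the scan.

```java
import java.util.Set;

// Hypothetical sketch, not the code in the attached patch.
final class SimpleBoundarySketch {
    // The default boundary chars quoted in the message.
    static final Set<Character> BOUNDARY_CHARS =
            Set.of('.', ',', '!', '?', '(', '[', '{', '\t', '\n');

    // Assumed scan limit; the message does not specify one.
    static final int MAX_SCAN = 20;

    // Scan backward from start. Returns the offset just after the nearest
    // boundary char, 0 if the start of the text is reached, or the original
    // start if no boundary is found within MAX_SCAN characters.
    static int findStartOffset(CharSequence text, int start) {
        if (start <= 0 || start > text.length()) return start;
        int offset = start;
        for (int scanned = 0; offset > 0 && scanned < MAX_SCAN; scanned++) {
            if (BOUNDARY_CHARS.contains(text.charAt(offset - 1))) return offset;
            offset--;
        }
        return offset == 0 ? 0 : start;
    }

    // Scan forward from start. Returns the offset of the nearest boundary
    // char, the text length if the end of the text is reached, or the
    // original start if no boundary is found within MAX_SCAN characters.
    static int findEndOffset(CharSequence text, int start) {
        if (start < 0 || start >= text.length()) return start;
        int offset = start;
        for (int scanned = 0; offset < text.length() && scanned < MAX_SCAN; scanned++) {
            if (BOUNDARY_CHARS.contains(text.charAt(offset))) return offset;
            offset++;
        }
        return offset == text.length() ? offset : start;
    }
}
```

With the text "First one. Second sentence here, and more.", findStartOffset(text, 14) stops just
after the '.' at offset 10, and findEndOffset(text, 20) stops at the ',' at offset 31. Because a
plain space is not in the list above, the start offset still points at the space that follows
the period.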

Alternatively, BreakIteratorBoundaryScanner can be used to find character, word, sentence, or
line breaks.
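
As a rough illustration of the BreakIterator approach (again a hypothetical sketch, not the
patch itself), the JDK's java.text.BreakIterator can snap an offset to the nearest break on
either side. The class and method names are assumptions; only the BreakIterator calls are
standard JDK API.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch, not the code in the attached patch.
final class BreakIteratorBoundarySketch {
    // Returns start if it is already a break, else the last break before it.
    static int findStartOffset(BreakIterator bi, String text, int start) {
        if (start <= 0) return 0;
        if (start >= text.length()) return text.length();
        bi.setText(text);
        return bi.isBoundary(start) ? start : bi.preceding(start);
    }

    // Returns start if it is already a break, else the first break after it.
    static int findEndOffset(BreakIterator bi, String text, int start) {
        if (start <= 0) return 0;
        if (start >= text.length()) return text.length();
        bi.setText(text);
        return bi.isBoundary(start) ? start : bi.following(start);
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        String text = "The quick brown fox";
        // Offset 6 falls inside "quick", which spans offsets 4 to 9.
        int s = findStartOffset(words, text, 6);
        int e = findEndOffset(words, text, 6);
        System.out.println(text.substring(s, e));
    }
}
```

Passing BreakIterator.getCharacterInstance, getSentenceInstance, or getLineInstance instead of
the word instance changes the granularity of the break without changing the scanning code.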

I made BaseFragmentsBuilder boundary-aware rather than creating a new FragmentsBuilder such as
a BoundaryAwareFragmentsBuilder. As a result, every FragmentsBuilder is now natively
boundary-aware, as long as it is given an appropriate BoundaryScanner.

I have not touched the tests yet. Because this patch changes fragment boundaries, the existing
tests should fail!

> FastVectorHighlighter truncates words at beginning and end of fragments
> -----------------------------------------------------------------------
>                 Key: LUCENE-1824
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>         Environment: any
>            Reporter: Alex Vigdor
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 4.0
>         Attachments: LUCENE-1824.patch, LUCENE-1824.patch
> FastVectorHighlighter does not take word boundaries into consideration when building
> fragments, so in most cases the first and last words of a fragment are truncated. This
> makes the highlights less legible than they should be. I will attach a patch to
> BaseFragmentsBuilder that resolves this by expanding the start and end boundaries of the
> fragment to the first whitespace character on either side of the fragment, or to the
> beginning or end of the source text, whichever comes first. This significantly improves
> legibility, at the cost of returning slightly more characters than the specified
> fragment size.
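
The whitespace expansion described in the issue can be sketched roughly as follows. This is a
hypothetical illustration, not the reporter's patch; the class and method names are assumptions,
and the fragment is taken to be the half-open range [start, end) of the source text.

```java
// Hypothetical sketch of the whitespace expansion; not the reporter's patch.
final class WhitespaceExpansionSketch {
    // Move start left until it follows a whitespace char or reaches offset 0.
    static int expandStart(CharSequence source, int start) {
        int i = Math.max(0, Math.min(start, source.length()));
        while (i > 0 && !Character.isWhitespace(source.charAt(i - 1))) {
            i--;
        }
        return i;
    }

    // Move end right until it sits on a whitespace char or reaches the text end.
    static int expandEnd(CharSequence source, int end) {
        int i = Math.max(0, Math.min(end, source.length()));
        while (i < source.length() && !Character.isWhitespace(source.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String source = "the quick brown fox";
        // The raw fragment [6, 12) is "ick br", with both words truncated.
        int start = expandStart(source, 6);
        int end = expandEnd(source, 12);
        System.out.println(source.substring(start, end)); // prints "quick brown"
    }
}
```

The expanded fragment here is 11 characters instead of the requested 6, which is the size
trade-off the issue description mentions.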
