lucene-dev mailing list archives

From "Chris Earle (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-6334) Fast Vector Highlighter does not properly span neighboring term offsets
Date Wed, 04 Mar 2015 00:33:04 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Earle updated LUCENE-6334:
--------------------------------
    Description: 
If you use term vectors for fast vector highlighting on a multivalued field and match a
phrase that crosses two values, the highlighter does not highlight the match properly, even
though it _does_ find the correct values to highlight.

A good example of this is when matching source code, where you might have lines like:

{code}
one two three five
two three four
five six five
six seven eight nine eight nine eight nine eight nine eight nine eight nine
eight nine
ten eleven
twelve thirteen
{code}

Matching the phrase "four five" will return:

{code}
two three four
five six five
six seven eight nine eight nine eight nine eight nine eight
eight nine
ten eleven
{code}

However, it does not properly highlight "four" (on the first line) and "five" (on the second
line) _and_ it is returning too many lines, but not all of them.
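
To illustrate why a plain containment check misses this match, here is a minimal sketch. It assumes each value is followed by a one-character gap when offsets are computed; the class and method names are hypothetical and are not part of Lucene. The phrase "four five" covers offsets 29 to 38, which no single value fully contains.

```java
import java.util.ArrayList;
import java.util.List;

public class CrossValueOffsets {
    /**
     * Returns the indexes of the values whose offset range fully contains
     * [matchStart, matchEnd). Hypothetical helper, not a Lucene API.
     */
    static List<Integer> containingValues(String[] values, int matchStart, int matchEnd) {
        List<Integer> hits = new ArrayList<>();
        int fieldEnd = 0;
        for (int i = 0; i < values.length; i++) {
            int fieldStart = fieldEnd;
            // Assumption: one extra offset separates consecutive values.
            fieldEnd += values[i].length() + 1;
            if (matchStart >= fieldStart && matchEnd <= fieldEnd) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] values = {"one two three five", "two three four", "five six five"};
        // "four" starts at 19 + 10 = 29; "five" in the next value ends at 34 + 4 = 38.
        System.out.println(containingValues(values, 29, 38)); // prints []
        // "four" on its own (29 to 33) sits inside the second value.
        System.out.println(containingValues(values, 29, 33)); // prints [1]
    }
}
```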

The problem lies in the [BaseFragmentsBuilder at line 269| https://github.com/apache/lucene-solr/blob/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.java#L269]
because it is not checking for cross-coverage. Here is a possible solution:

{code}
boolean started = toffs.getStartOffset() >= fieldStart;
boolean ended = toffs.getEndOffset() <= fieldEnd;

// existing behavior: the offsets fall entirely within this field value
if (started && ended) {
    toffsList.add(toffs);
    toffsIterator.remove();
}
else if (started) {
    // starts in this value but runs past its end: clip to the end
    toffsList.add(new Toffs(toffs.getStartOffset(), fieldEnd));
    // toffsIterator.remove(); // is this necessary?
}
else if (ended) {
    // starts before this value but ends inside it: clip to the start
    toffsList.add(new Toffs(fieldStart, toffs.getEndOffset()));
    // toffsIterator.remove(); // is this necessary?
}
else if (toffs.getEndOffset() > fieldEnd) {
    // i.e. the toffs spans the whole field
    toffsList.add(new Toffs(fieldStart, fieldEnd));
    // toffsIterator.remove(); // is this necessary?
}
{code}
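
The four branches above amount to clipping the match to the current value's range. A self-contained sketch of that idea follows; it is a variation rather than the proposed patch. `Span` is a hypothetical stand-in for Lucene's `Toffs`, the boundaries 19, 34 and 48 are illustrative, and the sketch adds a no-overlap guard that the snippet above does not have.

```java
public class ClipToValue {
    /** Hypothetical stand-in for a start/end offset pair; not Lucene's Toffs class. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * Returns the part of the match that lies inside [fieldStart, fieldEnd),
     * or null when the match does not overlap that range at all.
     */
    static Span clip(Span match, int fieldStart, int fieldEnd) {
        int start = Math.max(match.start, fieldStart);
        int end = Math.min(match.end, fieldEnd);
        return start < end ? new Span(start, end) : null;
    }

    public static void main(String[] args) {
        Span phrase = new Span(29, 38);           // a match crossing two values
        System.out.println(clip(phrase, 19, 34)); // prints [29,34)
        System.out.println(clip(phrase, 34, 48)); // prints [34,38)
        System.out.println(clip(phrase, 0, 19));  // prints null
    }
}
```

Clipping with `Math.max` and `Math.min` covers the fully contained, start-only, end-only and spanning cases in one expression, and the `start < end` check skips values the match never touches.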

  was:
If you are using term vectors for fast vector highlighting along with a multivalue field while
matching a phrase that crosses two elements, then it will not properly highlight even though
it _properly_ finds the correct values to highlight.

A good example of this is when matching source code, where you might have lines like:

{code}
one two three five
two three four
five six five
six seven eight nine eight nine eight nine eight nine eight nine eight nine
eight nine
ten eleven
twelve thirteen
{code}

Matching the phrase "four five" will return

{code}
two three four
five six five
six seven eight nine eight nine eight nine eight nine eight
eight nine
ten eleven
{code}

However, it does not properly highlight "four" (on the first line) and "five" (on the second
line) _and_ it is returning too many lines, but not all of them.

The problem lies in the [BaseFragmentsBuilder at line 269| https://github.com/apache/lucene-solr/blob/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.java#L269]
because it is not checking for cross-coverage:

{code}
boolean started = toffs.getStartOffset() >= fieldStart;
boolean ended = toffs.getEndOffset() <= fieldEnd;

// existing behavior:
if (started && ended) {
    toffsList.add(toffs);
    toffsIterator.remove();
}
else if (started) {
    toffsList.add(new Toffs(toffs.getStartOffset(), field.end));
    // toffsIterator.remove(); // is this necessary?
}
else if (ended) {
    toffsList.add(new Toffs(fieldStart, toff.getEndOffset()));
    // toffsIterator.remove(); // is this necessary?
}
else if (toffs.getEndOffset() > fieldEnd) {
    // ie the toff spans whole field
    toffsList.add(new Toffs(fieldStart, fieldEnd));
    // toffsIterator.remove(); // is this necessary?
}
{code}


> Fast Vector Highlighter does not properly span neighboring term offsets
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-6334
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6334
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/termvectors, modules/highlighter
>            Reporter: Chris Earle
>              Labels: easyfix
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
