lucene-dev mailing list archives

From "Andrew Duffy (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Date Wed, 17 Sep 2008 14:41:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Duffy updated LUCENE-1389:
---------------------------------

    Attachment: positions.patch

I've attached another diff, again from the trunk version. There is a slight optimisation -
the span loop is broken early when a span is found at the current position.

The main change is to start(String), though. Previously, it set currentPosition to 0, meaning
every position was off by one and spans were not matched. It now starts currentPosition at
-1 so the first token position ends up 0 as it should.
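
To illustrate the off-by-one, here is a minimal sketch. It assumes the fragmenter adds each token's position increment to currentPosition before comparing against span positions; the class and method names are made up for illustration and are not part of the patch.

```java
import java.util.Arrays;

public class PositionSketch {
    // Hypothetical helper: returns the position assigned to each token when
    // the counter begins at `start` and each token's increment is added
    // before the position is read.
    static int[] positions(int start, int[] increments) {
        int[] out = new int[increments.length];
        int current = start;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            out[i] = current;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] increments = {1, 1, 1}; // three adjacent tokens
        // Starting at 0: the first token lands on position 1, so a span
        // that starts at position 0 is never matched.
        System.out.println(Arrays.toString(positions(0, increments)));  // [1, 2, 3]
        // Starting at -1: the first token lands on position 0.
        System.out.println(Arrays.toString(positions(-1, increments))); // [0, 1, 2]
    }
}
```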

> SimpleSpanFragmenter can create very short fragments
> ----------------------------------------------------
>
>                 Key: LUCENE-1389
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1389
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>    Affects Versions: 2.3.2
>            Reporter: Andrew Duffy
>            Priority: Minor
>         Attachments: positions.patch, tailfragments.patch
>
>
> Line 74 of SimpleSpanFragmenter returns true when the current token is the start of a
> hit on a span or phrase, thus starting a new fragment. Two problems occur:
> - The previous fragment may be very short, but if it contains a hit it will be combined
>   with the new fragment later, so this problem disappears.
> - If the token is close to a natural fragment boundary, the new fragment will end up very
>   short, possibly as short as just the span or phrase itself. This is the result of creating
>   a new fragment without incrementing currentNumFrags.
> To fix, remove or comment out line 74. The result is that fragments average out to the
> fragment size unless a span or phrase hit is towards the end of a fragment, in which case
> that fragment is made larger and the following fragment shorter to accommodate the hit.
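
The short-fragment effect described in the issue can be reproduced with a simplified model. This is a sketch only, not the SimpleSpanFragmenter source: it assumes a boundary test of the form endOffset >= fragmentSize * currentNumFrags, models the forced break at a span start with a flag, and ignores everything else the real class does. All names other than currentNumFrags are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class FragmentBoundarySketch {
    // Returns the character length of each fragment produced for tokens
    // described by their end offsets. When breakOnSpanStart is true, the
    // token at spanStartIndex forces a new fragment WITHOUT incrementing
    // currentNumFrags, mimicking the behaviour reported in the issue.
    static List<Integer> fragmentLengths(int[] endOffsets, int fragmentSize,
                                         int spanStartIndex, boolean breakOnSpanStart) {
        List<Integer> lengths = new ArrayList<>();
        int currentNumFrags = 1;
        int fragStart = 0; // offset where the current fragment began
        int prevEnd = 0;   // end offset of the previous token
        for (int i = 0; i < endOffsets.length; i++) {
            boolean newFrag;
            if (breakOnSpanStart && i == spanStartIndex) {
                newFrag = true; // forced break, counter left untouched
            } else {
                newFrag = endOffsets[i] >= fragmentSize * currentNumFrags;
                if (newFrag) {
                    currentNumFrags++;
                }
            }
            if (newFrag && i > 0) {
                lengths.add(prevEnd - fragStart);
                fragStart = prevEnd;
            }
            prevEnd = endOffsets[i];
        }
        lengths.add(prevEnd - fragStart); // close the final fragment
        return lengths;
    }

    public static void main(String[] args) {
        // 19 tokens of 10 characters each: end offsets 10, 20, ..., 190.
        int[] ends = new int[19];
        for (int i = 0; i < ends.length; i++) {
            ends[i] = 10 * (i + 1);
        }
        // Span hit starts at token 8 (offsets 80-90), just before the
        // natural boundary at offset 100.
        System.out.println(fragmentLengths(ends, 50, 8, true));  // [40, 40, 10, 50, 50]
        System.out.println(fragmentLengths(ends, 50, 8, false)); // [40, 50, 50, 50]
    }
}
```

With the forced break enabled, the fragment that starts at the span hit is cut off by the natural boundary one token later and ends up 10 characters long. With the forced break disabled, the hit simply stays inside a fragment of normal size.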

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

