lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose
Date Thu, 02 May 2019 13:02:00 GMT


Simon Willnauer commented on LUCENE-8776:

[~venkat11] I do understand your frustration. Believe me, we don't take changes like this
lightly. One person's bug is another person's feature, and as we grow and mature, strong guarantees
are essential for the vast majority of users, for future development, for faster iteration,
and for more performant code. There might not be a tradeoff from your perspective, but from the
maintainers' perspective there is. Now, we can debate whether a major version bump is _enough_ time
to migrate or not; our policy is that we can make BWC and behavioral changes like this in a major
release. In fact, we don't make them in minors, to give you the time you need and to ease upgrades
to minors. We will build, and have built, features on top of this guarantee, and in order to manage
expectations: I am pretty sure we won't go back and allow negative offsets. I think your best option,
whether you like it or not, is to work towards a fix for your issue, either with the tools you have
now or by improving Lucene, for instance with the suggestion from [~mgibney] regarding indexing
more information.

Please don't get mad at me, I am just trying to manage expectations. 

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>                 Key: LUCENE-8776
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries
> and highlight them properly.
> During index time, light-emitting-diode is split into three words, which allows me to
> search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions
> in the index, as 'light' adjacent to 'emitting' and 'light' at a distance of two words from
> 'diode' need to match this word. So, the order of words after splitting is: Organic, light,
> emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or
> 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a)
> in the same position as 'light' and (b) in the same position as 'diode', like below:
> | organic | light                | emitting | diode                | glows |
> |         | light-emitting-diode |          | light-emitting-diode |       |
> | 0       | 1                    | 2        | 3                    | 4     |
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously
> the same. This works beautifully in Lucene 5.x in both searching and highlighting with span
> queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards"
> at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments
> on why this check is needed. As I explained above, startOffset going backwards is perfectly
> valid, to deal with word splitting and span operations on these specialized use cases. On
> the other hand, it is not clear what value is added by this check and which highlighter code
> is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents
> legitimate use cases. Can this check be removed?
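
For readers following the thread, here is a minimal sketch of why the token layout described
above trips the check. It does not use Lucene: the Token class and firstBackwardsOffset are
hypothetical stand-ins, and the rule they apply (a token's start offset must not be smaller than
the previous token's start offset) is an assumption modeled on the "Offsets must not go backwards"
message quoted in the issue. The emission order within a position is also assumed. The character
offsets are computed from the example line "Organic light-emitting-diode glows".

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetOrderSketch {

    /** Hypothetical stand-in for one emitted token; this is not a Lucene class. */
    public static final class Token {
        public final String term;
        public final int position;
        public final int startOffset;
        public final int endOffset;

        public Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Tokens for "Organic light-emitting-diode glows", in assumed emission order. */
    public static List<Token> exampleStream() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("organic", 0, 0, 7));
        tokens.add(new Token("light", 1, 8, 13));
        tokens.add(new Token("light-emitting-diode", 1, 8, 28));
        tokens.add(new Token("emitting", 2, 14, 22));
        tokens.add(new Token("diode", 3, 23, 28));
        // Second copy of the compound: position 3, but start offset 8 follows 23.
        tokens.add(new Token("light-emitting-diode", 3, 8, 28));
        tokens.add(new Token("glows", 4, 29, 34));
        return tokens;
    }

    /**
     * Returns the index of the first token whose start offset is smaller than
     * the start offset of the token before it, or -1 if there is none.
     */
    public static int firstBackwardsOffset(List<Token> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i).startOffset;
            if (start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Token> tokens = exampleStream();
        int bad = firstBackwardsOffset(tokens);
        if (bad < 0) {
            System.out.println("offsets never go backwards");
        } else {
            Token t = tokens.get(bad);
            System.out.println("offsets go backwards at token " + bad
                + " ('" + t.term + "', position " + t.position
                + ", startOffset " + t.startOffset + ")");
        }
    }
}
```

In this model the first copy of the compound at position 1 is fine, because its start offset
equals that of 'light'. Only the second copy at position 3 is flagged, since its start offset (8)
comes after 'diode' (23).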
