lucene-dev mailing list archives

From "Ram Venkat (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8776) Start offset going backwards has a legitimate purpose
Date Wed, 01 May 2019 00:20:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830764#comment-16830764 ]

Ram Venkat edited comment on LUCENE-8776 at 5/1/19 12:19 AM:
-------------------------------------------------------------

[~mikemccand] - You are mischaracterizing a long-standing Lucene feature as a "bug". Offsets
going backwards worked exactly as we wanted. But it's not worth rehashing that argument.

I am not saying that this feature should never be retired, if retiring it adds great value.
But users should be given time to migrate their implementations to alternative methods.
That is standard practice in maintaining any product or library, especially a mature
library like Lucene. Hence, there should be a significant period of time during which users can
bypass the check that prevents indexing such documents (those whose offsets go backwards).

As for enhancing our query parser, that is not trivial. I am not sure whether the Lucene standard
query parser (or whatever you are referring to) can deal with the combination of wildcards
and term distance. For example, "light* adjacent_to glows" should match "light-emitting-diode
glows". This can be done in our parser, but it is too large a task for us to take on as part
of a version upgrade. This is why we need time to do this.
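
To make the query example concrete, here is a minimal sketch in plain Python (not Lucene
code and not our parser). It assumes that "adjacent_to" means "immediately before", and
that the index holds the token positions from the issue description quoted below. The
names build_index and prefix_adjacent_to are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# (term, position) pairs, following the layout in the issue description.
POSTINGS = [
    ("organic", 0),
    ("light", 1),
    ("light-emitting-diode", 1),
    ("emitting", 2),
    ("diode", 3),
    ("light-emitting-diode", 3),  # second copy, placed next to 'glows'
    ("glows", 4),
]


def build_index(postings):
    """Map each term to the set of positions at which it occurs."""
    index = defaultdict(set)
    for term, pos in postings:
        index[term].add(pos)
    return index


def prefix_adjacent_to(index, prefix, term):
    """Return (matched_term, position) pairs where a term starting with
    `prefix` sits immediately before `term`."""
    right = index.get(term, set())
    hits = []
    for candidate, left_positions in index.items():
        if candidate.startswith(prefix):
            for p in left_positions:
                if p + 1 in right:
                    hits.append((candidate, p))
    return sorted(hits)


index = build_index(POSTINGS)
matches = prefix_adjacent_to(index, "light", "glows")
```

With the second copy of 'light-emitting-diode' at position 3, the query matches on that
token. Without it, no term with the prefix "light" sits immediately before 'glows', so the
query finds nothing.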


was (Author: venkat11):
[~mikemccand] - You are mischaracterizing a long-standing Lucene feature as a "bug". Offsets
going backwards worked exactly as we wanted. But it's not worth rehashing that argument.

I am not saying that this feature should never be retired, if retiring it adds great value.
But users should be given time to migrate their implementations to alternative methods.
That is standard practice in maintaining any product or library, especially a mature
library like Lucene. Hence, there should be a significant period of time during which users can
bypass the check that prevents indexing such documents (those whose offsets go backwards).

As for enhancing our query parser, that is not trivial. I am not sure whether the Lucene standard
query parser (or whatever you are referring to) will deal with the combination of wildcards
and term distance. For example, "light* adjacent_to glows" should match "light-emitting-diode
glows". This can be done in our parser, but it is too large a task for us to take on as part
of a version upgrade. This is why we need time to do this.

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries
> and highlight them properly.
> During index time, light-emitting-diode is split into three words, which allows me to
> search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions
> in the index, because 'light' adjacent to 'emitting' and 'light' at a distance of two words from
> 'diode' both need to match this word. So, the order of words after splitting is: Organic, light,
> emitting, diode, glows.
> But I also want to search for 'organic' being adjacent to 'light-emitting-diode' or
> 'light-emitting-diode' being adjacent to 'glows'.
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a)
> in the same position as 'light' and (b) in the same position as 'diode', as shown below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' tokens are 1 and 3, but their offsets are
> the same. This works beautifully in Lucene 5.x in both searching and highlighting with span
> queries.
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards"
> at DefaultIndexingChain:818. This IllegalArgumentException is thrown without any comment
> on why the check is needed. As I explained above, startOffset going backwards is perfectly
> valid for dealing with word splitting and span operations in these specialized use cases. On
> the other hand, it is not clear what value this check adds or which highlighter code
> is affected by offsets going backwards. The same check is done at BaseTokenStreamTestCase:245.
> I see others talk about how this check found bugs in WordDelimiter etc., but it also prevents
> legitimate use cases. Can this check be removed?
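
The token layout in the issue description above can be modelled in a few lines of plain
Python. This is only a sketch under stated assumptions: the Token tuple, the character
offsets computed from the sample sentence, and the simplified first_backward_offset check
are illustrative and are not Lucene's actual code. It shows which token trips a rule of
the form "a token's start offset must not be smaller than the previous token's".

```python
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int  # position increment relative to the previous token
    start: int    # start character offset in the source text
    end: int      # end character offset (exclusive)


TEXT = "Organic light-emitting-diode glows"
COMPOUND = "light-emitting-diode"

# Tokens in the order they would be emitted, sorted by position.
TOKENS = [
    Token("organic", 1, 0, 7),
    Token("light", 1, 8, 13),
    Token(COMPOUND, 0, 8, 28),     # same position as 'light'
    Token("emitting", 1, 14, 22),
    Token("diode", 1, 23, 28),
    Token(COMPOUND, 0, 8, 28),     # same position as 'diode'
    Token("glows", 1, 29, 34),
]


def positions(tokens):
    """Accumulate position increments into absolute positions."""
    pos = -1
    out = []
    for t in tokens:
        pos += t.pos_inc
        out.append((t.term, pos))
    return out


def first_backward_offset(tokens):
    """Return the index of the first token whose start offset is smaller
    than the previous token's start offset, or None if there is none."""
    last_start = 0
    for i, t in enumerate(tokens):
        if t.start < last_start:
            return i
        last_start = t.start
    return None


offending = first_backward_offset(TOKENS)
```

In this model the second 'light-emitting-diode' token (index 5) is the one flagged: it
starts at offset 8, after 'diode' has already advanced the start offset to 23. The first
copy is not flagged, because it shares its start offset with 'light'.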



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

