lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets
Date Wed, 24 Oct 2018 11:37:00 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Woodward updated LUCENE-8509:
----------------------------------
    Attachment: LUCENE-8509.patch

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets
> ----------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8509
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8509
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces the tokens "a b" and " bb" (note the leading space in the
> second token). WordDelimiterGraphFilter (WDGF) splits the first token in two and adjusts
> the offsets of the second part, so we get "a"[0,1] and "b"[2,3]. The trim filter removes
> the leading space from the second token but leaves its offsets unchanged, so WDGF sees
> "bb"[1,4]. Because the leading space has already been stripped, WDGF sees no need to
> adjust offsets and emits the token as-is. The start offsets of the token stream are
> therefore [0, 2, 1], which go backwards, and the IndexWriter rejects it.
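
The offset arithmetic above can be traced with a small standalone sketch. This is not Lucene code and does not use the Lucene API: the class BackwardsOffsetsSketch, the Token holder, and the methods trim, splitOnSpaces, startOffsets and goesBackwards are all hypothetical simplifications of the behaviour described in the issue. The input offsets "a b"[0,3] and " bb"[1,4] are inferred from the offsets quoted above, not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardsOffsetsSketch {

    /** Token text plus its start/end character offsets in the original input. */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Stand-in for TrimFilter: strips surrounding whitespace, leaves offsets unchanged. */
    static Token trim(Token t) {
        return new Token(t.text.trim(), t.start, t.end);
    }

    /**
     * Stand-in for WordDelimiterGraphFilter (assumption: spaces are the only delimiter).
     * A token containing a space is split and each part gets offsets relative to the
     * token's start; a token with no space is emitted as-is, offsets included.
     */
    static List<Token> splitOnSpaces(Token t) {
        List<Token> parts = new ArrayList<>();
        if (t.text.indexOf(' ') < 0) {
            parts.add(t);
            return parts;
        }
        int i = 0;
        while (i < t.text.length()) {
            if (t.text.charAt(i) == ' ') {
                i++;
                continue;
            }
            int j = i;
            while (j < t.text.length() && t.text.charAt(j) != ' ') {
                j++;
            }
            parts.add(new Token(t.text.substring(i, j), t.start + i, t.start + j));
            i = j;
        }
        return parts;
    }

    /** Runs each ngram through trim, then the split, and collects the start offsets. */
    static List<Integer> startOffsets(List<Token> ngrams) {
        List<Integer> starts = new ArrayList<>();
        for (Token ngram : ngrams) {
            for (Token part : splitOnSpaces(trim(ngram))) {
                starts.add(part.start);
            }
        }
        return starts;
    }

    /** True if any start offset is smaller than the one before it. */
    static boolean goesBackwards(List<Integer> starts) {
        for (int i = 1; i < starts.size(); i++) {
            if (starts.get(i) < starts.get(i - 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Token> ngrams = new ArrayList<>();
        ngrams.add(new Token("a b", 0, 3));  // first ngram
        ngrams.add(new Token(" bb", 1, 4));  // second ngram, leading space
        List<Integer> starts = startOffsets(ngrams);
        System.out.println(starts + " backwards=" + goesBackwards(starts));
    }
}
```

Running main prints the start offsets [0, 2, 1] with backwards=true, matching the sequence reported in the issue. The sketch only reproduces the arithmetic; it says nothing about how the attached patch resolves the problem.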



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


