lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
Date Mon, 07 Feb 2011 09:51:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991316#comment-12991316 ]

Robert Muir commented on LUCENE-2909:
-------------------------------------

Is the bug really in NGramTokenFilter? 

This seems to be a larger problem that would affect all TokenFilters that break larger tokens into smaller ones and recalculate offsets, right?

For example: EdgeNGramTokenFilter, ThaiWordFilter, SmartChineseAnalyzer's WordTokenFilter, etc.?
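
The pattern they all share is roughly this (paraphrasing the NGramTokenFilter loop from memory, not quoting it exactly):

    // inside incrementToken(), after pulling one token from input:
    //   curTermBuffer/curTermLength = the token's text and its length
    //   tokStart = offsetAtt.startOffset() of that token
    // each gram's offsets are derived from positions within the *term text*:
    offsetAtt.setOffset(tokStart + curPos, tokStart + curPos + curGramSize);

If a CharFilter lengthened the text, the tokenizer's corrected offsets span fewer characters than the term text contains: for "ß" -> "ss" the whole original text is 1 char, but the bigram "ss" gets end = 0 + 0 + 2 = 2.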

I think WordDelimiterFilter has special code that might avoid the problem (line 352), so it might be OK.
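
From memory, the guard there is something like this (a sketch of the idea only, variable names approximate):

    // if the incoming token's offset span doesn't match its term length,
    // a CharFilter (or something else upstream) changed the text length,
    // and we can't safely slice sub-offsets out of it:
    boolean hasIllegalOffsets = (savedEndOffset - savedStartOffset) != savedTermLength;
    ...
    if (hasIllegalOffsets) {
      // fall back to the whole token's offsets for every subword
      offsetAtt.setOffset(savedStartOffset, savedEndOffset);
    } else {
      offsetAtt.setOffset(savedStartOffset + start, savedStartOffset + end);
    }

Maybe the simplest fix for the filters listed above would be to copy that guard.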

Is there any better way we could solve this? For example, maybe instead of the tokenizer calling correctOffset(), it could be called somewhere else? That call seems to be what is causing the problem.
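
For anyone who wants to see the numbers come out end-to-end, here is an untested sketch (API from memory, 2.9/3.x core + contrib/analyzers; MappingCharFilter is just my stand-in for any text-lengthening CharFilter):

    import java.io.StringReader;
    import org.apache.lucene.analysis.CharReader;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.MappingCharFilter;
    import org.apache.lucene.analysis.NormalizeCharMap;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.ngram.NGramTokenFilter;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class OffsetRepro {
      public static void main(String[] args) throws Exception {
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("ß", "ss"); // lengthens the text: 1 char -> 2 chars
        TokenStream ts = new NGramTokenFilter(
            new KeywordTokenizer(
                new MappingCharFilter(map, CharReader.get(new StringReader("ß")))),
            2, 2);
        OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
        while (ts.incrementToken()) {
          // prints "0,2" for the bigram "ss",
          // but the original text is only 1 char long
          System.out.println(offsets.startOffset() + "," + offsets.endOffset());
        }
      }
    }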


> NGramTokenFilter may generate offsets that exceed the length of original text
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2909
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4
>            Reporter: Shinya Kasatani
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: TokenFilterOffset.patch
>
>
> When using NGramTokenFilter combined with CharFilters that lengthen the original text (such as "ß" -> "ss"), the generated offsets exceed the length of the original text.
> This causes InvalidTokenOffsetsException when you try to highlight the text in Solr.
> While it is not possible to know the accurate offset of each character once you tokenize the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

