lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
Date Mon, 07 Feb 2011 10:11:30 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991319#comment-12991319 ]

Uwe Schindler commented on LUCENE-2909:
---------------------------------------

The problem has nothing to do with CharFilters. It occurs whenever endOffset -
startOffset != termAtt.length().

For example, if you put a stemmer before the n-gram filter and it creates longer tokens
(like Portuguese -ã -> -ão or German ß -> ss), you have the same problem. A solution might
be to use a "factor" to correct these offsets: (endOffset - startOffset) / termAtt.length()
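
To make the idea concrete, here is a minimal sketch of how such a factor could be applied.
It is plain Java with no Lucene dependency; the class and method names (NGramOffsetSketch,
scale, gramOffsets) are hypothetical and this is not the committed fix, only an illustration
of scaling gram positions by (endOffset - startOffset) / termLength and clamping the result
to endOffset.

```java
import java.util.Arrays;

public class NGramOffsetSketch {

    /**
     * Maps a character position inside the (possibly lengthened) term text
     * back to an offset in the original text, proportionally.
     * All names here are illustrative, not Lucene API.
     */
    static int scale(int startOffset, int endOffset, int termLength, int posInTerm) {
        if (termLength == 0) {
            return startOffset;
        }
        int span = endOffset - startOffset;
        // Integer form of startOffset + posInTerm * (span / termLength);
        // long arithmetic avoids overflow, division truncates toward zero.
        int mapped = startOffset + (int) ((long) posInTerm * span / termLength);
        // Never report an offset beyond the end of the original token.
        return Math.min(mapped, endOffset);
    }

    /** Returns {start, end} offsets for the gram at [gramStart, gramStart + gramLength). */
    static int[] gramOffsets(int startOffset, int endOffset, int termLength,
                             int gramStart, int gramLength) {
        int start = scale(startOffset, endOffset, termLength, gramStart);
        int end = scale(startOffset, endOffset, termLength, gramStart + gramLength);
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        // Assumed example: original token "Straße" covers offsets 0..6,
        // but the term text was lengthened to "strasse" (7 chars).
        // The last bigram "se" sits at term positions 5..7; adding those
        // positions to startOffset directly would give 5..7, past the end.
        int[] last = gramOffsets(0, 6, 7, 5, 2);
        System.out.println(Arrays.toString(last)); // prints [4, 6]
    }
}
```

The scaled offsets are only approximate, since the real position of each character is not
known after the term text has been changed, but they stay within the original token's range.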

> NGramTokenFilter may generate offsets that exceed the length of original text
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-2909
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4
>            Reporter: Shinya Kasatani
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>         Attachments: TokenFilterOffset.patch
>
>
> When using NGramTokenFilter combined with CharFilters that lengthen the original text
> (such as "ß" -> "ss"), the generated offsets exceed the length of the original text.
> This causes InvalidTokenOffsetsException when you try to highlight the text in Solr.
> While it is not possible to know the accurate offset of each character once you tokenize
> the whole text with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid
> generating invalid offsets.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       


