lucene-dev mailing list archives

From "Shinya Kasatani (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-2909) NGramTokenFilter may generate offsets that exceed the length of original text
Date Mon, 07 Feb 2011 05:14:30 GMT
NGramTokenFilter may generate offsets that exceed the length of original text
-----------------------------------------------------------------------------

                 Key: LUCENE-2909
                 URL: https://issues.apache.org/jira/browse/LUCENE-2909
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 2.9.4
            Reporter: Shinya Kasatani
            Priority: Minor


When using NGramTokenFilter combined with CharFilters that lengthen the original text (such
as "ß" -> "ss"), the generated offsets exceed the length of the original text.
This causes InvalidTokenOffsetsException when you try to highlight the text in Solr.

While it is not possible to know the exact offset of each character once the whole text has
been tokenized with tokenizers like KeywordTokenizer, NGramTokenFilter should at least avoid
generating invalid offsets.
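A minimal, self-contained sketch of the offset mismatch (plain Java, no Lucene dependency; the example strings and n-gram span are hypothetical, chosen only to show the arithmetic):

```java
public class OffsetMismatchDemo {
    public static void main(String[] args) {
        String original = "straße";   // 6 chars, the text before char filtering
        String filtered = "strasse";  // 7 chars, after a mapping "ß" -> "ss"

        // An n-gram filter computes offsets against the filtered text.
        // The final trigram "sse" spans filtered[4..7):
        int startOffset = 4;
        int endOffset = 7;
        System.out.println(filtered.substring(startOffset, endOffset)); // prints "sse"

        // The highlighter, however, applies these offsets to the original
        // text, whose length is only 6 -- so endOffset exceeds it:
        System.out.println(endOffset > original.length()); // prints "true"
    }
}
```

In a real pipeline the offset correction would normally come from the CharFilter's correctOffset mapping; the point here is only that an end offset taken from the lengthened text can point past the end of the original.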


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

