lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4216) Token X exceeds length of provided text sized X
Date Mon, 06 Aug 2012 08:31:02 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429024#comment-13429024 ]

Uwe Schindler commented on LUCENE-4216:
---------------------------------------

Hi,

{code:java}
/** A tokenizer that will return tokens in the arabic alphabet. This tokenizer
 * is a bit rude since it also filters digits and punctuation, even in an arabic
 * part of stream. Well... I've planned to write a
 * "universal", highly configurable, character tokenizer.
 * @author Pierrick Brihaye, 2003
 */
{code}

You don't need to implement your own ArabicTokenizer; just subclass the abstract Lucene class
CharTokenizer, which already has all the functionality this comment in your source code describes.
The change is easy: subclass it directly, remove all code except isArabicChar, and rename this
method to isTokenChar (it takes an int, not a char, but that's just a cast). The Tashkel handling
should be done with a PatternReplaceFilter wrapped on top of this Tokenizer; there is no need to
have it in the Tokenizer itself, where it only makes the code complex. Then you can be 100% sure
that all offsets are correct. The code you use is a duplicate, and it is too risky to reinvent
the wheel when a well-tested variant ships with the Lucene distribution. It is much easier,
trust me: no need to implement any crazy reset, ... methods!
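
To make the suggestion concrete, here is a minimal sketch in plain Java (no Lucene dependency) of what the character predicate might look like. Everything in it is an assumption for illustration: the class name ArabicCharSketch, the choice of "Arabic Unicode block, letters plus combining marks" as the rule, and the U+064B to U+0652 range used for the diacritics. The real rule lives in the attached ArabicTokenizer.java, which is not reproduced here. In a CharTokenizer subclass, the body of isTokenChar below would go into the overridden isTokenChar(int), and the pattern would be handed to a PatternReplaceFilter; the tokenOffsets helper only imitates how a character-predicate tokenizer derives offsets and is not Lucene's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ArabicCharSketch {

    // Assumed diacritic range (fathatan through sukun); adjust to the marks you actually strip.
    static final Pattern TASHKEEL = Pattern.compile("[\u064B-\u0652]");

    /** Candidate predicate: Arabic letters and combining marks; digits and punctuation are rejected. */
    static boolean isTokenChar(int c) {
        if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.ARABIC) {
            return false;
        }
        // Marks stay inside the token so that a later filter can remove them.
        return Character.isLetter(c) || Character.getType(c) == Character.NON_SPACING_MARK;
    }

    /** Returns {start, end} offsets for each maximal run of token characters. */
    static List<int[]> tokenOffsets(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isTokenChar(cp)) {
                if (start < 0) {
                    start = i; // a new token begins here
                }
            } else if (start >= 0) {
                spans.add(new int[] {start, i}); // token ended before this character
                start = -1;
            }
            i += Character.charCount(cp);
        }
        if (start >= 0) {
            spans.add(new int[] {start, text.length()}); // token runs to end of text
        }
        return spans;
    }

    /** What the filter step would do to each token's text: drop the diacritics. */
    static String stripTashkeel(String token) {
        return TASHKEEL.matcher(token).replaceAll("");
    }
}
```

Because the offsets come straight from positions in the original text and the diacritics are removed only from the token text afterwards, an end offset can never exceed the length of the text passed to the highlighter.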
                
> Token X exceeds length of provided text sized X
> -----------------------------------------------
>
>                 Key: LUCENE-4216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4216
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0-ALPHA
>         Environment: Windows 7, jdk1.6.0_27
>            Reporter: Ibrahim
>         Attachments: ArabicTokenizer.java, myApp.zip
>
>
> I'm facing this exception:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token رأيكم exceeds length of provided text sized 170
> 	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
> 	at classes.myApp$16$1.run(myApp.java:1508)
> I tried to find anything wrong in my code when I started migrating from Lucene 3.6 to 4.0, without success. I found similar issues with HTMLStripCharFilter, e.g. LUCENE-3690 and LUCENE-2208, but not with SimpleHTMLFormatter, so I'm raising this here to see if there is really a bug or if something is wrong in my code with v4. The code that I'm using:
> final Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<font color=red>", "</font>"), new QueryScorer(query));
> .......
> final TokenStream tokenStream = TokenSources.getAnyTokenStream(defaultSearcher.getIndexReader(), j, "Line", analyzer);
> final TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, doc.get("Line"), false, 10);
> Please note that this is working fine with v3.6

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

