lucene-dev mailing list archives

From "Ibrahim (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4216) Token X exceeds length of provided text sized X
Date Sun, 05 Aug 2012 07:11:02 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428819#comment-13428819 ]

Ibrahim commented on LUCENE-4216:
---------------------------------

Appreciated. It started working after I implemented reset() and end() and used correctOffset().
For Tashkeel (Arabic diacritics), the offsets should not be adjusted, because the marks are part
of the word even though they do not have to be written when searching or indexing. That is
simply how Arabic is written.
I also have another Tokenizer that handles Arabic by considering the roots, using an index of
the Arabic roots (>600,000). I might propose it for contrib later, if the large size of the
roots index (16 MB) is acceptable.
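
To illustrate the point about Tashkeel offsets, here is a minimal stand-alone sketch in plain Java. It does not use the Lucene API, and the class and method names (TashkeelOffsetSketch, Token, tokenize) are hypothetical, not taken from the attached code. It assumes Tashkeel means the marks U+064B to U+0652 and that words are separated by whitespace. The marks are dropped from the term, but the start and end offsets still span the word as it appears in the original text, so they never run past the text length.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch; none of these names come from Lucene or from the reporter's code.
class TashkeelOffsetSketch {

    /** A term plus the span it covers in the original text (end is exclusive). */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Assumption: Tashkeel is the range U+064B to U+0652 (fathatan through sukun). */
    static boolean isTashkeel(char c) {
        return c >= 0x064B && c <= 0x0652;
    }

    /** Splits on whitespace, strips Tashkeel from each term, keeps original offsets. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between words.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            StringBuilder term = new StringBuilder();
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                char c = text.charAt(i);
                if (!isTashkeel(c)) {
                    term.append(c); // the mark is left out of the term only
                }
                i++; // the offset still advances over every original character
            }
            if (term.length() > 0) {
                tokens.add(new Token(term.toString(), start, i));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word from the issue, written with diacritics, followed by a plain word.
        String text = "\u0631\u064E\u0623\u0652\u064A\u064F\u0643\u064F\u0645\u0652 ok";
        for (Token t : tokenize(text)) {
            System.out.println(t.term + " [" + t.start + "," + t.end + ") of " + text.length());
        }
    }
}
```

The term ends up shorter than the span it covers, which is the intended result: the highlighter needs the span in the stored text, not the length of the normalized term.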

Thanks again
                
> Token X exceeds length of provided text sized X
> -----------------------------------------------
>
>                 Key: LUCENE-4216
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4216
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/highlighter
>    Affects Versions: 4.0-ALPHA
>         Environment: Windows 7, jdk1.6.0_27
>            Reporter: Ibrahim
>         Attachments: myApp.zip
>
>
> I'm facing this exception:
> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token رأيكم exceeds length of provided text sized 170
> 	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
> 	at classes.myApp$16$1.run(myApp.java:1508)
> I could not find anything wrong in my code after I started migrating from Lucene 3.6 to 4.0.
> I found similar issues reported for HTMLStripCharFilter (e.g. LUCENE-3690, LUCENE-2208), but
> none for SimpleHTMLFormatter, so I am raising this here to find out whether this is really a
> bug or something wrong in my code under v4. The code that I am using:
> final Highlighter highlighter = new Highlighter(
>     new SimpleHTMLFormatter("<font color=red>", "</font>"), new QueryScorer(query));
> .......
> final TokenStream tokenStream = TokenSources.getAnyTokenStream(
>     defaultSearcher.getIndexReader(), j, "Line", analyzer);
> final TextFragment[] frag = highlighter.getBestTextFragments(
>     tokenStream, doc.get("Line"), false, 10);
> Please note that this is working fine with v3.6
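
The message in the quoted stack trace indicates that a token's offsets pointed past the end of the text handed to getBestTextFragments. The following is a stand-alone sketch of that kind of bounds check, written in plain Java so that it runs without Lucene. It is not Lucene's source: OffsetBoundsSketch, BadOffsetsException, checkOffsets and describe are hypothetical names, and the exact condition Lucene applies is an assumption based on the wording of the message.

```java
// Stand-alone sketch; these names are illustrative and not part of Lucene.
class OffsetBoundsSketch {

    /** Hypothetical stand-in for the highlighter's InvalidTokenOffsetsException. */
    static final class BadOffsetsException extends Exception {
        BadOffsetsException(String message) {
            super(message);
        }
    }

    /** Rejects a token whose start or end offset lies beyond the end of the text. */
    static void checkOffsets(String term, int start, int end, String text)
            throws BadOffsetsException {
        if (start > text.length() || end > text.length()) {
            throw new BadOffsetsException("Token " + term
                    + " exceeds length of provided text sized " + text.length());
        }
    }

    /** Convenience wrapper: returns "ok" or the message of the failed check. */
    static String describe(String term, int start, int end, String text) {
        try {
            checkOffsets(term, start, end, text);
            return "ok";
        } catch (BadOffsetsException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String text = "abc def";
        // Offsets that match the text pass the check.
        System.out.println(describe("def", 4, 7, text));
        // An end offset of 9 points past the 7 characters of the text.
        System.out.println(describe("def", 4, 9, text));
    }
}
```

A token stream whose offsets were computed against a different or longer string than the stored field value (for example, offsets that were never reset or corrected) would fail a check of this kind.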

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org