lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1702) Thai token type() bug
Date Fri, 19 Jun 2009 17:00:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721877#action_12721877 ]

Steven Rowe commented on LUCENE-1702:
-------------------------------------

bq. I think for this issue it would be best to wait for the 1.5.0 version of jflex for clarity.

+0, in that the arrival time for 1.5.0 is unknown, but I'll defer to your judgment.

bq. for reference (haven't looked at jflex), above-bmp support might require new data structures.
I think ICU uses things like tries / compactarrays to deal with the fact you have thousands
of codepoints with the same property value, etc.

Thanks for the heads-up.  The above-BMP property values for the currently supported properties
are now encoded on the 1.5 branch as range pairs (they just aren't accessible yet because
of the BMP limit).  Since JFlex is a regular expression engine, code for handling large character
sets (as sets of ranges) is already built-in, so I don't anticipate this will be a problem.
The main thing will just be to switch from char to int for character representation.
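
To make the range-pair idea concrete, here is a minimal sketch of a character set stored as sorted inclusive ranges and queried with int code points rather than char values. It is not JFlex's actual data structure; the class name CodePointRangeSet and the sample "digit" ranges (Thai digits and the mathematical digits above the BMP) are assumptions chosen purely for illustration.

```java
/**
 * Illustrative only: a character set held as sorted, non-overlapping,
 * inclusive code point ranges. Not taken from the JFlex sources.
 */
final class CodePointRangeSet {
    // Flattened range pairs: [start0, end0, start1, end1, ...]
    private final int[] bounds;

    CodePointRangeSet(int... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("ranges must come in start/end pairs");
        }
        for (int i = 0; i < bounds.length; i += 2) {
            if (bounds[i] > bounds[i + 1]) {
                throw new IllegalArgumentException("range start exceeds range end");
            }
            if (i > 0 && bounds[i] <= bounds[i - 1]) {
                throw new IllegalArgumentException("ranges must be sorted and disjoint");
            }
        }
        this.bounds = bounds.clone();
    }

    /** Binary search over the range pairs; takes an int so above-BMP values fit. */
    boolean contains(int codePoint) {
        int lo = 0;
        int hi = bounds.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < bounds[2 * mid]) {
                hi = mid - 1;
            } else if (codePoint > bounds[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    /** Counts members in a string, stepping by code point instead of by char. */
    int countMatches(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (contains(cp)) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample set: Thai digits plus the mathematical digits block.
        CodePointRangeSet digits = new CodePointRangeSet(0x0E50, 0x0E59, 0x1D7CE, 0x1D7FF);
        String sample = "\u0E51" + new String(Character.toChars(0x1D7D0)) + "a";
        System.out.println(digits.countMatches(sample));
    }
}
```

Two ranges cover every code point in the sample set regardless of how many code points they contain, which is why a range representation scales to the supplementary planes. Iterating with codePointAt and Character.charCount is what the char-to-int switch amounts to on the lookup side: a single char can only ever see one half of a surrogate pair.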

> Thai token type() bug
> ---------------------
>
>                 Key: LUCENE-1702
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1702
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>
> While adding tests for offsets & type to ThaiAnalyzer, i discovered it does not type
> Thai numeric digits correctly.
> ThaiAnalyzer uses StandardTokenizer, and this is really an issue with the grammar, which
> adds the entire [:Thai:] block to ALPHANUM.
> i propose that alphanum be described a little bit differently in the grammar.
> Instead, [:letter:] should be allowed to have diacritics/signs/combining marks attached
> to it.
> this would allow the [:thai:] hack to be completely removed, would allow StandardTokenizer
> to parse complex writing systems such as Indian languages, and would fix LUCENE-1545.
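
The proposal in the issue description can be sketched with java.util.regex standing in for the JFlex grammar. This is not the StandardTokenizer grammar or the eventual patch: the class name TokenTypeSketch and the simplified rules are assumptions, and the sketch deliberately omits the mixed letter/digit tokens that a real ALPHANUM rule would also accept. It only shows that "a letter followed by any number of combining marks" keeps Thai and Devanagari words whole while Thai digits are typed as numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a regex stand-in for the grammar change proposed in the issue. */
final class TokenTypeSketch {
    // num:   one or more decimal digits of any script (Unicode category Nd)
    // alnum: one or more letters, each optionally followed by combining marks
    private static final Pattern TOKEN = Pattern.compile(
            "(?<num>\\p{Nd}+)|(?<alnum>(?:\\p{L}\\p{M}*)+)");

    /** Returns "text/type" strings; the type labels mimic StandardTokenizer's. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String type = m.group("num") != null ? "<NUM>" : "<ALPHANUM>";
            tokens.add(m.group() + "/" + type);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A Thai word carrying two combining marks, then three Thai digits.
        System.out.println(tokenize("\u0E17\u0E35\u0E48 \u0E51\u0E52\u0E53"));
    }
}
```

No script-specific rule is needed here: the Thai vowel sign and tone mark belong to the Unicode mark categories, so they attach to the preceding letter through the general rule, and the same rule covers the Devanagari vowel signs.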

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



