lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1793) remove custom encoding support in Greek/Russian Analyzers
Date Sun, 09 Aug 2009 15:42:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741098#action_12741098 ]

Uwe Schindler commented on LUCENE-1793:
---------------------------------------

I would also strongly suggest removing these custom charsets. They do not conform to Unicode,
because they use char/codepoint mappings that simply define a US-ASCII char for some of the
input chars. The problems begin with mixed-language texts.
This strange (and wrong) mapping can also be seen in the tests: they load a KOI-8 file with
encoding ISO-8859-1 (to get the native bytes as chars) and then map it. This is very bad!
The analyzers should really work only on Unicode codepoints and nothing more. For backwards
compatibility with old indexes (which are encoded using this strange mapping), we have to preserve
the charsets for a while, but we should deprecate all of them and leave only UTF-16 (Java chars) as input.
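
To illustrate the difference, here is a minimal sketch (not code from the analyzers or their
tests; the class and method names are made up for this example). It assumes the JDK provides
a KOI8-R charset, which mainstream JDKs do. It contrasts decoding KOI8-R bytes with their real
charset against reading them as ISO-8859-1 to get the native bytes as chars:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class Koi8DecodingSketch {
    // Assumption: the running JDK ships a KOI8-R charset.
    static final Charset KOI8_R = Charset.forName("KOI8-R");

    /** Decodes the raw bytes with their real charset, yielding Unicode text. */
    static String decodeProperly(byte[] koi8Bytes) {
        return new String(koi8Bytes, KOI8_R);
    }

    /** Mimics the test setup: every byte becomes the char with the same numeric value. */
    static String decodeAsLatin1(byte[] koi8Bytes) {
        return new String(koi8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String upper = "\u041c\u0418\u0420"; // three uppercase Cyrillic letters
        String lower = "\u043c\u0438\u0440"; // the same letters in lowercase
        byte[] raw = upper.getBytes(KOI8_R);

        String unicode = decodeProperly(raw);
        String bytesAsChars = decodeAsLatin1(raw);

        // Plain Unicode lowercasing works on correctly decoded text,
        // so no charset-specific lowercase table is needed.
        System.out.println(unicode.toLowerCase(Locale.ROOT).equals(lower));

        // The Latin-1 view contains no Cyrillic codepoints at all, only chars
        // below U+0100, so it needs a custom mapping to make sense of it.
        System.out.println(bytesAsChars.equals(unicode));
    }
}
```

If the conversion happens once at the input boundary, as in decodeProperly, everything
downstream can work on Java chars only.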

You are right: to reduce index size, it would be good to also support other encodings in
addition to UTF-8 for storing term text.
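
As a rough illustration of the size difference (a sketch only; it says nothing about how
Lucene stores terms, and the class name is made up), the following compares the encoded
length of a Cyrillic term in UTF-8 and in KOI8-R, again assuming the JDK provides KOI8-R:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TermSizeSketch {
    /** Returns the number of bytes the term occupies in the given charset. */
    static int encodedLength(String term, Charset charset) {
        return term.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // A six-letter Cyrillic word, written with escapes to avoid source-encoding issues.
        String term = "\u043b\u044e\u0431\u043e\u0432\u044c";

        // UTF-8 needs two bytes for each Cyrillic letter in the U+0400 block.
        System.out.println(encodedLength(term, StandardCharsets.UTF_8)); // prints 12

        // KOI8-R is a single-byte charset, so each letter takes one byte.
        System.out.println(encodedLength(term, Charset.forName("KOI8-R"))); // prints 6
    }
}
```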

> remove custom encoding support in Greek/Russian Analyzers
> ---------------------------------------------------------
>
>                 Key: LUCENE-1793
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1793
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Priority: Minor
>
> The Greek and Russian analyzers support custom encodings such as KOI-8, they define things
> like Lowercase and tokenization for these.
> I think that analyzers should support unicode and that conversion/handling of other charsets
> belongs somewhere else.
> I would like to deprecate/remove the support for these other encodings.
