lucene-dev mailing list archives

From "Karl Wettin (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
Date Tue, 16 Oct 2007 02:21:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535040 ]

Karl Wettin commented on LUCENE-1029:
-------------------------------------

Hoss Man - 15/Oct/07 02:47 PM
> the equivalence described in the javadocs is one of visual character equivalence, not
> of semantic word equivalence - that would be a lot more complicated. if anyone would like
> to submit a patch containing a new filter that is capable of doing that, i'm sure the community
> would certainly welcome it.

I think you misunderstand why I focused on the stemmer. My point was that this filter cannot
be compared with a stemmer, as was done in earlier posts.

I do not think that the documentation is misleading, nor do I think there is any need to
break backwards compatibility. All I am saying is that I would welcome a solution that makes
this filter more configurable. I am not sure what a smart way to do that would be, though, and
I am open to discussion. Perhaps one could feed it a list of exceptions, perhaps a per-language
definition, perhaps something else?
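
To make the "feed it with exceptions" idea concrete, here is a rough sketch in plain Java.
It is not a patch and it does not use the Lucene TokenFilter API. The class name
ConfigurableAccentFolder and its fold() method are made up for illustration, and it strips
accents via Unicode decomposition (java.text.Normalizer) rather than the replacement table
the existing filter uses, so characters such as ø and æ are left untouched here. It also
assumes the input uses precomposed characters.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: folds accents except for a configurable set of characters. */
public class ConfigurableAccentFolder {

    // Characters that must never be folded, e.g. the Scandinavian letters.
    private final Set<Character> exceptions;

    public ConfigurableAccentFolder(Character... keep) {
        this.exceptions = new HashSet<>(Arrays.asList(keep));
    }

    public String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (exceptions.contains(c)) {
                out.append(c); // listed as an exception: keep as-is
                continue;
            }
            // Decompose into base character plus combining marks, then drop the marks.
            String decomposed = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
            for (int j = 0; j < decomposed.length(); j++) {
                char d = decomposed.charAt(j);
                if (Character.getType(d) != Character.NON_SPACING_MARK) {
                    out.append(d);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e4 = a-umlaut, \u00f6 = o-umlaut, \u00e5 = a-ring, \u00e9 = e-acute
        ConfigurableAccentFolder foldAll = new ConfigurableAccentFolder();
        ConfigurableAccentFolder nordic =
                new ConfigurableAccentFolder('\u00e4', '\u00f6', '\u00e5');
        System.out.println(foldAll.fold("s\u00e4\u00e4")); // folds both umlauts
        System.out.println(nordic.fold("s\u00e4\u00e4"));  // leaves the word unchanged
        System.out.println(nordic.fold("caf\u00e9"));      // still folds the e-acute
    }
}
```

A per-language definition could then be little more than a named exception set, chosen
when the analyzer is configured, with an empty set giving the current behaviour.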

> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>
> The ISOLatin1AccentFilter class is responsible for replacing "accented characters in
> the ISO Latin 1 character set by their unaccented equivalent".
> Some of the replacements performed for Scandinavian characters (used e.g. in the Finnish,
> Swedish, Danish languages etc.) are illegal. The Scandinavian characters are different from
> the accented characters used e.g. in Latin-based languages such as French in that these characters
> (ä, ö, å) represent entirely independent sounds in the language and therefore cannot be
> represented with any other sound without change of meaning. It is therefore illegal to replace
> these characters with any other character.
> This means for example that you can't change the Finnish word sää (weather) to saa
> (will have) because these are two entirely different words with different meanings. The same
> applies to Scandinavian languages as well.
> There's no connection between the sounds represented by ä and a; ö and o; or å and a.
> In addition to the three characters mentioned above, Danish and Norwegian use other special
> characters such as ø and æ. It should be checked whether the replacement is legal for these characters.
