Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <15005878.1192506830841.JavaMail.jira@brutus>
Date: Mon, 15 Oct 2007 20:53:50 -0700 (PDT)
From: "Mark Miller (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1029) Illegal character replacements in
 ISOLatin1AccentFilter
In-Reply-To: <9663580.1192433211107.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535049 ]

Mark Miller commented on LUCENE-1029:
-------------------------------------

My comment about stemming was not meant to compare a stemmer to a diacritical stripper, but rather to point out that the result of such an operation does not necessarily have to create something 'legal' (just as a stemmer does not create 'legal' words). This was in response to the comment 'Some of the ISOLatin1AccentFilter are legal while others are illegal.'

Your point about semantic meaning is well taken, but semantics was not intended to be part of the comparison I was going for. My bad.

I think that the fact that ripping diacriticals can change the meaning of words goes without saying...otherwise, why even have them in the language? As Uwe said, the main motivating factor is to allow easy entry with the keyboard of another language. Of course this must come with a compromise. Other search engines I have seen (CPL, SearchServer, etc.) offer the exact functionality of this class.

Literally, this thing is called an accent filter...letters go in, accents come off. Doing more really does seem like a job for another class. If I can borrow a word I didn't know from DM Smith, transliteration seems to go beyond an ISOLatin1AccentFilter. This is a tough sell I know -- programmers seem to push the definition of filter to its limits and IMHO into the realm of transform/translate.
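
To make the "letters go in, accents come off" distinction concrete, here is a minimal sketch. It is not the ISOLatin1AccentFilter code; it assumes java.text.Normalizer (Java 6 or later), and the class and method names are made up for illustration. It shows both sides of the argument: pure accent stripping collapses the Finnish pair from the issue description into one term, while characters with no accent to remove are left alone, since mapping them is transliteration.

```java
import java.text.Normalizer;

// Hypothetical illustration only; not part of Lucene.
final class AccentStripSketch {

    // Decompose each character into base letter + combining marks (NFD),
    // then delete the combining marks (Unicode general category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "s\u00e4\u00e4" is the Finnish word from the issue (a-umlaut twice).
        // After stripping it becomes indistinguishable from "saa".
        System.out.println(stripAccents("s\u00e4\u00e4").equals("saa")); // prints true

        // o-slash (U+00F8) and ae (U+00E6) have no canonical decomposition,
        // so a pure accent stripper leaves them untouched. Turning them into
        // "o" or "ae" would be transliteration, a separate step.
        System.out.println(stripAccents("\u00f8\u00e6").equals("\u00f8\u00e6")); // prints true
    }
}
```

Whether the first result is acceptable is exactly the compromise under discussion: it helps users typing on a keyboard without these letters, at the cost of merging distinct words.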

Anyhow...I apologize for beating a dead horse...<g>

> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>
> The ISOLatin1AccentFilter class is responsible for replacing "accented characters in the ISO Latin 1 character set by their unaccented equivalent".
> Some of the replacements performed for scandinavian characters (used e.g. in the finnish, swedish, danish languages etc.) are illegal. The scandinavian characters are different from the accented characters used e.g. in latin based languages such as french in that these characters (ä, ö, å) represent entirely independent sounds in the language and therefore cannot be represented with any other sound without change of meaning. It is therefore illegal to replace these characters with any other character.
> This means for example that you can't change the finnish word sää (weather) to saa (will have) because these are two entirely different words with different meaning. The same applies to scandinavian languages as well.
> There's no connection between the sounds represented by ä and a; ö and o or å and a.
> In addition to the three characters mentioned above danish and norwegian use other special characters such as ø and æ. It should be checked if the replacement is legal for these characters.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org