From: "Mark Miller (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 15 Oct 2007 20:53:50 -0700 (PDT)
Subject: [jira] Commented: (LUCENE-1029) Illegal character replacements in ISOLatin1AccentFilter
Message-ID: <15005878.1192506830841.JavaMail.jira@brutus>
In-Reply-To: <9663580.1192433211107.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535049 ]

Mark Miller commented on LUCENE-1029:
-------------------------------------

My comment about stemming was not meant to compare a stemmer to a diacritical stripper, but rather to point out that the result of such an operation does not necessarily have to create something 'legal' (just as a stemmer does not create 'legal' words). This was in response to the comment 'Some of the ISOLatin1AccentFilter are legal while others are illegal.'

Your point about semantic meaning is well taken, but was not intended to be part of the comparison I was going for. My bad.

I think that the fact that ripping diacriticals can change the meaning of words goes without saying...otherwise, why even have them in the language? As Uwe said, the main motivating factor is to allow easy entry with the keyboard of another language. Of course this must come with a compromise. Other search engines I have seen (CPL, SearchServer, etc.) offer the exact functionality of this class.

Literally, this thing is called an accent filter...letters go in, accents come off. Doing more really does seem like a job for another class. If I can borrow a word I didn't know from DM Smith, transliteration seems to go beyond an ISOLatin1AccentFilter. This is a tough sell I know -- programmers seem to push the definition of filter to its limits and IMHO into the realm of transform/translate.

Anyhow...I apologize for beating a dead horse...

> Illegal character replacements in ISOLatin1AccentFilter
> -------------------------------------------------------
>
>                 Key: LUCENE-1029
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1029
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.2
>            Reporter: Marko Asplund
>
> The ISOLatin1AccentFilter class is responsible for replacing "accented characters in the ISO Latin 1 character set by their unaccented equivalent".
> Some of the replacements performed for Scandinavian characters (used e.g. in the Finnish, Swedish, Danish languages etc.) are illegal.
> The Scandinavian characters are different from the accented characters used e.g. in Latin-based languages such as French in that these characters (ä, ö, å) represent entirely independent sounds in the language and therefore cannot be represented with any other sound without change of meaning. It is therefore illegal to replace these characters with any other character.
> This means for example that you can't change the Finnish word sää (weather) to saa (will have) because these are two entirely different words with different meaning. The same applies to Scandinavian languages as well.
> There's no connection between the sounds represented by ä and a; ö and o or å and a.
> In addition to the three characters mentioned above, Danish and Norwegian use other special characters such as ø and æ. It should be checked if the replacement is legal for these characters.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
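
For readers following the thread, the "letters go in, accents come off" behaviour can be sketched roughly as below. This is not the ISOLatin1AccentFilter source and does not use any Lucene API. It is a stand-alone illustration that assumes Unicode NFD decomposition from the JDK's java.text.Normalizer as the stripping mechanism, and the class and method names (AccentStripSketch, stripAccents) are hypothetical.

```java
import java.text.Normalizer;

// Illustrative sketch only: strips combining marks after NFD decomposition.
// This is an assumed stand-in for "accent removal", not Lucene's implementation.
public class AccentStripSketch {

    // Hypothetical helper: decompose each character into base letter plus
    // combining marks, then drop the marks (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file encoding-independent.
        String[] samples = {
            "s\u00e4\u00e4", // Finnish word from the issue (a-umlaut twice)
            "\u00f6",        // o-umlaut
            "\u00e5",        // a-ring
            "\u00f8",        // o-slash
            "\u00e6"         // ae ligature
        };
        for (String s : samples) {
            System.out.println(s + " -> " + stripAccents(s));
        }
    }
}
```

Under this approach the Finnish example from the issue collapses to "saa", which is exactly the change of meaning the reporter objects to. The characters ø and æ have no canonical decomposition, so this sketch leaves them unchanged; mapping them to other letters would be transliteration rather than accent removal.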