From: Karl Wettin
Date: Wed, 22 May 2013 14:37:46 +0200
Subject: Blåbærsyltetøy v.s. Räksmörgås
To: java-user@lucene.apache.org

This is a question (or perhaps a line of thought) regarding the mutually intelligible Scandinavian languages Danish, Norwegian and Swedish.

The Swedish letters åäö are in fact the same letters as the Danish/Norwegian åæø. A Norwegian writing about the Swedish city of Göteborg writes Gøteborg, and a Swede writing about Svolvær will write Svolvär. This is easy to fix: I can just index synonyms where äö is replaced by æø and vice versa.

More problematic, at least in my head, is ASCII folding.

When Swedes lack the umlauted characters on their keyboard, they consistently type a, a, o instead of å, ä, ö. Foreigners also tend to use a, a, o.

In Norway people tend to type aa, ae and oe instead of å, æ and ø. Some use a, a, o. I've also seen oo, ao, etc. And permutations. I'm not sure about Denmark, but the pattern is probably the same. I have no idea which letters foreigners might substitute for them.

There's a lot of mismatch here. For a start, ASCIIFoldingFilter translates 'ä' to 'a' and 'æ' to 'ae'.
The rest is not aligned with what people actually type: 'ø', for example, is folded to 'o' rather than the more common 'oe'.

I'm considering:

* Forking ASCIIFoldingFilter with a bunch of strategies and indexing permutations of synonyms.

or

* Using a filter after ASCIIFoldingFilter that collapses every ae, oe, oo, and other double-vowel combination, keeping only the first vowel.

Has anyone else thought about this?

karl
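
A minimal sketch of the second option, to make the intended folding concrete. It is plain Java operating on a single term, not a Lucene TokenFilter, and the class name ScandinavianFoldingSketch and the method fold are hypothetical names chosen for illustration. It assumes the vowel pairs listed above (aa, ae, ao, oe, oo) are the only ones to collapse, and that å, ä, æ fold to 'a' and ö, ø fold to 'o'.

```java
import java.util.Locale;

// Hypothetical illustration only: not part of Lucene and not a TokenFilter.
final class ScandinavianFoldingSketch {

    private ScandinavianFoldingSketch() {}

    // Folds the Scandinavian letters to a single ASCII vowel and collapses
    // the ASCII digraphs people type in their place to that same vowel.
    static String fold(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            char next = i + 1 < lower.length() ? lower.charAt(i + 1) : '\0';
            switch (c) {
                case '\u00e5': // a with ring
                case '\u00e4': // a with umlaut
                case '\u00e6': // ae ligature
                    out.append('a');
                    break;
                case '\u00f6': // o with umlaut
                case '\u00f8': // o with stroke
                    out.append('o');
                    break;
                case 'a':
                    out.append('a');
                    // aa, ae, ao: keep the first vowel, skip the second
                    if (next == 'a' || next == 'e' || next == 'o') {
                        i++;
                    }
                    break;
                case 'o':
                    out.append('o');
                    // oe, oo: keep the first vowel, skip the second
                    if (next == 'e' || next == 'o') {
                        i++;
                    }
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this folding, Göteborg, Gøteborg, Goeteborg and Goteborg all reduce to the same term, as do Svolvær, Svolvär and Svolvaer. The trade-off is over-conflation: any word that genuinely contains one of these vowel pairs is shortened as well, so whether the extra recall is worth the lost precision would need to be checked against real queries.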