Subject: RE: umlauts / diacritic expansion
From: Markus Jelsma
To: java-user@lucene.apache.org
Date: Tue, 16 Apr 2019 18:45:52 +0000

Hello Michael,

For the case of normalizing ü to ue, take a look at the German normalizer [1].

Regards,
Markus

[1] https://lucene.apache.org/core/7_6_0/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html

-----Original message-----
> From: Ralf Heyde
> Sent: Tuesday 16th April 2019 20:28
> To: java-user@lucene.apache.org
> Subject: Re: umlauts / diacritic expansion
>
> Hey,
>
> Take a look at ASCIIFoldingFilter - this one is quite generic.
>
> Does this answer your question?
>
> Cheers Ralf
>
> Sent from my iPhone
>
> > On 16.04.2019 at 20:08, Michael Sokolov wrote:
> >
> > I'm learning how to index/search German today and understanding that
> > vowels with umlauts are conventionally expanded into two ASCII
> > characters, e.g. "für" -> "fuer", so people may search for the expanded
> > form "fuer", but they might also search with the diacritic, and
> > finally they might lazily search using the stripped form "fur".
> >
> > My question: is there a standard CharFilter or TokenFilter that
> > expands to both (ASCII) forms, for characters with umlauts and perhaps
> > other diacritics I might be unaware of in other languages having
> > similar multiple renderings in ASCII?
> >
> > -Mike
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
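
For illustration only, the three search forms from the original question ("für", "fuer", "fur") can be produced with a few lines of plain JDK code. This is a minimal sketch, not Lucene code and not the implementation of either filter named in the thread: the class name UmlautVariants, its methods, and the expansion table are all hypothetical, and the table covers only the German umlauts and ß.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, JDK only (Java 9+ for Map.of); not part of Lucene.
final class UmlautVariants {

    // Assumed table of conventional German two-letter expansions.
    // Unicode escapes keep the source independent of file encoding.
    private static final Map<Character, String> EXPANSIONS = Map.of(
            '\u00e4', "ae", '\u00f6', "oe", '\u00fc', "ue",   // ä ö ü
            '\u00c4', "Ae", '\u00d6', "Oe", '\u00dc', "Ue",   // Ä Ö Ü
            '\u00df', "ss");                                  // ß

    // Expanded form: "für" -> "fuer". Characters not in the table pass through.
    static String expand(String term) {
        // Compose first so that "u" + combining diaeresis is seen as one char.
        String composed = Normalizer.normalize(term, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            String replacement = EXPANSIONS.get(c);
            sb.append(replacement != null ? replacement : String.valueOf(c));
        }
        return sb.toString();
    }

    // Stripped form: "für" -> "fur". Decompose, then drop combining marks.
    static String strip(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        // ß has no decomposition, so it is handled separately.
        return decomposed.replaceAll("\\p{M}", "").replace("\u00df", "ss");
    }

    // Original, expanded and stripped forms, de-duplicated, in that order.
    static Set<String> variants(String term) {
        Set<String> out = new LinkedHashSet<>();
        out.add(term);
        out.add(expand(term));
        out.add(strip(term));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("f\u00fcr"));
    }
}
```

In an actual Lucene analysis chain, equivalent logic would have to live in a TokenFilter that emits the extra forms at the same position; the sketch above only shows the string mapping itself.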