Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1593925030.1245166867353.JavaMail.jira@brutus>
Date: Tue, 16 Jun 2009 08:41:07 -0700 (PDT)
From: "Robert Muir (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1696) Added New Token API impl for
 ASCIIFoldingFilter
In-Reply-To: <2059492571.1245163027417.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


    [ https://issues.apache.org/jira/browse/LUCENE-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720193#action_12720193 ]

Robert Muir commented on LUCENE-1696:
-------------------------------------

Simon, actually I think it's documented that you can use the ENGLISH collator and it will behave like ASCIIFoldingFilter (simply removing all diacritics).
You could then apply tailorings like the example and get the behavior you want, versus maintaining a custom ASCIIFoldingFilter...
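
A minimal sketch of the idea in this comment, using only the JDK's java.text classes rather than Lucene's collation filters. It assumes PRIMARY strength is what makes the ENGLISH collator ignore diacritics, and it appends a hypothetical tailoring so that a/o/u with a diaeresis sort like ae/oe/ue. The class name, method names, and the tailoring string are illustrative assumptions; they are not taken from the attached TestGermanCollation.java.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class CollatorFoldingSketch {

    /** True if the ENGLISH collator at PRIMARY strength treats a and b as equal. */
    static boolean equalPrimary(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // PRIMARY strength ignores accent (secondary) and case (tertiary) differences.
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    /** True if a and b are equal under the ENGLISH rules plus the umlaut tailoring. */
    static boolean equalTailored(String a, String b) {
        // Assumption: the JDK hands back a RuleBasedCollator for this locale.
        RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        // Hypothetical tailoring: a/o/u followed by U+0308 (combining diaeresis)
        // differ from ae/oe/ue only at the tertiary level.
        String tailoring =
              "& ae , a\u0308 & AE , A\u0308"
            + "& oe , o\u0308 & OE , O\u0308"
            + "& ue , u\u0308 & UE , U\u0308";
        try {
            RuleBasedCollator tailored = new RuleBasedCollator(base.getRules() + tailoring);
            tailored.setStrength(Collator.PRIMARY);
            return tailored.compare(a, b) == 0;
        } catch (ParseException e) {
            throw new IllegalStateException("tailoring rules did not parse", e);
        }
    }

    public static void main(String[] args) {
        // Untailored: the umlaut is simply dropped for comparison purposes.
        System.out.println(equalPrimary("s\u00fcd", "sud"));
        // Tailored: the umlaut expands, so "sued" matches and "sud" no longer does.
        System.out.println(equalTailored("s\u00fcd", "sued"));
        System.out.println(equalTailored("s\u00fcd", "sud"));
    }
}
```

The tailored rules only change how strings compare; they do not rewrite the term text the way a folding filter does.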

> Added New Token API impl for ASCIIFoldingFilter
> -----------------------------------------------
>
>                 Key: LUCENE-1696
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1696
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Simon Willnauer
>            Assignee: Mark Miller
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter._newTokenAPI.patch, TestGermanCollation.java
>
>
> I added an implementation of incrementToken to ASCIIFoldingFilter.java and extended the existing test case for it.
> I will attach the patch shortly.
> Besides this improvement, I would like to start a small discussion about this filter. ASCIIFoldingFilter is meant to be a replacement for ISOLatin1AccentFilter, which is quite nice as it covers a superset of the latter. I have used this filter quite often, but never on an as-is basis. In most cases this filter does the correct thing (replaces a special char with its ASCII counterpart), but in some cases, such as German umlauts, it does not return the expected result. A German umlaut like 'ä' does not translate to 'a' but rather to 'ae'. I would like to change this, but I'm not 100% sure that is expected by all users of the filter. Another way of doing it would be to make it configurable with a flag. This would not affect performance, as we only check whether such an umlaut char is found.
> Further, it would be really helpful if the filter could "inject" the original/unmodified token with the same position increment into the token stream on demand. I think it's a valid use case to index both the modified and the unmodified token. For instance, the German word "süd" would be folded to "sud". In a query q:(süd) the filter would also fold to sud and therefore find sud, which has a totally different meaning. Folding works quite well, but for special cases we could add those options to make users' lives easier. The latter could be done in a subclass, while the umlaut problem should be fixed in the base class.
> simon
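
The two options proposed in the description (an umlaut flag and on-demand injection of the original token) can be illustrated with a small plain-Java sketch. This is not the ASCIIFoldingFilter implementation and does not use the Lucene token API: the Tok class, the method names, and the two flags are hypothetical, and the folding table covers only a handful of characters. It assumes the usual convention that a position increment of 0 places a token at the same position as the one before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FoldingSketch {

    /** Hypothetical token: term text plus position increment (0 = stacked on the previous token). */
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "(+" + posInc + ")"; }
    }

    /** Folds a few characters only; expandUmlauts switches between "a" and "ae" style output. */
    static String fold(String in, boolean expandUmlauts) {
        StringBuilder out = new StringBuilder(in.length() + 2);
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            switch (c) {
                case '\u00e4': out.append(expandUmlauts ? "ae" : "a"); break; // a with diaeresis
                case '\u00f6': out.append(expandUmlauts ? "oe" : "o"); break; // o with diaeresis
                case '\u00fc': out.append(expandUmlauts ? "ue" : "u"); break; // u with diaeresis
                case '\u00c4': out.append(expandUmlauts ? "Ae" : "A"); break;
                case '\u00d6': out.append(expandUmlauts ? "Oe" : "O"); break;
                case '\u00dc': out.append(expandUmlauts ? "Ue" : "U"); break;
                case '\u00e9': out.append('e'); break; // e with acute, folded either way
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Emits the folded term and, if requested and different, the original at the same position. */
    static List<Tok> analyze(List<String> terms, boolean expandUmlauts, boolean keepOriginal) {
        List<Tok> out = new ArrayList<>();
        for (String term : terms) {
            String folded = fold(term, expandUmlauts);
            out.add(new Tok(folded, 1));
            if (keepOriginal && !folded.equals(term)) {
                out.add(new Tok(term, 0)); // increment 0: same position as the folded term
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold("s\u00fcd", false));
        System.out.println(fold("s\u00fcd", true));
        System.out.println(analyze(Arrays.asList("s\u00fcd", "sud"), false, true));
    }
}
```

With the original token kept in the index, a query that is not folded can still match "süd" exactly, while a folded query matches both spellings.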
