lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Updated: (LUCENE-1343) A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical marks or non-spacing modifiers.
Date Tue, 20 Apr 2010 14:36:52 GMT


Robert Muir updated LUCENE-1343:

    Attachment: LUCENE-1343.patch

Attached is a modified patch (I will upload the new data file too).
* Applied ICU or Unicode copyright headers to any data files where I sourced from their data,
and added a mention to NOTICE.txt to that effect.
* Added some additional punctuation mappings to ensure it contains all ASCIIFoldingFilter
mappings.

As noted previously, there are 5 places where this disagrees with ASCIIFoldingFilter:
U+2033: DOUBLE PRIME (should be two single quotes)
U+2036: REVERSED DOUBLE PRIME (same as above)
U+2038: CARET (folds to CIRCUMFLEX ACCENT, which should be deleted as it is [:Diacritic:])

I plan to commit in a few days if no one objects.

> A replacement for AsciiFoldingFilter that does a more thorough job of removing diacritical
> marks or non-spacing modifiers.
> --------------------------------------------------------------------------------------------------------------------------
>                 Key: LUCENE-1343
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 3.1
>            Reporter: Robert Haschart
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1
>         Attachments: LUCENE-1343.patch, LUCENE-1343.patch, normalizer.jar,,,, utr30.nrm, utr30.nrm
> The ISOLatin1AccentFilter takes Unicode characters that have diacritical marks and replaces
> them with a version of that character with the diacritical mark removed. For example, é becomes
> e. However, another equally valid way of representing an accented character in Unicode is
> to have the unaccented character followed by a non-spacing modifier character (like this:
> e followed by U+0301 COMBINING ACUTE ACCENT). The ISOLatin1AccentFilter doesn't handle the
> accents in decomposed Unicode characters at all. Additionally, there are some instances where
> a word contains what looks like an accented character but is actually considered a separate
> unaccented character, such as Ł, which you want to fold onto its Latin-1 lookalike, L, to
> make searching easier.
> The UnicodeNormalizationFilter can filter out accents and diacritical marks whether they
> occur as composed characters or decomposed characters. It can also handle the cases described
> above, where characters that look like they have diacritics (but don't) are to be folded onto
> the letter that they look like (Ł -> L).
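
The composed/decomposed distinction in the description above can be illustrated with a minimal sketch. It uses only the JDK's java.text.Normalizer, not the code in the attached patch; the class name AccentFoldSketch and the helper stripMarks are hypothetical names chosen for illustration. The sketch decomposes the input to NFD and then deletes combining marks, which handles both representations of é but leaves Ł untouched, since Ł has no canonical decomposition and needs an explicit folding rule.

```java
import java.text.Normalizer;

class AccentFoldSketch {
    // Hypothetical helper: decompose to NFD, then drop every combining mark (\p{M}).
    // This is an illustration only, not the filter from the LUCENE-1343 patch.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Precomposed U+00E9 and decomposed e + U+0301 both reduce to plain "e".
        System.out.println(stripMarks("\u00E9"));
        System.out.println(stripMarks("e\u0301"));
        // U+0141 (L with stroke) has no canonical decomposition, so it is unchanged;
        // folding it to "L" requires an explicit mapping beyond normalization.
        System.out.println(stripMarks("\u0141"));
    }
}
```

Normalization alone therefore covers the first two cases in the description; the lookalike case (Ł -> L) is the part that depends on an extra mapping table.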

