lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-1390) add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
Date Mon, 29 Sep 2008 23:54:44 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-1390:
--------------------------------

    Attachment: ASCIIFoldingFilter.patch

Changes from Andi's version:

# Changed the name of the class to ASCIIFoldingFilter
# Added the Unicode character descriptions to comments on each character
# Added a test class
# Added several other Unicode blocks from which characters are converted to their ASCII equivalents.
 Added characters include digits and punctuation.

I did not provide mappings for characters from the Math block - flattening circled plus, for
example, didn't seem appropriate.

I *did* provide mappings for IPA and two other phonetic character blocks, and I'm not sure
whether this is appropriate.  I was following what seemed to me to be the logic of Andi's
mappings, and of those provided by ISOLatin1AccentFilter: convert characters to the ASCII
characters that *look like* them.  As a result, e.g., the character described as "LATIN SMALL
LETTER TURNED M" (U+026F) from the IPA block is mapped to "m", regardless of its actual
phonetic value.
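
To make the look-alike rule concrete, here is a minimal Java sketch of the idea. It is not the code in the attached patch: the class name AsciiFoldSketch and the methods foldChar and fold are made up for illustration, and only a handful of mappings are shown. Each case carries the Unicode character description, in the spirit of the comments added in the patch.

```java
/** Illustrative sketch only; names and the tiny mapping table are hypothetical. */
final class AsciiFoldSketch {

    /** Returns an ASCII look-alike for c, or c itself when no mapping is listed. */
    static char foldChar(char c) {
        if (c < 0x80) {
            return c; // already ASCII, nothing to fold
        }
        switch (c) {
            case 0x00E0: // [LATIN SMALL LETTER A WITH GRAVE]
            case 0x00E1: // [LATIN SMALL LETTER A WITH ACUTE]
                return 'a';
            case 0x00E9: // [LATIN SMALL LETTER E WITH ACUTE]
                return 'e';
            case 0x026F: // [LATIN SMALL LETTER TURNED M], IPA block, folded by shape
                return 'm';
            case 0x2460: // [CIRCLED DIGIT ONE]
                return '1';
            case 0x2010: // [HYPHEN]
                return '-';
            default:
                // Anything not listed, e.g. CIRCLED PLUS from the Math block,
                // passes through unchanged.
                return c;
        }
    }

    /** Folds every char of s; each mapping here is one char to one char. */
    static String fold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(foldChar(s.charAt(i)));
        }
        return sb.toString();
    }
}
```

With this sketch, fold("café") gives "cafe". A real filter would also need one-to-many mappings (a ligature folding to two letters, for instance), which a char-to-char method like this cannot express.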

There are lots of mappings in there now.  I generated the mappings by Perl scripting over
the contents of the Unicode 5.1 version of UnicodeData.txt from Unicode.org, after grep'ing
e.g. for "LATIN" and "LETTER" or "DIGRAPH", etc., and then moved things around to the appropriate
places by hand.  I guess this is one weakness of this patch: it's large enough that manual
verification is tough.  It's my hope that adding the Unicode character descriptions will allow
for at least improved verifiability.
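
For anyone wanting to re-check the mappings, the selection step can be sketched roughly as follows. This is a Java approximation, not the Perl actually used: the class UnicodeDataSketch and the method caseLines are hypothetical, and the filter shown covers only the "LATIN" plus "LETTER" or "DIGRAPH" case mentioned above. It relies on UnicodeData.txt being semicolon-separated, with the code point in the first field and the character name in the second. Choosing the ASCII target for each character was still done by hand.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the script used to build the patch. */
final class UnicodeDataSketch {

    /**
     * Turns matching UnicodeData.txt lines into switch-case stubs that carry
     * the Unicode character description as a trailing comment.
     */
    static List<String> caseLines(List<String> unicodeDataLines) {
        List<String> out = new ArrayList<>();
        for (String line : unicodeDataLines) {
            // Field 0 is the hex code point, field 1 is the character name.
            String[] fields = line.split(";", -1);
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            String codePoint = fields[0];
            String name = fields[1];
            boolean wanted = name.contains("LATIN")
                    && (name.contains("LETTER") || name.contains("DIGRAPH"));
            // Assumption: only four-digit (BMP) code points are handled,
            // since a Java char literal cannot hold anything larger.
            if (wanted && codePoint.length() == 4) {
                out.add("case '\\u" + codePoint + "': // [" + name + "]");
            }
        }
        return out;
    }
}
```

Each emitted line is only a stub: the return value still has to be filled in and reviewed by eye, which is where the character descriptions help.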

> add ISOLatinAccentFilter and deprecate ISOLatin1AccentFilter
> ------------------------------------------------------------
>
>                 Key: LUCENE-1390
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1390
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>         Environment: any
>            Reporter: Andi Vajda
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: ASCIIFoldingFilter.patch, ISOLatinAccentFilter.java
>
>
> The ISOLatin1AccentFilter removes accents from accented characters in the ISO Latin 1
> character set.
> It does what it does and there is no bug with it.
> It would be nicer, though, if there were a more comprehensive version of this code that
> included not just ISO-Latin-1 (ISO-8859-1) but the entire Latin-1 Supplement and Latin
> Extended-A Unicode blocks.
> See: http://en.wikipedia.org/wiki/Latin-1_Supplement_unicode_block
> See: http://en.wikipedia.org/wiki/Latin_Extended-A_unicode_block
> That way, all languages using roman characters are covered.
> A new class, ISOLatinAccentFilter, is attached. It is intended to supersede
> ISOLatin1AccentFilter, which should be deprecated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

