lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Should ASCIIFoldingFilter be deprecated?
Date Tue, 08 Feb 2011 14:50:49 GMT
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org)
<DSMILEY@mitre.org> wrote:

> I'm skeptical that whatever the difference is is relevant in the scheme of
> things. The cost of keeping it is confusion for users, and more
> code to maintain.
>

it's pretty significant. charfilters are not reusable, and they box every
character and look it up in a hashmap (i made a patch to fix the
reusability, but no one has commented):
https://issues.apache.org/jira/browse/LUCENE-2788

asciifoldingfilter does a huge switch (which still isn't optimal), but
it's way, way faster than mappingcharfilter, especially since it's a
no-op for chars < 0x80.
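
To make the difference concrete, here is a rough sketch of the two lookup styles. It is not Lucene's code: the class name, the method names, and the three-entry mapping are all made up for illustration, and it only shows the shape of the work done per character (boxed hashmap lookup versus an ASCII fast path followed by a switch).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Lucene implementation.
final class FoldingSketch {

    // Tiny stand-in for a mapping table (e-acute, u-umlaut, sharp s).
    private static final Map<Character, String> MAP = new HashMap<>();
    static {
        MAP.put('\u00E9', "e");
        MAP.put('\u00FC', "u");
        MAP.put('\u00DF', "ss");
    }

    // Hashmap style: every char, ASCII or not, is autoboxed to a
    // Character and hashed before we know whether it needs folding.
    static String foldWithMap(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            String replacement = MAP.get(c); // autoboxing happens here
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Switch style: ASCII chars are copied through untouched, and only
    // non-ASCII chars reach the switch.
    static String foldWithSwitch(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c < 0x80) {          // ASCII fast path: nothing to fold
                out.append(c);
                continue;
            }
            switch (c) {
                case '\u00E9': out.append('e');  break;
                case '\u00FC': out.append('u');  break;
                case '\u00DF': out.append("ss"); break;
                default:       out.append(c);    // unknown: keep as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "caf\u00E9 stra\u00DFe";
        System.out.println(foldWithMap(input));
        System.out.println(foldWithSwitch(input));
    }
}
```

Both methods give the same result on this input; the point is only that the switch version does no boxing or hashing, and does nothing at all beyond a comparison for plain ASCII text.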

icufoldingfilter precompiles a recursively decomposed trie, so its
lookup is a unicode folded trie
(icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad
slower than asciifoldingfilter but it also incorporates case folding
and unicode normalization: neither asciifoldingfilter nor
mappingcharfilter will properly fold
http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=,
because there is no composed form for Z + combining cedilla, but
icufoldingfilter will.
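
The missing composed form can be checked with nothing but the JDK. The sketch below is an assumption-laden illustration, not what icufoldingfilter does internally: it uses java.text.Normalizer to show that NFC leaves Z + combining cedilla as two code units, and that decomposing first and then dropping combining marks (a crude stand-in for normalization-aware folding, with no case folding) does remove the accent. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical illustration using only the JDK; not ICU's folding logic.
final class CedillaSketch {

    // Compose as far as Unicode allows (NFC).
    static String nfc(String in) {
        return Normalizer.normalize(in, Normalizer.Form.NFC);
    }

    // Decompose (NFD), then strip nonspacing combining marks.
    static String stripMarks(String in) {
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // "Abu Zaby" with u-macron precomposed and Z followed by U+0327.
        String name = "Ab\u016B Z\u0327aby";

        // NFC cannot merge Z and the cedilla, so the length is unchanged.
        System.out.println(nfc(name).length() == name.length());

        // A table keyed on single precomposed chars never sees a
        // "Z with cedilla" char; decomposition handles it uniformly.
        System.out.println(stripMarks(name)); // Abu Zaby
    }
}
```

A per-character table or switch keyed on precomposed characters has nothing to match here, which is why the normalization step matters for input like this.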
