lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Woodward" <d...@loc.gov>
Subject Unicode Normalization
Date Wed, 11 Apr 2007 20:00:40 GMT
Hi.

I have encountered a problem searching in my application because of inconsistant unicode normalization
forms in the corpus (and the queries). I would like to normalize to form NFKD in an analyzer
(I think). I was thinking about creating a filter similar to the lowercasefilter that would
do the unicode normalization. Then I will add that filter to my existing snowball analyzer.
I am about to embark on creating said analyzer/filter using the ICU (http://icu-project.org/)
icu4j jar.

Is this already accounted for in standard lucene somewhere and I'm just missing it?

Anything similar out there?

Any other advice?

Thanks,
Dave Wooodward
Library of Congress


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message