lucene-java-user mailing list archives

From "Patrick Turcotte" <pat...@gmail.com>
Subject Re: whats the correct way to do normalisation?
Date Mon, 06 Nov 2006 16:03:12 GMT
Hi,

Did you take a look at ISOLatin1AccentFilter?
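
One caveat: as far as I know, that filter strips the accent (ä becomes a) rather than expanding it to ae. If you need the ae form, a rough sketch of the mapping in plain Java might look like the following. UmlautNormalizer and normalize are names made up for illustration, not Lucene classes, and the list of characters is only an assumption about what you want to cover:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: expands German umlauts and sharp s
// into ASCII digraphs, e.g. a-umlaut -> "ae". The character list is an assumption.
final class UmlautNormalizer {
    private static final Map<Character, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put('\u00E4', "ae"); // a-umlaut
        EXPANSIONS.put('\u00F6', "oe"); // o-umlaut
        EXPANSIONS.put('\u00FC', "ue"); // u-umlaut
        EXPANSIONS.put('\u00C4', "Ae"); // capital A-umlaut
        EXPANSIONS.put('\u00D6', "Oe"); // capital O-umlaut
        EXPANSIONS.put('\u00DC', "Ue"); // capital U-umlaut
        EXPANSIONS.put('\u00DF', "ss"); // sharp s
    }

    private UmlautNormalizer() {
    }

    // Returns the term with every mapped character expanded;
    // all other characters are copied through unchanged.
    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length() + 4);
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            String expansion = EXPANSIONS.get(c);
            if (expansion != null) {
                out.append(expansion);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

You could call UmlautNormalizer.normalize(t.termText()) from your filter's next(). The token's start and end offsets should probably stay as they are, since they refer to positions in the original text and not to the length of the rewritten term.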

Patrick

On 11/6/06, hans meiser <fischauto333@yahoo.de> wrote:
>
> Hi,
>
> I use Lucene to index documents in three languages (English,
> German and French). I want to normalize certain characters,
> such as umlauts, for example ä -> ae.
> I did it in the following way:
> New Analyzer:
> import java.io.File;
> import java.io.IOException;
> import java.io.Reader;
> import java.util.Set;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>
> public class SpecialCharsAnalyzer extends StandardAnalyzer {
>   public SpecialCharsAnalyzer() {
>   }
>
>   public SpecialCharsAnalyzer(Set stopWords) {
>     super(stopWords);
>   }
>
>   public SpecialCharsAnalyzer(String[] stopWords) {
>     super(stopWords);
>   }
>
>   public SpecialCharsAnalyzer(File stopwords) throws IOException {
>     super(stopwords);
>   }
>
>   public SpecialCharsAnalyzer(Reader stopwords) throws IOException {
>     super(stopwords);
>   }
>
>   // Wrap the standard token stream with the character-normalizing filter.
>   @Override
>   public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream ts = super.tokenStream(fieldName, reader);
>     ts = new SpecialCharacterFilter(ts);
>     return ts;
>   }
> }
> Is SpecialCharsAnalyzer::tokenStream implemented correctly?
>
> New Filter:
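> import java.io.IOException;
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> public class SpecialCharacterFilter extends TokenFilter {
>   public SpecialCharacterFilter(TokenStream input) {
>     super(input);
>   }
>
>   @Override
>   public Token next() throws IOException {
>     Token t = input.next();
>     if (t == null)
>       return null;
>     String str = t.termText();
>     if (str.indexOf("ä") != -1) {
>       // Literal replacement; no regular expression is needed.
>       str = str.replace("ä", "ae");
>       // Offsets refer to the original text, so they stay unchanged
>       // even though the term is now longer; the token type is kept too.
>       t = new Token(str, t.startOffset(), t.endOffset(), t.type());
>     }
>     return t;
>   }
> }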
> Is SpecialCharacterFilter::next implemented correctly
> for the "ä" case?
>
> Is this the correct way to do normalisation?
> Thanks
