lucene-java-user mailing list archives

From Chitra <chithu.r...@gmail.com>
Subject ClassicAnalyzer Behavior on accent character
Date Thu, 19 Oct 2017 13:37:36 GMT
Hi,
              I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was
indexed as "er l n" — several characters were dropped during indexing.

Here is my code:

    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
    {
        final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
        src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

        TokenStream tok = new ClassicFilter(src);
        tok = new LowerCaseFilter(getVersion(), tok);
        tok = new StopFilter(getVersion(), tok, stopwords);
        tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search

        return new Analyzer.TokenStreamComponents(src, tok)
        {
            @Override
            protected void setReader(final Reader reader) throws IOException
            {
                src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }
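While debugging, I also checked how Java itself classifies the characters in the term, since a tokenizer can only emit characters its grammar treats as token characters, and the ASCIIFoldingFilter only sees what the tokenizer has already kept. This is just a quick stdlib diagnostic I put together (the class name `CharCategoryCheck` is mine, not from Lucene):

```java
// Quick diagnostic: print the Unicode category of each code point in the
// indexed term, to see which characters are ordinary letters and which
// are not. Plain JDK, no Lucene dependency.
public class CharCategoryCheck {
    public static void main(String[] args) {
        String term = "ⒶeŘꝋꝒɫⱯŋɇ";
        for (int i = 0; i < term.length();
             i += Character.charCount(term.codePointAt(i))) {
            int cp = term.codePointAt(i);
            System.out.printf("U+%04X  isLetter=%b  category=%d%n",
                              cp, Character.isLetter(cp), Character.getType(cp));
        }
    }
}
// e.g. 'Ⓐ' (U+24B6) reports isLetter=false — it is a symbol (category
// OTHER_SYMBOL), not a letter, so it is unlikely to survive tokenization.
```

The others ('ꝋ', 'Ꝓ', 'Ɐ', 'ɇ', …) do report as letters, which is why I am puzzled that they are dropped as well.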



Am I missing anything? Is this the expected behavior for my input, or is
there some reason behind this abnormal behavior?

-- 
Regards,
Chitra
