lucene-java-user mailing list archives

From Michael Sokolov <msoko...@gmail.com>
Subject Re: ClassicAnalyzer Behavior on accent character
Date Tue, 28 Nov 2017 23:26:59 GMT
That's expected. Characters that aren't letters are not mapped to letters, which is correct.
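
To see where the characters go, compare the tokenizers before any folding happens: ClassicTokenizer's grammar predates the UAX#29 word-break rules and treats many Unicode letters and letter-like symbols as token breaks, so ASCIIFoldingFilter never gets a chance to map them. Here is a minimal sketch (assuming a Lucene 5+ classpath, where tokenizers take no Version/Reader constructor arguments; the exact tokens you get depend on your version's grammar and folding tables):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class FoldingDemo {

    // Runs text through tokenizer + ASCIIFoldingFilter and prints each token.
    static void dump(String label, Tokenizer tokenizer, String text) throws IOException {
        tokenizer.setReader(new StringReader(text));
        try (TokenStream ts = new ASCIIFoldingFilter(tokenizer)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            StringBuilder line = new StringBuilder(label + ":");
            while (ts.incrementToken()) {
                line.append(" [").append(term).append(']');
            }
            ts.end();
            System.out.println(line);
        }
    }

    public static void main(String[] args) throws IOException {
        String input = "ⒶeŘꝋꝒɫⱯŋɇ";
        // ClassicTokenizer's pre-UAX#29 grammar treats codepoints it does not
        // recognize as letters (e.g. 'Ⓐ', U+24B6) as token breaks, so the
        // folding filter never sees them.
        dump("classic ", new ClassicTokenizer(), input);
        // StandardTokenizer follows UAX#29 word-break rules and classifies
        // far more Unicode codepoints as letters; whether each one then folds
        // to ASCII depends on the folding filter's mapping table.
        dump("standard", new StandardTokenizer(), input);
    }
}

In the classic chain below, only the fragments the grammar recognizes as letters survive, which is why you end up with "er", "l" and "n" after lowercasing and folding.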

On Oct 19, 2017 9:38 AM, "Chitra" <chithu.r111@gmail.com> wrote:

> Hi,
>               I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was
> indexed as "er l n"; some characters were dropped during indexing.
>
> Here is my code:
>
> > protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> >     {
> >         final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
> >         src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >
> >         TokenStream tok = new ClassicFilter(src);
> >         tok = new LowerCaseFilter(getVersion(), tok);
> >         tok = new StopFilter(getVersion(), tok, stopwords);
> >         tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
> >
> >         return new Analyzer.TokenStreamComponents(src, tok)
> >         {
> >             @Override
> >             protected void setReader(final Reader reader) throws IOException
> >             {
> >                 src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
> >                 super.setReader(reader);
> >             }
> >         };
> >     }
>
>
>
> Am I missing anything? Is this the expected behavior for my input, or is
> there some other reason behind it?
>
> --
> Regards,
> Chitra
>
