lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: ClassicAnalyzer Behavior on accent character
Date Thu, 19 Oct 2017 18:40:04 GMT
Easy: don't use ClassicTokenizer, use StandardTokenizer instead. ClassicTokenizer keeps the old pre-3.1 StandardTokenizer grammar, which predates the Unicode word-break rules, so it discards characters it doesn't recognize; since 3.1, StandardTokenizer tokenizes with the UAX#29 word-break algorithm and handles the full Unicode range.
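
For reference, here is one way the swap could look: a minimal, untested sketch against the same Lucene 4.x-style API as the quoted code below. It assumes the method lives in the same Analyzer subclass as the original, so getVersion() and the stopwords field come from that surrounding class.

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
    {
        // StandardTokenizer follows the UAX#29 word-break rules, so the
        // Unicode letters that ClassicTokenizer drops are kept as token text.
        final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
        src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

        TokenStream tok = new StandardFilter(getVersion(), src);
        tok = new LowerCaseFilter(getVersion(), tok);
        tok = new StopFilter(getVersion(), tok, stopwords);
        tok = new ASCIIFoldingFilter(tok); // fold accented forms to ASCII where an equivalent exists

        return new Analyzer.TokenStreamComponents(src, tok)
        {
            @Override
            protected void setReader(final Reader reader) throws IOException
            {
                src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
                super.setReader(reader);
            }
        };
    }

Note that ASCIIFoldingFilter only folds characters that have a reasonable ASCII equivalent, so some exotic codepoints may still pass through unfolded; the point is that the tokenizer no longer throws them away.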

On Thu, Oct 19, 2017 at 9:37 AM, Chitra <chithu.r111@gmail.com> wrote:
> Hi,
>               I indexed the term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane), and it was
> indexed as "er l n"; some characters were dropped during indexing.
>
> Here is my code
>
> protected Analyzer.TokenStreamComponents createComponents(final String fieldName, final Reader reader)
> {
>     final ClassicTokenizer src = new ClassicTokenizer(getVersion(), reader);
>     src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>
>     TokenStream tok = new ClassicFilter(src);
>     tok = new LowerCaseFilter(getVersion(), tok);
>     tok = new StopFilter(getVersion(), tok, stopwords);
>     tok = new ASCIIFoldingFilter(tok); // to enable accent-insensitive search
>
>     return new Analyzer.TokenStreamComponents(src, tok)
>     {
>         @Override
>         protected void setReader(final Reader reader) throws IOException
>         {
>             src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>             super.setReader(reader);
>         }
>     };
> }
>
> Am I missing anything? Is this expected behavior for my input, or is
> there some reason behind it?
>
> --
> Regards,
> Chitra


