lucene-java-user mailing list archives

From Ahmet Arslan <iori...@yahoo.com.INVALID>
Subject Re: Preprocess input text before tokenizing
Date Thu, 23 Jun 2016 18:28:09 GMT
Hi,

An analyzer can have zero or more CharFilters, and they are the way to manipulate text before it reaches the tokenizer.
I think Analyzer.initReader is the method you want to override to plug in char filters. For an example, see:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/morfologik/src/java/org/apache/lucene/analysis/uk/UkrainianMorfologikAnalyzer.java
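
For your case, a minimal sketch could look like the one below. It assumes Lucene 5 or later (single-argument createComponents, no-argument StandardTokenizer constructor) with the analyzers-common module on the classpath. The class name SeparatorMappingAnalyzer and the two mappings ("_" to a space, "'" to nothing) are made up for illustration; substitute your own characters.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical analyzer: rewrites characters before StandardTokenizer sees them.
final class SeparatorMappingAnalyzer extends Analyzer {

    // Illustrative mappings only; these characters are assumptions, not from the thread.
    private static final NormalizeCharMap CHAR_MAP;
    static {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("_", " "); // treat '_' as a word separator
        builder.add("'", "");  // drop apostrophes completely
        CHAR_MAP = builder.build();
    }

    // Called by Analyzer before the reader is handed to the tokenizer,
    // so there is no need to subclass the tokenizer or touch setReader.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new MappingCharFilter(CHAR_MAP, reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        return new TokenStreamComponents(tokenizer);
    }
}
```

With this in place you keep the stock StandardTokenizer. MappingCharFilter also corrects token offsets back to the original text, which matters if you use highlighting.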

Ahmet

On Thursday, June 23, 2016 6:47 PM, Jaime <j.pardos@estructure.es> wrote:
Hello,

I want to change the input text before tokenizing. I think I just need 
to use some characters as word separators, and maybe remove some others 
completely.

I was planning to use MappingCharFilterFactory to replace some chars 
with " " and others with "", but I feel like I'm not on the right track.

First, I've implemented a custom analyzer to use my custom tokenizer. My 
idea was to inherit from StandardTokenizer and call 
MappingCharFilterFactory.create(reader) from within setReader.

However, setReader is final, so I can't override it.

Is there a better way to do this?
In any case, how should I use MappingCharFilter if I really need it?


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
