lucene-solr-user mailing list archives

From Tobias Ibounig <t.ibou...@netconomy.net>
Subject RE: Antw: Re: Correct order of mappinCharFilter, Tokenizer and GermanStemFilter
Date Fri, 19 Jul 2019 09:54:37 GMT
Hi Doris,

Are you sure you want 'ä' --> 'ae'?
If you check, the German stemmers usually substitute ä --> a (to "reduce over stemming"
[1]), so you would be working against the stemmer's logic here.

If you take a look at the GermanNormalizationFilter, it even substitutes 'ae' with 'a' [2].
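
To make the interaction concrete, here is a rough Python sketch. It is not the Lucene code: german_normalize is a hypothetical helper that only approximates the rules documented for GermanNormalizationFilter (ß -> ss; ä/ö/ü -> a/o/u; ae/oe -> a/o; ue -> u unless it follows a vowel or q), and map_ae stands in for the custom 'ä' --> 'ae' mapping. It assumes lowercased input.

```python
import re


def map_ae(text):
    # Stand-in for the custom char mapping discussed in this thread: ä -> ae.
    return text.replace("ä", "ae")


def german_normalize(token):
    # Simplified approximation of the documented GermanNormalizationFilter
    # rules; the real filter is a single-pass state machine, this is not.
    token = token.replace("ß", "ss")
    token = token.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    token = token.replace("ae", "a").replace("oe", "o")
    # 'ue' becomes 'u' only when it does not follow a vowel or 'q'.
    token = re.sub(r"(?<![aeiouyq])ue", "u", token)
    return token


with_mapping = german_normalize(map_ae("häuser"))  # "haeuser" -> "hauser"
without_mapping = german_normalize("häuser")       # "hauser"
print(with_mapping, without_mapping)
```

Under these assumptions both paths end at the same token, so a normalization step placed after the mapping would simply undo the 'ae' expansion.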

I would recommend using the default available tools unless you have a specific requirement
that rules them out.

All the Best
Tobias

[1] https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanStemmer.java#L164

[2] https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/de/GermanNormalizationFilter.java#L31

-----Original Message-----
From: Doris Peter <Doris.Peter@bsb-muenchen.de> 
Sent: Friday, 19 July 2019 11:13
To: solr-user@lucene.apache.org
Subject: Antw: Re: Correct order of mappinCharFilter, Tokenizer and GermanStemFilter

Thanks for the answer. I examined the ICUFoldingFilterFactory, but it seems to me that it
can't be customized the way I would need.
We have some special foldings, e.g. ä -> ae. In the CharFilter, I can add these to the
mapping file: mapping="mapping-FoldToASCII.txt"
There seems to be nothing like this mapping file in the ICUFoldingFilter? Exclusion alone is
not enough.

>>> Shawn Heisey <apache@elyograg.org> 7/18/2019 3:08 PM >>>
On 7/18/2019 3:01 AM, Doris Peter wrote:
> So, the MappingCharFilter seems to be executed first, no matter which position it has in
> the configuration?

CharFilters are always executed first.  Then one Tokenizer, then Filters.  This will always
be the case, even if you order the config so that the Tokenizer and one or more Filters are
listed before CharFilter entries.  It's one of the quirks of analysis definitions.
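
As an illustration only, the ordering rule above can be modelled with a small Python sketch. This is a toy model, not Solr code: run_chain, the (kind, function) config tuples and the sample components are all hypothetical names invented for this example.

```python
def run_chain(config, text):
    # Group components by kind, ignoring where each one sits in the config.
    char_filters = [fn for kind, fn in config if kind == "charFilter"]
    tokenizers = [fn for kind, fn in config if kind == "tokenizer"]
    token_filters = [fn for kind, fn in config if kind == "filter"]
    if len(tokenizers) != 1:
        raise ValueError("exactly one tokenizer is expected")

    # 1. CharFilters rewrite the raw text.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. The single Tokenizer splits the text into tokens.
    tokens = tokenizers[0](text)
    # 3. Filters then run on each token, in their listed order.
    for token_filter in token_filters:
        tokens = [token_filter(token) for token in tokens]
    return tokens


# The char filter is deliberately listed last; it still runs first.
config = [
    ("tokenizer", str.split),
    ("filter", str.lower),
    ("charFilter", lambda s: s.replace("ä", "ae")),
]
print(run_chain(config, "Häuser Bär"))  # ['haeuser', 'baer']
```

Reordering the entries in config changes nothing here, which is the behaviour described above: only the relative order within each kind matters.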

The fix for this would be to see if there is a regular Filter that does what the CharFilter
you're using does and use that filter instead.

If it were me, I would likely use ICUFoldingFilterFactory rather than MappingCharFilterFactory.
 The ICU analysis components do require installing contrib jars into Solr.

https://lucene.apache.org/solr/guide/8_1/filter-descriptions.html#icu-folding-filter

Thanks,
Shawn

