lucene-solr-user mailing list archives

From "Doris Peter" <Doris.Pe...@bsb-muenchen.de>
Subject Correct order of MappingCharFilter, Tokenizer and GermanStemFilter
Date Thu, 18 Jul 2019 09:01:16 GMT
Hi, 

another problem with the stemming:

Most of our texts are in German, so we use the GermanStemFilterFactory. We also use
MappingCharFilterFactory, which maps, for example, ä to ae.

Of course we still want the stemmer to turn, for example, 'häuser' into 'haus', which the
GermanStemFilterFactory should do according to the documentation.

At the moment, my configuration looks like this:

    <fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true"
               termPayloads="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.GermanStemFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
          encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
        <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
             preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
             catenateWords="1" catenateNumbers="1" catenateAll="1"
             generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
             types="wdfftypes.txt" />
      </analyzer>
    </fieldtype>

So the stemmer is configured before the CharFilter.

But the Solr Analyzer says:

    Stage  text     raw_bytes               start  end  positionLength  type  termFrequency  position  keyword  payload
    MCF    haeuser
    WT     haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1
    LCF    haeuser  [68 61 65 75 73 65 72]  0      6    1               word  1              1
    GSF    haeu     [68 61 65 75]           0      6    1               word  1              1         false
    DPTF   haeu     [68 61 65 75]           0      6    1               word  1              1         false
    WDGF   haeu     [68 61 65 75]           0      6    1               word  1              1         false

So the MappingCharFilter seems to be executed first, no matter where it is placed in
the configuration?

The Solr documentation also says it should be placed before the Tokenizer:
https://lucene.apache.org/solr/guide/7_6/charfilterfactories.html
"CharFilters can be chained like Token Filters and placed in front of a Tokenizer."
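
Assuming char filters always run on the raw input before the tokenizer, wherever they appear inside the analyzer element, the chain above is effectively evaluated in the following order. This is only a sketch of the effective order, not a fix. All names and attributes are copied unchanged from the configuration above:

```xml
<fieldtype name="text_ocr" class="solr.TextField" termPositions="true" termVectors="true"
           termPayloads="true">
  <analyzer>
    <!-- 1. Char filter: rewrites the character stream first (ä becomes ae) -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <!-- 2. Tokenizer: splits the already mapped text into tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- 3. Token filters: run in the order they are declared -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The stemmer therefore sees 'haeuser', never 'häuser' -->
    <filter class="solr.GermanStemFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="⚑"
            encoder="org.mdz.search.solrocr.lucene.byteoffset.ByteOffsetEncoder" />
    <filter class="solr.WordDelimiterGraphFilterFactory" protected="protectedword.txt"
            preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            generateWordParts="1" generateNumberParts="1" stemEnglishPossessive="1"
            types="wdfftypes.txt" />
  </analyzer>
</fieldtype>
```

This order matches the stages in the analysis output above: MCF, WT, LCF, GSF, DPTF, WDGF.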

But if the word häuser is changed to haeuser, the stemmer no longer stems it correctly (it yields 'haeu' instead of 'haus') :-/

Is there a way to solve this problem?

Thanks a lot, Doris


