lucene-solr-user mailing list archives

From <jimi.hulleg...@svensktnaringsliv.se>
Subject ICUFoldingFilter with swedish characters, and tokens with the keyword attribute?
Date Tue, 10 Jan 2017 06:02:43 GMT
Hi,

I wasn't happy with how our current Solr configuration handled diacritics (like 'é') in the
text and in search queries, since it simply treated a letter with a diacritic as a distinct
letter. I.e. 'é' didn't match 'e', and vice versa. Except for a handful of rare words where the
diacritical mark in 'é' actually changes the meaning of the word, it is mostly used in names of
people and places, and the expected behavior when searching is to not have to type the
diacritics and still get the expected results (like searching for 'Penelope Cruz' and getting
hits for 'Penélope Cruz').

When reading online about how to handle diacritics in Solr, it seems that the general recommendation,
when no language-specific solution exists that handles this, is to use the ICUFoldingFilter.
However, this filter doesn't really come with a lot of documentation, and doesn't seem to have
any configuration options at all (at least none that are documented).

So what I ended up doing was simply to add the ICUFoldingFilterFactory in the middle
of the existing analyzer chain, like this:

<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.SwedishLightStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


But that didn't really give me the results I want. For example, using the analysis debug tool
I see that the text 'café åäö' becomes 'cafe caf aao'. And there are two problems with
that result:

1. It doesn't respect the keyword attribute
2. It folds the Swedish characters 'åäö' into 'aao'

The disregard of the keyword attribute is bad enough, but the mangling of the Swedish language
is really a show stopper for us. The Swedish language doesn't consider 'ö', for example,
to be the letter 'o' with two diacritical dots above it, just as 'Q' isn't considered to be
the letter 'O' with a diacritical "squiggly line" at the bottom. So when handling Swedish
text, these characters ('åäöÅÄÖ') shouldn't be folded, because then there will be too
many "collisions".

For example, when searching for 'påstå' ('claim'), one doesn't want hits about 'pasta' (you
guessed it, it means 'pasta'), just as one doesn't want to get hits about 'aga' ('corporal
punishment, usually against children') when searching for 'äga' ('to own'). Or even worse,
when searching for 'höra' ('to hear'), one most likely doesn't want hits about 'hora' ('prostitute').
And I can go on... :)
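
As an illustration only, the collisions above can be reproduced outside Solr with a few lines of plain Python. This is not Solr or ICU code, and the names `fold_all` and `fold_except_swedish` are made up for this sketch. It assumes that stripping Unicode combining marks is a rough stand-in for the accent-removal part of ICU folding (the real filter does more, such as case folding), and it shows what folding would look like if 'åäöÅÄÖ' were left alone:

```python
import unicodedata

# Letters that Swedish treats as letters in their own right, not as accented a/o.
SWEDISH_LETTERS = frozenset("åäöÅÄÖ")


def fold_all(text: str) -> str:
    """Strip every combining mark (a rough approximation of unrestricted folding)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_except_swedish(text: str) -> str:
    """Strip combining marks from every character except å, ä and ö."""
    # Compose first so that å/ä/ö are single code points that can be recognized.
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch if ch in SWEDISH_LETTERS else fold_all(ch)
        for ch in composed
    )


for word in ["café", "påstå", "äga", "höra", "Penélope"]:
    print(word, "->", fold_all(word), "|", fold_except_swedish(word))
```

With unrestricted folding, 'påstå' collapses into 'pasta', while the restricted variant still folds 'café' and 'Penélope' but keeps the Swedish letters intact.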

So, is there a way for us to make the ICUFoldingFilter work in a better way? I.e. configure
it to respect the keyword attribute and ignore the 'åäö' characters when folding, but otherwise
fold all diacritical characters into their non-diacritical form. Or how would you recommend
that we configure our analyzer chain to accomplish this?

Regards
/Jimi
