lucene-solr-user mailing list archives

From Szűcs Roland <szucs.rol...@bookandwalk.hu>
Subject Re: Problem with solr suggester in case of non-ASCII characters
Date Tue, 30 Jul 2019 18:11:05 GMT
Hi Furkan,

Thanks for the suggestion. I always forget the most effective debugging tool:
the analysis page.

It turned out that "Jó" was a stop word and was eliminated during text
analysis. I will create a new field type without stop word removal and use
it as the suggester's analyzer field type, like this:
<str name="suggestAnalyzerFieldType">short_text_hu_without_stop_removal</str>
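
A minimal sketch of what that field type could look like, assuming it simply
mirrors the short_text_hu definition from my original message with the
StopFilterFactory lines removed (only the name comes from the line above;
everything else is copied from the existing type and not yet tested):

```xml
<!-- Sketch only: same analysis chain as short_text_hu, minus stop word removal -->
<fieldType name="short_text_hu_without_stop_removal" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- StopFilterFactory intentionally omitted so short terms like "Jó" survive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Since buildOnStartup and buildOnCommit are both false in my config, the
suggester index would presumably need to be rebuilt before the change shows
up, for example with a request to /suggesthandler that includes
suggest.build=true.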

Thanks again

Roland

Furkan KAMACI <furkankamaci@gmail.com> wrote (on Tue, Jul 30, 2019,
16:17):

> Hi Roland,
>
> Could you check the Analysis tab (
> https://lucene.apache.org/solr/guide/8_1/analysis-screen.html) and tell
> us how the term is analyzed for both query and index?
>
> Kind Regards,
> Furkan KAMACI
>
> On Tue, Jul 30, 2019 at 4:50 PM Szűcs Roland <szucs.roland@bookandwalk.hu>
> wrote:
>
> > Hi All,
> >
> > I have an author suggester (searchcomponent and the related request
> > handler) defined in solrconfig:
> > <searchComponent name="suggest" class="solr.SuggestComponent">
> >     <!-- All suggester components must have different file paths to
> >     avoid write lock issues -->
> >     <lst name="suggester">
> >       <str name="name">author</str>
> >       <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
> >       <str name="dictionaryImpl">DocumentDictionaryFactory</str>
> >       <str name="field">BOOK_productAuthor</str>
> >       <str name="suggestAnalyzerFieldType">short_text_hu</str>
> >       <str name="indexPath">suggester_infix_author</str>
> >       <str name="buildOnStartup">false</str>
> >       <str name="buildOnCommit">false</str>
> >       <str name="minPrefixChars">2</str>
> >     </lst>
> > </searchComponent>
> >
> > <requestHandler name="/suggesthandler" class="solr.SearchHandler"
> > startup="lazy" >
> > <lst name="defaults">
> >   <str name="suggest">true</str>
> >   <str name="suggest.count">10</str>
> >   <str name="suggest.dictionary">author</str>
> > </lst>
> > <arr name="components">
> >   <str>suggest</str>
> > </arr>
> > </requestHandler>
> >
> > The author field has only minimal text processing at query and index
> > time, based on the following definition:
> > <fieldType name="short_text_hu" class="solr.TextField"
> > positionIncrementGap="100" multiValued="true">
> >     <analyzer type="index">
> >       <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >       <tokenizer class="solr.ClassicTokenizerFactory"/>
> >       <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> > ignoreCase="true"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >     </analyzer>
> >     <analyzer type="query">
> >       <tokenizer class="solr.ClassicTokenizerFactory"/>
> >       <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
> > ignoreCase="true"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >     </analyzer>
> >   </fieldType>
> >   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> > docValues="true"/>
> >   <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
> > docValues="true" multiValued="true"/>
> >   <fieldType name="text_ar" class="solr.TextField"
> > positionIncrementGap="100">
> >     <analyzer>
> >       <tokenizer class="solr.StandardTokenizerFactory"/>
> >       <filter class="solr.LowerCaseFilterFactory"/>
> >       <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"
> > ignoreCase="true"/>
> >       <filter class="solr.ArabicNormalizationFilterFactory"/>
> >       <filter class="solr.ArabicStemFilterFactory"/>
> >     </analyzer>
> >   </fieldType>
> >
> > When I use queries with only ASCII characters, the results are correct:
> > "Al":{
> > "term":"<b>Al</b>exandre Dumas", "weight":0, "payload":""}
> >
> > When I try it with a Hungarian author name containing a special character:
> > "Jó":"author":{
> > "Jó":{ "numFound":0, "suggestions":[]}}
> >
> > When I try it with three letters, it works again:
> > "Józ":"author":{
> > "Józ":{ "numFound":10, "suggestions":[{ "term":"Bajza <b>Józ</b>sef",
> > "weight":0, "payload":""}, { "term":"Eötvös <b>Józ</b>sef", "weight":0,
> > "payload":""}, { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
> > { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
> > { "term":"<b>Józ</b>sef Attila", "weight":0, "payload":""}..
> >
> > Any idea how it can happen that a longer string has more matches than a
> > shorter one? It is inconsistent. What can I do to fix it, as it would
> > result in a poor customer experience?
> > Users would feel that sometimes they need 2 and sometimes 3 characters to
> > get suggestions.
> >
> > Thanks in advance,
> > Roland
> >
>
