lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Problem with solr suggester in case of non-ASCII characters
Date Wed, 31 Jul 2019 12:23:48 GMT
Roland:

Have you considered just not using stopwords anywhere? Largely they’re a holdover
from a long time ago when every byte counted. Plus using stopwords has “interesting”
issues with things like highlighting and phrase queries and the like.

Sure, not using stopwords will make your index larger, but so will a copyfield…

Your call of course, but stopwords are over-used IMO.

I’m stealing Walter Underwood’s thunder here ;)

Best,
Erick

> On Jul 30, 2019, at 2:11 PM, Szűcs Roland <szucs.roland@bookandwalk.hu> wrote:
> 
> Hi Furkan,
> 
> Thanks for the suggestion. I always forget the most effective debugging
> tool: the analysis page.
> 
> It turned out that "Jó" is a stop word, so it was eliminated during text
> analysis. I will create a new field type without stop word removal and
> use it for the suggester like this:
> <str name="suggestAnalyzerFieldType">short_text_hu_without_stop_removal</str>
> 
> Thanks again
> 
> Roland
> 
> Furkan KAMACI <furkankamaci@gmail.com> wrote (on Tue, Jul 30, 2019,
> 16:17):
> 
>> Hi Roland,
>> 
>> Could you check the Analysis tab (
>> https://lucene.apache.org/solr/guide/8_1/analysis-screen.html) and tell us
>> how the term is analyzed at both query and index time?
>> 
>> Kind Regards,
>> Furkan KAMACI
>> 
>> On Tue, Jul 30, 2019 at 4:50 PM Szűcs Roland <szucs.roland@bookandwalk.hu>
>> wrote:
>> 
>>> Hi All,
>>> 
>>> I have an author suggester (a search component and the related request
>>> handler) defined in solrconfig.xml:
>>> <searchComponent name="suggest" class="solr.SuggestComponent">
>>>    <!-- All suggester components must have different file paths to avoid
>>>    write lock issues -->
>>>    <lst name="suggester">
>>>      <str name="name">author</str>
>>>      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
>>>      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>>>      <str name="field">BOOK_productAuthor</str>
>>>      <str name="suggestAnalyzerFieldType">short_text_hu</str>
>>>      <str name="indexPath">suggester_infix_author</str>
>>>      <str name="buildOnStartup">false</str>
>>>      <str name="buildOnCommit">false</str>
>>>      <str name="minPrefixChars">2</str>
>>>    </lst>
>>> </searchComponent>
>>> 
>>> <requestHandler name="/suggesthandler" class="solr.SearchHandler"
>>> startup="lazy" >
>>> <lst name="defaults">
>>>  <str name="suggest">true</str>
>>>  <str name="suggest.count">10</str>
>>>  <str name="suggest.dictionary">author</str>
>>> </lst>
>>> <arr name="components">
>>>  <str>suggest</str>
>>> </arr>
>>> </requestHandler>
>>> 
>>> The author field has only minimal text processing at query and index
>>> time, based on the following definition:
>>> <fieldType name="short_text_hu" class="solr.TextField"
>>> positionIncrementGap="100" multiValued="true">
>>>    <analyzer type="index">
>>>      <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>      <tokenizer class="solr.ClassicTokenizerFactory"/>
>>>      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
>>> ignoreCase="true"/>
>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>    </analyzer>
>>>    <analyzer type="query">
>>>      <tokenizer class="solr.ClassicTokenizerFactory"/>
>>>      <filter class="solr.StopFilterFactory" words="stopwords_hu.txt"
>>> ignoreCase="true"/>
>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>    </analyzer>
>>>  </fieldType>
>>>  <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>>> docValues="true"/>
>>>  <fieldType name="strings" class="solr.StrField" sortMissingLast="true"
>>> docValues="true" multiValued="true"/>
>>>  <fieldType name="text_ar" class="solr.TextField"
>>> positionIncrementGap="100">
>>>    <analyzer>
>>>      <tokenizer class="solr.StandardTokenizerFactory"/>
>>>      <filter class="solr.LowerCaseFilterFactory"/>
>>>      <filter class="solr.StopFilterFactory" words="lang/stopwords_ar.txt"
>>> ignoreCase="true"/>
>>>      <filter class="solr.ArabicNormalizationFilterFactory"/>
>>>      <filter class="solr.ArabicStemFilterFactory"/>
>>>    </analyzer>
>>>  </fieldType>
>>> 
>>> When I use queries with only ASCII characters, the results are correct:
>>> "Al":{
>>> "term":"<b>Al</b>exandre Dumas", "weight":0, "payload":""}
>>> 
>>> When I try it with a Hungarian author name containing a special character:
>>> "Jó":"author":{
>>> "Jó":{ "numFound":0, "suggestions":[]}}
>>> 
>>> When I try it with three letters, it works again:
>>> "Józ":"author":{
>>> "Józ":{ "numFound":10, "suggestions":[
>>> { "term":"Bajza <b>Józ</b>sef", "weight":0, "payload":""},
>>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>>> { "term":"Eötvös <b>Józ</b>sef", "weight":0, "payload":""},
>>> { "term":"<b>Józ</b>sef Attila", "weight":0, "payload":""}..
>>> 
>>> Any idea how a longer string can have more matches than a shorter one?
>>> It is inconsistent. What can I do to fix it? It would result in a poor
>>> customer experience: users would feel that they sometimes need 2 and
>>> sometimes 3 characters to get suggestions.
>>> 
>>> Thanks in advance,
>>> Roland
>>> 
>> 
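
For reference, the fix Roland describes could look roughly like the sketch
below. This is an assumption, not his actual configuration: it copies the
short_text_hu definition from the original message, leaves out the
StopFilterFactory lines, and points the suggester at the new type. Only the
field type name short_text_hu_without_stop_removal comes from the thread.

```xml
<!-- Sketch only: short_text_hu from the thread, with stop word removal left out. -->
<fieldType name="short_text_hu_without_stop_removal" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- No StopFilterFactory here, so short terms such as "Jó" survive analysis. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- In the "author" suggester definition, only this line would change: -->
<str name="suggestAnalyzerFieldType">short_text_hu_without_stop_removal</str>
```

Since the suggester in the original message sets buildOnStartup and
buildOnCommit to false, it would presumably have to be rebuilt (for example
by sending suggest.build=true to /suggesthandler) before the change shows up
in the results.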

