lucene-solr-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Solr search engine configuration
Date Mon, 12 Mar 2018 16:04:58 GMT
Peter:

bq: I don't have a requestHandler named "/select".

Right, that was just an example of a request handler. Your
"/scoresearch" handler _does_ have edismax as its default "defType",
so assuming you're using that one, it makes no difference at all
whether you specify &defType=edismax on the URL or not. You'd see a
difference if you specified "&defType=any_parser_other_than_edismax",
though ;)
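
A minimal sketch of why that is (the handler is trimmed from your
config; the example requests in the comment are illustrative, not
taken from your logs): values under "defaults" are applied only when
the request doesn't send that parameter itself.

```xml
<requestHandler name="/scoresearch" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Used only when the request does not send its own defType -->
    <str name="defType">edismax</str>
  </lst>
</requestHandler>
<!-- Illustrative requests (assumed, not from the thread):
     /scoresearch?q=foo                  parsed by edismax (from defaults)
     /scoresearch?q=foo&defType=edismax  same parser, the parameter is redundant
     /scoresearch?q=foo&defType=lucene   overrides the default with the standard parser
-->
```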

As for the rest, I'll leave you in the much more capable hands of
Markus since he has, you know, real knowledge in this area rather than
my generalities....

Best,
Erick

On Mon, Mar 12, 2018 at 1:33 AM, Markus Jelsma
<markus.jelsma@openindex.io> wrote:
> Hi,
>
> Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to solve all problems
> either: for example, molens => molen, but molen => mool, and many more like that. You
> can solve this by adding manual rules to StemmerOverrideFilter, but due to the compound nature
> of words, you would need to add them for all the mills.
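>
> A minimal sketch of such an override, assuming a dictionary file named stemdict_nl.txt (a made-up
> name) that holds tab-separated word/stem pairs. It has to sit before the stemmer, so that the
> stemmer leaves the overridden tokens alone:
>
> ```xml
> <!-- stemdict_nl.txt is a hypothetical file, one "word<TAB>stem" pair per line, e.g.
>        molen<TAB>molen
> -->
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt" ignoreCase="true"/>
> <filter class="solr.SnowballPorterFilterFactory" language="Kp"/>
> ```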
>
> Regarding the compounds, Dutch is (more or less) just another Germanic language and uses
> compounds just like German, Swedish etc. To deal with that you can try the vanilla HyphenationCompoundWordTokenFilter
> (or something like that). Be sure not to set minWordLength too low, or you'll get plenty of
> bad results. The major drawback of this token filter is that it emits overlapping terms, and
> it may not always work with compounds whose first part is a plural, such as dierenzaak or
> scholierenkorting.
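>
> A rough sketch of how that filter could be wired in. The file names are placeholders you'd have to
> supply yourself (a Dutch hyphenation grammar and a word list), the numbers are only a starting
> point, and as far as I know the factory attribute is spelled minWordSize:
>
> ```xml
> <!-- hyphenation_nl.xml and compounds_nl.txt are hypothetical file names;
>      the size values are illustrative, not tuned -->
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
>         hyphenator="hyphenation_nl.xml"
>         dictionary="compounds_nl.txt"
>         minWordSize="5"
>         minSubwordSize="4"
>         maxSubwordSize="15"
>         onlyLongestMatch="true"/>
> ```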
>
> Also add an ASCIIFoldingFilter or ICUFoldingFilter to get rid of accents, or you may have
> trouble finding a café.
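>
> For the accents, a sketch using the stock ASCII folding factory (placing it right after lowercasing
> is my assumption; it needs to be in both the index and the query analyzer):
>
> ```xml
> <filter class="solr.LowerCaseFilterFactory"/>
> <!-- folds accented characters to ASCII, e.g. "café" becomes "cafe" -->
> <filter class="solr.ASCIIFoldingFilterFactory"/>
> ```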
>
> Regards,
> Markus
>
> -----Original message-----
>> From:PeterKerk <petervdkerk@hotmail.com>
>> Sent: Sunday 11th March 2018 23:55
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr search engine configuration
>>
>> Sorry for this lengthy post, but I wanted to be complete.
>>
>> The only occurrence of edismax in solrconfig.xml is this one:
>>
>>       <requestHandler name="/scoresearch" class="solr.SearchHandler"
>> default="true">
>>
>>                       <lst name="defaults">
>>                         <str name="defType">edismax</str>
>>                         <str name="echoParams">explicit</str>
>>                         <int name="rows">10</int>
>>
>>                         <str name="qf">double_score</str>
>>                         <str name="debug">false</str>
>>                         <str name="q.alt">*:*</str>
>>               </lst>
>>       </requestHandler>
>>
>> I don't have a requestHandler named "/select".
>>
>>
>> Also, removing the gramming definitely helped! :-)
>>
>> I tried to simplify my setup first and then expand, so what I have now is
>> this:
>>
>>
>>       <fieldType name="searchtext_nl" class="solr.TextField"
>> positionIncrementGap="100">
>>       <analyzer type="index">
>>               <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_nl.txt"/>
>>               <filter class="solr.LowerCaseFilterFactory"/>
>>               <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>> protected="protwords_nl.txt"></filter>
>>
>>
>>       </analyzer>
>>       <analyzer type="query">
>>               <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>               <filter class="solr.StopFilterFactory" ignoreCase="true"
>> words="stopwords_nl.txt"/>
>>               <filter class="solr.LowerCaseFilterFactory"/>
>>               <filter class="solr.SnowballPorterFilterFactory" language="Kp"
>> protected="protwords_nl.txt"></filter>
>>
>>
>>       </analyzer>
>>     </fieldType>
>>
>>       <field name="title_search_global" type="searchtext_nl" indexed="true"
>> stored="true"/>
>>
>> In my database I have these 4 values for "title" that populate
>> "title_search_global"
>>
>> "Hi there dier something else"
>> "Hi there dieren zaak something else"
>> "Hi there dierenzaak something else"
>> "Hi there dierzaak something else"
>>
>> PS: "dier" is the singular of "dieren".
>>
>> Using this query:
>> http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true
>>
>> These results are found:
>> "Hi there dier something else"
>> "Hi there dieren zaak something else"
>>
>> And these are NOT:
>> "Hi there dierenzaak something else"
>> "Hi there dierzaak something else"
>>
>> I'd expect it to be fairly easy (although I don't know how) to also
>> include the result "dierenzaak" by compounding the 2 query values. And yes, you
>> are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". I'm not
>> sure what logic would also include "dierzaak".
>>
>> Regarding your question: yes, I do consider "dieren zaak something else" an
>> exact match of "dieren zaak".
>> So I also checked the usage of pf parameters with edismax (based on these
>> links:
>> https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
>> http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
>> And also for dismax:
>> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter
>>
>> But I can't find any examples of how to actually use these parameters.
>>
>>
>> The search results, including debug info is here:
>>
>>
>> <response>
>>     <lst name="responseHeader">
>>         <int name="status">0</int>
>>         <int name="QTime">7</int>
>>         <lst name="params">
>>             <str name="q">title_search_global:(dieren zaak)</str>
>>             <str name="defType">edismax</str>
>>             <str name="debug">true</str>
>>             <str name="indent">true</str>
>>             <str name="qf">title_search_global</str>
>>             <str name="fl">id,title</str>
>>             <str name="fq">(lang:"nl" OR lang:"all")</str>
>>             <str name="wt">xml</str>
>>             <str name="lowercaseOperators">true</str>
>>             <str name="stopwords">true</str>
>>         </lst>
>>     </lst>
>>     <result name="response" numFound="2" start="0">
>>         <doc>
>>             <str name="title">dieren zaak</str>
>>             <str name="id">115_3699638</str>
>>         </doc>
>>         <doc>
>>             <str name="title">dier</str>
>>             <str name="id">115_3699637</str>
>>         </doc>
>>     </result>
>>     <lst name="debug">
>>         <str name="rawquerystring">title_search_global:(dieren zaak)</str>
>>         <str name="querystring">title_search_global:(dieren zaak)</str>
>>         <str name="parsedquery">
>> (+(title_search_global:dier title_search_global:zaak))/no_coord
>> </str>
>>         <str name="parsedquery_toString">
>> +(title_search_global:dier title_search_global:zaak)
>> </str>
>>         <lst name="explain">
>>             <str name="115_3699638">
>> 5.489122 = (MATCH) sum of: 2.4387078 = (MATCH)
>> weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
>> 2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of: 0.66654336 =
>> queryWeight, product of: 5.8539815 = idf(docFreq=3, maxDocs=513) 0.113861546
>> = queryNorm 3.6587384 = fieldWeight in 51, product of: 1.0 = tf(freq=1.0),
>> with freq of: 1.0 = termFreq=1.0 5.8539815 = idf(docFreq=3, maxDocs=513)
>> 0.625 = fieldNorm(doc=51) 3.050414 = (MATCH) weight(title_search_global:zaak
>> in 51) [DefaultSimilarity], result of: 3.050414 = score(doc=51,freq=1.0 =
>> termFreq=1.0 ), product of: 0.7454662 = queryWeight, product of: 6.5471287 =
>> idf(docFreq=1, maxDocs=513) 0.113861546 = queryNorm 4.091955 = fieldWeight
>> in 51, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
>> 6.5471287 = idf(docFreq=1, maxDocs=513) 0.625 = fieldNorm(doc=51)
>> </str>
>>             <str name="115_3699637">
>> 1.9509662 = (MATCH) product of: 3.9019325 = (MATCH) sum of: 3.9019325 =
>> (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result
>> of: 3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
>> 0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
>> maxDocs=513) 0.113861546 = queryNorm 5.8539815 = fieldWeight in 50, product
>> of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
>> idf(docFreq=3, maxDocs=513) 1.0 = fieldNorm(doc=50) 0.5 = coord(1/2)
>> </str>
>>             <str name="110_141">
>> 0.9754831 = (MATCH) product of: 1.9509662 = (MATCH) sum of: 1.9509662 =
>> (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result
>> of: 1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
>> 0.66654336 = queryWeight, product of: 5.8539815 = idf(docFreq=3,
>> maxDocs=513) 0.113861546 = queryNorm 2.9269907 = fieldWeight in 132, product
>> of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 5.8539815 =
>> idf(docFreq=3, maxDocs=513) 0.5 = fieldNorm(doc=132) 0.5 = coord(1/2)
>> </str>
>>         </lst>
>>         <str name="QParser">ExtendedDismaxQParser</str>
>>         <null name="altquerystring" />
>>         <null name="boost_queries" />
>>         <arr name="parsed_boost_queries" />
>>         <null name="boostfuncs" />
>>         <arr name="filter_queries">
>>             <str>(lang:"nl" OR lang:"all")</str>
>>         </arr>
>>         <arr name="parsed_filter_queries">
>>             <str>lang:nl lang:all</str>
>>         </arr>
>>         <lst name="timing">
>>             <double name="time">7.0</double>
>>             <lst name="prepare">
>>                 <double name="time">4.0</double>
>>                 <lst name="query">
>>                     <double name="time">4.0</double>
>>                 </lst>
>>                 <lst name="facet">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="mlt">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="highlight">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="stats">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="debug">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>             </lst>
>>             <lst name="process">
>>                 <double name="time">3.0</double>
>>                 <lst name="query">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="facet">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="mlt">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="highlight">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="stats">
>>                     <double name="time">0.0</double>
>>                 </lst>
>>                 <lst name="debug">
>>                     <double name="time">3.0</double>
>>                 </lst>
>>             </lst>
>>         </lst>
>>     </lst>
>> </response>
>>
>>
>> PS. had to laugh out loud about that professor joke :-D
>>
>>
>>
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
