lucene-solr-user mailing list archives

From Naomi Dushay <ndus...@stanford.edu>
Subject mm, tie, qs, ps and CJKBigramFilter and edismax and dismax
Date Wed, 04 Sep 2013 00:54:50 GMT
When a field uses CJKBigramFilter, a query made of CJK characters produces a differently
structured parsedquery than a non-CJK query against the same field.

  (旧小说 is 3 chars, so 2 bigrams)
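
To make the bigram count concrete, here is a minimal Python sketch of overlapping bigram generation. The function name cjk_bigrams is hypothetical, and this only approximates what CJKBigramFilter does for a single run of Han characters with outputUnigrams="false"; the real filter also handles script boundaries and token types.

```python
def cjk_bigrams(text):
    """Return overlapping 2-character bigrams for one run of CJK characters."""
    if len(text) < 2:
        # Assumption: a lone character has no neighbor, so it stays a unigram.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(cjk_bigrams("旧小说"))
```

For the 3-character query this yields the two terms 旧小 and 小说 that appear in the parsed query.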

args sent in:       q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=

 debugQuery
   <str name="rawquerystring">{!qf=bi_fld}旧小说</str>
   <str name="querystring">{!qf=bi_fld}旧小说</str>
   <str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>


If I use a non-CJK query string with the same field:

args sent in:      q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=

debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}foo bar</str>
   <str name="querystring">{!qf=bi_fld}foo bar</str>
   <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>


Why do the two parsedquery_toString values have different structures?  And is there any
difference in the actual relevancy formula?

How can you tell the difference between the MinNrShouldMatch and a qs or ps or tie value,
if they are all represented as ~n  in the parsedQuery string?
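
One possible reading of the notation, offered as an assumption rather than a confirmed answer: a ~n directly after a quoted phrase is a slop (qs or ps), a ~n with a decimal point after a closing parenthesis is the dismax tie, and an integer ~n after a closing parenthesis is the minimum-should-match count. The Python sketch below applies that heuristic to the parsedquery_toString from the edismax qs/ps experiment in this message; classify_tildes is a hypothetical helper, not part of Solr.

```python
import re

# Matches a quoted phrase or a closing parenthesis, followed by ~number.
TILDE = re.compile(r'("[^"]*"|\))~(\d+(?:\.\d+)?)')

def classify_tildes(parsed):
    """Label each ~n in a parsedquery_toString value (heuristic, assumed reading)."""
    labels = []
    for target, number in TILDE.findall(parsed):
        if target.startswith('"'):
            labels.append(("slop", number))   # phrase slop: qs or ps
        elif "." in number:
            labels.append(("tie", number))    # dismax tie breaker
        else:
            labels.append(("mm", number))     # minimum should match
    return labels

example = ('+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) '
           '(bi_fld:"c d"~4)~0.01')
for label, number in classify_tildes(example):
    print(label, number)
```

This heuristic cannot tell qs from ps; that still depends on whether the phrase sits in the main query clause or in the boost clause.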


To try to get a handle on qs, ps, tie and mm:

 args:  q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>


I get that qs, the query slop, applies to explicit phrases in the query, so "a b"~5 makes sense.
I also get that ps is the slop for phrase boosting, so I understand (bi_fld:"c d"~4) … but where
is (bi_fld:"a b c d"~4)?


Using dismax (instead of edismax):

args:   q={!dismax  qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>


So is this an edismax bug?



FYI,   I am running Solr 4.4. I have fields defined like so:
<fieldtype name="text_cjk_bi" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="false" />
  </analyzer>
</fieldtype>

The request handler uses edismax:

<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="q.alt">*:*</str>
<str name="mm">6&lt;-1 6&lt;90%</str>
<int name="qs">1</int>
<int name="ps">0</int>
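
As a sanity check on the ~2 and ~3 values in the debug output, here is a minimal Python sketch of how I understand the documented mm syntax to be evaluated for the handler default 6<-1 6<90%. The function name min_should_match is hypothetical, and this is an approximation of the documented rules (conditional "n<value" clauses, negative values meaning "all but this many", percentages rounded down), not Solr's own code.

```python
def min_should_match(optional_clauses, spec):
    """Approximate the documented mm rules; not Solr's actual implementation."""
    result = optional_clauses
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "n<value" applies only when there are more than n clauses.
        for cond in spec.split():
            bound, value = cond.split("<", 1)
            if optional_clauses <= int(bound):
                return result
            result = min_should_match(optional_clauses, value)
        return result
    if spec.endswith("%"):
        # Percentage of the clause count, truncated toward zero.
        calc = int(optional_clauses * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # Negative values mean "all but this many".
    result = optional_clauses + calc if calc < 0 else calc
    return max(0, min(optional_clauses, result))

for n in (2, 3, 7, 10):
    print(n, min_should_match(n, "6<-1 6<90%"))
```

Under these assumptions, queries with 2 and 3 optional clauses require 2 and 3 matches, which is consistent with the ~2 and ~3 shown in the parsed queries; the conditional parts only start to matter above 6 clauses.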