lucene-solr-user mailing list archives

From: "Stephanie Belton" <s...@zizou.net>
Subject: Using dismax to find multiple terms across multiple fields
Date: Thu, 30 Nov 2006 13:09:41 GMT
Hello,

I am using Solr to index and search documents in Russian. I have successfully set up the RussianAnalyzer
but found that it eliminates some tokens, such as numbers. I am therefore indexing my text
fields in two ways: once with a fairly literal version of the text, using something similar to
textTight in the example config:

    <fieldtype name="text_literal" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>
 
And I index my fields again using the RussianAnalyzer to handle Russian stemming and stop
words:
    <fieldtype name="text_ru_RU" class="solr.TextField"  >
      <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
    </fieldtype>

I then specify my field names:
   <dynamicField name="*_ru_RU"   type="text_ru_RU" indexed="true" stored="false"/>
   <dynamicField name="*_literal" type="text_literal" indexed="true" stored="false"/>

And use the copyField feature to index them twice:
   <copyField source="title_ru_RU"    dest="title_literal"/>
   <copyField source="location_ru_RU" dest="location_literal"/>
   <copyField source="body_ru_RU"     dest="body_literal"/>
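
For reference, with this setup an update document only needs to supply the *_ru_RU fields; copyField fills in the _literal twins at index time. A minimal sketch (the id field and all values here are made-up examples, not from my actual data):

    <add>
      <doc>
        <field name="id">ad-1001</field>
        <field name="title_ru_RU">Продам автомобиль 1970 года</field>
        <field name="location_ru_RU">Ташкент</field>
        <field name="body_ru_RU">Автомобиль 1970 года выпуска, Ташкент.</field>
      </doc>
    </add>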

I then specify my own DisMaxRequestHandler in solrconfig.xml:
  <requestHandler name="dismax_ru_RU" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
     <float name="tie">0.01</float>
     <str name="qf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 location_literal^0.5 location_ru_RU^0.4
     </str>
     <str name="pf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 location_literal^0.5 location_ru_RU^0.4
     </str>
     <str name="mm">100%</str>
     <int name="ps">100</int>
    </lst>
  </requestHandler>
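
For completeness, the kind of request I am sending against this handler looks roughly like the following (the host, port, and the name of my date field are specific to my setup, and the exact sort syntax may differ between Solr versions):

    http://localhost:8983/solr/select?qt=dismax_ru_RU&q=1970+Ташкент&sort=date+desc,score+desc&debugQuery=on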
 
Because I am searching through classified ads, date sorting matters more to me than relevance,
so I am sorting by date first and then by score. I expect the system to return all
matches for today's ads sorted by relevance, followed by matches for yesterday's ads sorted
by relevance, and so on. I would also like the search to return only ads where every single term
of the query was found across my 3 fields (title, body, location). I can't seem to get this
to work. When I search for ‘1970’, it works fine and returns 2 ads containing 1970.
If I search for ‘Ташкент’ I get 3 results, including one matched via Russian stemming (Ташкента).
But when I search for ‘1970 Ташкент’ it seems to ignore 1970 and gives me
the same results as searching for ‘Ташкент’ alone. I enabled the debug output,
and 1970 does seem to be ignored in the matching:

<lst name="debug">
 <str name="rawquerystring">"1970 Ташкент"</str>
 <str name="querystring">"1970 Ташкент"</str>
 <str name="parsedquery">+DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01) DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01)</str>
 <str name="parsedquery_toString">+(body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01 (body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01</str>
 <lst name="explain">
  <str name="id=€#26;੥,internal_docid=4">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
</str>
  <str name="id=€#26;ી,internal_docid=9">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
</str>
  <str name="id=€#26;੕,internal_docid=2">
0.43162674 = (MATCH) sum of:
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
</str>
 </lst>
</lst>

Apologies for the verbosity; can anyone help me achieve my goal?

Thanks
Stephanie


