lucene-solr-user mailing list archives

From Francisco Sanmartin <francis...@olx.com>
Subject solr is highlighting wrong words
Date Thu, 04 Sep 2008 20:25:13 GMT
Highlighting in Solr behaves strangely for some items: Solr is
highlighting the wrong words. I'm searching for the word "car" and I
tell Solr to wrap matches in <strong> and </strong>. The response is
fine in most cases, but for some items the wrong words come back
highlighted. I attach an example at the bottom in case anyone can shed
some light on it.


The problem in this example is that Solr highlights the word "his", but 
the search word is "car".
This is the scenario:

Solr 1.2
The URL:
http://solr-server:8983/solr/select/?q=id:11439968%20AND%20description%3Acar&hl=on&hl.fl=description&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%20%3C%2Fstrong%3E
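
For anyone who wants to read the request without decoding the percent-escapes by hand, here is a minimal Python sketch that splits the URL above into its parameters. It uses only the standard library; the variable names are just for illustration. Note that the decoded hl.simple.post value starts with a space.

```python
from urllib.parse import urlsplit, parse_qs

# The request URL exactly as sent to Solr (copied from this message).
url = (
    "http://solr-server:8983/solr/select/"
    "?q=id:11439968%20AND%20description%3Acar"
    "&hl=on&hl.fl=description"
    "&hl.simple.pre=%3Cstrong%3E"
    "&hl.simple.post=%20%3C%2Fstrong%3E"
)

# parse_qs percent-decodes each value and returns a list per key;
# every parameter appears once here, so keep the first element.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for name, value in params.items():
    # repr() makes the leading space in hl.simple.post visible.
    print(f"{name} = {value!r}")
```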

The same query, as echoed in the response parameters:
<lst name="params">
<str name="hl.simple.pre"><strong></str>
<str name="hl.simple.post"> </strong></str>
<str name="hl.fl">description</str>
<str name="hl">on</str>
<str name="q">id:11439968 AND description:car</str>
</lst>

(I include the id in the query to retrieve only the item whose
highlighting fails, which keeps the example clear.)

The response:
<result name="response" numFound="1" start="0">
  <doc>
    ...
    <int name="id">11439968</int>
     ...
     <str name="description">
      This is a one of a kind all custom &#39;95 Integra LS with 2005 
TSX headlight and tailight conversion. It has GSR all black interior, 18 
inch rims,     strut bars, cd changer, coil overs, HID headlights, 
catback exhaust, intake, new clutch and brakes. Motor has 130,000 miles. 
No smoke or leaks.     Runs great. This car is completly shaved. Paint 
is a two toned black/white with white ice flake. It is flawless and 
ready to show. This car has not     even seen winter after being built! 
It is stored in a garage all year. Serious inquires only (203)994-0085. 
OR Email GUNITGN@yahoo.com. $8,500     OR BEST OFFER!!!!!
    </str>
    ...
  </doc>
<lst name="highlighting">
    <lst name="11439968">
        <arr name="description">
            <str>
                back exhaust, intake, new clutch and brakes. Motor has 
130,000 miles. No smoke or leaks. Runs great. T<strong>his </strong>
            </str>
        </arr>
    </lst>
</lst>
</response>
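
To put a number on how far off the highlight is, here is a small Python sketch that compares where the markup landed with where "car" actually sits. It works on a short excerpt of the stored text and of the returned snippet (both copied from the response above); the variable names are illustrative only. The last lines measure the length difference between the entity &#39; that appears in the stored description and the apostrophe it decodes to. That this difference equals the shift may be a coincidence; I have not confirmed it as the cause.

```python
import html

# Excerpt of the stored description and the tail of the returned snippet.
stored = "Runs great. This car is completly shaved."
snippet = "Runs great. T<strong>his </strong>"

# Highlight markers as passed in hl.simple.pre / hl.simple.post.
pre, post = "<strong>", " </strong>"

# Text between the markers, and its offset once the markup is removed.
open_at = snippet.index(pre)
close_at = snippet.index(post)
marked = snippet[open_at + len(pre):close_at]
actual_start = open_at

# Where the search term really starts in the same excerpt.
expected_start = stored.index("car")
shift = expected_start - actual_start

print(f"highlighted {marked!r} at offset {actual_start}")
print(f"expected 'car' at offset {expected_start}, shift = {shift}")

# Extra characters the entity takes up compared with its decoded form.
entity = "&#39;"
entity_extra = len(entity) - len(html.unescape(entity))
print(f"{entity} is {entity_extra} characters longer than its decoded form")
```

The highlighted span has the same length as the search term but starts a few characters too early, which suggests the term was matched correctly and only the character offsets are wrong.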

The schema (relevant parts):

<field name="description"            type="text_html"   indexed="true" stored="true"/>

...

     <fieldtype name="text_html" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                  catenateWords="1" catenateNumbers="1" catenateAll="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
          <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
                  expand="true"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true"/>
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                  catenateWords="0" catenateNumbers="0" catenateAll="0"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
          <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>


Thanks in advance.

Pako


