lucene-solr-user mailing list archives

From Scott Gonyea <sc...@aitrus.org>
Subject Dismax Filtering Hyphens? Why is this not working? How do I debug Dismax?
Date Mon, 04 Oct 2010 16:42:53 GMT
Wow, this is probably the most annoying Solr issue I've *ever* dealt
with. First question: How do I debug Dismax, and its query handling?

Issue: When I query against this StrField, I am attempting to do an
*exact* match...  Albeit one that is case-insensitive :).  So, 90%
exact.  It works in a majority of cases.  Indeed, I am telling Solr
that this field is my uniqueKey and it enforces uniqueness
perfectly.  The issue comes about when I try to query for a document,
based on a key in this field, and the key I'm using has hyphens
(dashes) in it.  Then I get zero results.  Very frustrating.

The keys will always be URLs, e.g.,
"http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry"

Here's my configuration info...  schema.xml (the URL exists twice;
once in the 'id' field, of type 'idstr', for uniqueness, and once in
the 'url' field below. I am querying against the 'id' field):

    <fieldType name="idstr"   class="solr.StrField">
      <analyzer>
        <tokenizer class="solr.PatternTokenizerFactory" pattern="(.*)" group="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="url"     class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- <tokenizer  class="solr.StandardTokenizerFactory"/> -->
        <tokenizer  class="solr.StandardTokenizerFactory"/>
        <filter     class="solr.LowerCaseFilterFactory"/>
        <filter     class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
        <filter     class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
<!-- snip -->
    <field name="id"            type="idstr"    stored="true" indexed="true" required="true"/>
    <field name="url"           type="url"      stored="true" indexed="true" required="true"/>
<!-- snip -->
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="AND"/>


Yes, the PatternTokenizerFactory is an inefficient way to do what I
wanted above. It was a quick hack, while I looked for something that
does exactly what I'm after, i.e., an exact / WHOLE string match...
but lower case.
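
A minimal sketch of what such a type might look like, assuming a TextField with KeywordTokenizerFactory (which emits the entire input as a single token) followed by the lower-case filter. The type name "lcstring" is hypothetical and does not appear in the schema above, and this is not confirmed to fix the zero-results problem. As an additional assumption worth checking: a solr.StrField may index its value verbatim regardless of any analyzer declared on it, which is why the sketch uses solr.TextField:

```xml
<!-- Sketch only: "lcstring" is a hypothetical type name, not part of the schema above -->
<fieldType name="lcstring" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizerFactory keeps the whole value (hyphens included) as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lower-case that single token so matching is case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the type of an existing field would require reindexing the documents before the new analysis takes effect.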

Here's my solrconfig.xml:


<requestHandler name="/rb" class="solr.SearchHandler" default="true"  >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf"> content&#94;1.5 anchor&#94;0.3 title&#94;1.2 mcode&#94;1.0 site_id&#94;1.0 priority&#94;1.0</str>
    <str name="fl"> * </str>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">content title</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.content.hl.fragmenter">regex3</str>
  </lst>
</requestHandler>
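
On the debugging question, one standard option is the debugQuery request parameter, which adds a debug section (parsed query and score explanations) to the response. It can be appended to a single request as debugQuery=on or, as sketched here, set as a handler default. The handler name "/rb-debug" is hypothetical and only illustrates the idea; the remaining defaults would be copied from /rb above:

```xml
<!-- Sketch only: "/rb-debug" is a hypothetical handler used to inspect dismax parsing -->
<requestHandler name="/rb-debug" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- Echo every parameter in effect, including these defaults -->
    <str name="echoParams">all</str>
    <!-- Include the parsed query and explain output in each response -->
    <str name="debugQuery">true</str>
    <!-- qf, tie, q.alt and the other defaults as in the /rb handler above -->
  </lst>
</requestHandler>
```

The parsed query in the debug section shows which fields the dismax parser actually searched. With the qf list above, that set does not include the 'id' field.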


And, finally, when I run that sample URL through the analysis page,
here's the output (copied from the HTML)... I appreciate any/all help
anyone can provide.  Seriously.  I'll love you forever :(  :


<h3>Index Analyzer</h3>
<h4>org.apache.solr.analysis.PatternTokenizerFactory   null</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h4>org.apache.solr.analysis.LowerCaseFilterFactory   {}</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="highlight">http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h3>Query Analyzer</h3>
<h4>org.apache.solr.analysis.PatternTokenizerFactory   null</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/I-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
<h4>org.apache.solr.analysis.LowerCaseFilterFactory   {}</h4>
<table width="auto" class="analysis" border="1">
<tr>
<th NOWRAP rowspan="1">term position</th>
<td class="debugdata">1</td></tr>
<tr>
<th NOWRAP rowspan="1">term text</th>
<td class="debugdata">http://helloworld.abc/i-ruin-your-queries-aghghaahahaagcry</td></tr>
<tr>
<th NOWRAP rowspan="1">term type</th>
<td class="debugdata">word</td></tr>
<tr>
<th NOWRAP rowspan="1">source start,end</th>
<td class="debugdata">0,58</td></tr>
<tr>
<th NOWRAP rowspan="1">payload</th>
<td class="debugdata"></td></tr>
</table>
