lucene-solr-user mailing list archives

From Cat Bieber <cbie...@techtarget.com>
Subject phrase query and string/keyword tokenizer
Date Thu, 14 Jun 2012 16:42:30 GMT
I have documents that are word definitions (basically an online 
dictionary) that can have alternate titles. For example the document 
entitled "Read-only memory" might have an alternate title of "ROM". In 
search results, I want to boost documents with an alternate title that 
is a case-insensitive "exact match" for the query text -- e.g. "rom" 
should work as well.

I'm running Solr 3.6 and using edismax.

I've gone through a few iterations of this. What I have working best so 
far is a multi-valued text field for the alternate titles with a big boost:

<fieldType name="lowerCaseSort" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="bestMatchTitle" type="lowerCaseSort" indexed="true"
       stored="false" multiValued="true"/>

This produces great results with single-word searches like the "ROM" 
example above, but it runs into problems with a multi-word alternate 
title like "Blue Tooth". I have read some of the prior discussions about 
this: the query parser splits the query text on whitespace before it 
reaches the keyword tokenizer for the field type.
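
To illustrate what I believe is happening, here is a rough Python sketch. It is not Solr code: the function names are hypothetical, and the regex is only a crude stand-in for the HTML strip char filter. It approximates the lowerCaseSort chain (one token per field value, lowercased and trimmed) and assumes the query text is split on whitespace before each piece is analyzed.

```python
import re

def analyze_best_match_title(value):
    """Approximate the lowerCaseSort chain: one token per value, lowercased, trimmed."""
    no_html = re.sub(r"<[^>]+>", "", value)  # crude stand-in for HTML stripping (assumption)
    token = no_html.lower().strip()
    return [token] if token else []

# Index two alternate titles as the field would see them.
indexed = set()
for title in ["ROM", "Blue Tooth"]:
    indexed.update(analyze_best_match_title(title))

def term_hits(query):
    """Split on whitespace first (assumed parser behavior), then analyze each piece."""
    hits = []
    for chunk in query.split():
        for token in analyze_best_match_title(chunk):
            if token in indexed:
                hits.append(token)
    return hits

print(term_hits("ROM"))          # single word matches the indexed token
print(term_hits("blue tooth"))   # neither "blue" nor "tooth" is an indexed token
print(analyze_best_match_title("Blue Tooth")[0] in indexed)  # whole string does match
```

Under these assumptions the indexed token is the whole string "blue tooth", so the separate terms "blue" and "tooth" can never match it.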

The question I have is about phrase queries in this case. My request 
handler has:

<str name="qf">bestMatchTitle^20 title^5 summary^3 metaDescription^1.5 
body^1 author^0.5</str>
<str name="pf">bestMatchTitle^20 title^5 summary^3 metaDescription^1.5 
body^1 author^0.5</str>

When I run a query, I get this:

+((DisjunctionMaxQuery((metaDescription:blue^1.5 | summary:blue^3.0 |
author:blue^0.5 | body:blue | title:blue^5.0 |
bestMatchTitle:blue^20.0)~0.01)
DisjunctionMaxQuery((metaDescription:tooth^1.5 | summary:tooth^3.0 |
author:tooth^0.5 | body:tooth | title:tooth^5.0 |
bestMatchTitle:tooth^20.0)~0.01))~2)
DisjunctionMaxQuery((metaDescription:"blue tooth"~100^1.5 |
summary:"blue tooth"~100^3.0 | body:"blue tooth"~100 |
title:"blue tooth"~100^5.0)~0.01)

It looks like the phrase isn't being matched against my bestMatchTitle 
field. It also isn't matched against author, which is type string. So do 
phrases only get matched against certain field types?

When I put the quotes in the query text:

/select/?qt=best-match&q="blue+tooth"&debugQuery=on

It builds the query I was hoping to get:

+DisjunctionMaxQuery((metaDescription:"blue tooth"^1.5 |
summary:"blue tooth"^3.0 | author:blue tooth^0.5 | body:"blue tooth" |
title:"blue tooth"^5.0 | bestMatchTitle:blue tooth^20.0)~0.01)
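
For anyone reproducing this, the quoted request can be built with the quotes and space URL-encoded. This is a minimal sketch using only the Python standard library. The path and parameters are the ones from my request above, and the variable names are just for illustration.

```python
from urllib.parse import urlencode

# Same parameters as the quoted request; the phrase keeps its double quotes.
params = {
    "qt": "best-match",
    "q": '"blue tooth"',
    "debugQuery": "on",
}

# urlencode percent-encodes the quotes and turns the space into "+".
url = "/select/?" + urlencode(params)
print(url)
```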

But I still need the query on the individual tokens, otherwise it 
eliminates results that may be good hits. So far, any way I have tried 
to combine the two queries either opens up matching a ton of documents 
that shouldn't really match (e.g. total found goes from 24 to 4800+ 
documents) or doesn't match the one I want, giving poor results.

Does anyone have suggestions for how I can convince the phrase query to 
match against my bestMatchTitle field, or change the query text I'm 
passing in to combine these two queries and get the boost I want? Or is 
there another approach altogether that I'm missing?

Thanks for any help with this.
-Cat Bieber

