lucene-solr-user mailing list archives

From Steven A Rowe <sar...@syr.edu>
Subject RE: Solr synonyms format query time vs index time
Date Tue, 17 Aug 2010 18:56:24 GMT
Hi Michael,

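I think the problem you're seeing is that no indexed document contains "reebox", because you've
used the "explicit" syntax (source=>dest) instead of the "equivalent" syntax (term,term,term).
At index time the explicit rule only rewrites "reebox" to "Reebok" in documents; since your query
analyzer has no synonym filter, a query for "reebox" is searched as-is and matches nothing.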

I'm guessing that if you convert your synonym file from:

	reebox => Reebok

to:

	reebox, Reebok

and leave expand=true, and then reindex, everything will work: your indexed documents containing
"Reebok" will be made to include "reebox", so queries for "reebox" will produce hits on those
documents.
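
As a rough illustration of the difference, here is a simplified Python sketch. It is not Solr's
actual implementation: the function names (parse_synonyms, analyze, matches) are made up for
this example, it handles single-word terms only, and it ignores token positions, stemming and
the other filters in your chain. It only mimics index-time synonym replacement followed by
lowercasing, with no synonyms applied to the query.

```python
def parse_synonyms(lines, expand=True):
    """Parse a simplified, single-word subset of the synonyms.txt format.

    Hypothetical helper for illustration; terms are lowercased to mimic
    ignoreCase="true".
    """
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: left-hand terms are REPLACED by right-hand terms.
            lhs, rhs = line.split("=>", 1)
            sources = [t.strip().lower() for t in lhs.split(",")]
            targets = [t.strip().lower() for t in rhs.split(",")]
        else:
            # Equivalent list: with expand=True every term maps to all terms,
            # otherwise every term maps to the first term only.
            terms = [t.strip().lower() for t in line.split(",")]
            sources = terms
            targets = terms if expand else terms[:1]
        for source in sources:
            bucket = rules.setdefault(source, [])
            for target in targets:
                if target not in bucket:
                    bucket.append(target)
    return rules


def analyze(text, rules=None):
    """Whitespace-tokenize and lowercase; apply synonym rules if given."""
    tokens = text.lower().split()
    if rules is None:
        return tokens
    out = []
    for token in tokens:
        out.extend(rules.get(token, [token]))
    return out


def matches(doc, query, index_rules):
    """True if every query token (no query-time synonyms) is in the indexed tokens."""
    indexed = set(analyze(doc, index_rules))
    return all(token in indexed for token in analyze(query))


explicit = parse_synonyms(["reebox => Reebok"])
equivalent = parse_synonyms(["reebox, Reebok"])
doc = "Reebok running shoes"

print(matches(doc, "reebox", explicit))    # explicit rule: "reebox" never reaches the index
print(matches(doc, "reebox", equivalent))  # equivalent rule: "reebox" is indexed next to "reebok"
```

With the explicit rule, the document's tokens stay as "reebok", so the unmodified query term
"reebox" has nothing to match. With the equivalent rule and expansion, "reebox" is added to the
indexed tokens wherever "Reebok" occurs.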

Steve

> -----Original Message-----
> From: mtdowling [mailto:mtdowling@gmail.com]
> Sent: Tuesday, August 17, 2010 2:24 PM
> To: solr-user@lucene.apache.org
> Subject: Solr synonyms format query time vs index time
> 
> 
> My company recently started using Solr for site search and autocomplete.
> It's working great, but we're running into a problem with synonyms.  We are
> generating a synonyms.txt file from a database table and using that
> synonyms.txt file at index time on a text type field.  Here's an excerpt
> from the synonyms file:
> 
> reebox => Reebok
> shinguards => Shin Guards
> shirt => T-Shirt,Shirt
> shmak => Shmack
> shocks => shox
> skateboard => Skate
> skateboarding => Skate
> skater => Skate
> skates => Skate
> skating => Skate
> skirt => Dresses
> 
> When we do a search for reebox, we want the term to be mapped to "Reebok"
> through explicit mapping, but for some reason this isn't happening.  We do
> have multi-word synonyms, and from what I've read on the mailing list, those
> only work at index time, so we are only using the synonym filter factory at
> index time:
> 
> <fieldType name="search" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                 ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="0" generateNumberParts="0" catenateWords="1"
>                 catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory"
>                 generateWordParts="0" generateNumberParts="0" catenateWords="1"
>                 catenateNumbers="1" catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.SnowballPorterFilterFactory" language="English"
>                 protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     </analyzer>
> </fieldType>
> 
> Here are the other relevant schema.xml settings:
> 
> <field name="mashup" type="search" indexed="true" stored="false" multiValued="true"/>
> <copyField source="keywords" dest="mashup"/>
> <copyField source="category" dest="mashup"/>
> <copyField source="name" dest="mashup"/>
> <copyField source="brand" dest="mashup"/>
> <copyField source="description_overview" dest="mashup"/>
> <copyField source="sku" dest="mashup"/>
> <!-- other copy fields... -->
> 
> The output of the query analyzer shows the following:
> 
> Query Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> 
> So "reebox" is never being converted to "Reebok".  I thought that if I had
> index time synonyms with expansion configured that I wouldn't need query
> time synonyms.  Maybe my dynamic synonyms generation isn't formatted
> correctly for my desired result?
> 
> If I use the same synonyms.txt file and use the index analyzer, reebox is
> mapped to Reebok and then indexed correctly:
> 
> Index Analyzer
> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
> term position 	1
> term text 	reebox
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
> term position 	1
> term text 	Reebok
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
> term position 	1
> term text 	Reebok
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
> term position 	1
> term text 	Reebok
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.LowerCaseFilterFactory {}
> term position 	1
> term text 	reebok
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
> term position 	1
> term text 	reebok
> term type 	word
> source start,end 	0,6
> payload
> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
> term position 	1
> term text 	reebok
> term type 	word
> source start,end 	0,6
> payload
> 
> 
> Should I use equivalent mapping instead of explicit mapping if I'm only
> using index-time synonyms?  Or should I turn query time synonyms on for my
> search field?
> 
> Thanks,
> Michael