lucene-solr-user mailing list archives

From Lance Norskog <goks...@gmail.com>
Subject Re: Solr synonyms format query time vs index time
Date Tue, 17 Aug 2010 20:35:57 GMT
The analysis page (solr/admin/analysis.jsp) lets you see how this works.
Enter your text in the index boxes to see what the index-time analyzer emits.
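
For a rough picture of what the two synonym formats Steve describes below do at index time, here is a minimal Python sketch. It is not Solr code: parse_rules, index_terms, query_terms and matches are hypothetical helpers, the sample document text is made up, only single-word synonyms are handled, and the stop word, word delimiter and stemming filters from the schema are left out.

```python
def parse_rules(lines, expand=True):
    """Parse a small subset of the synonyms.txt format into {source: [targets]}.

    Illustrative only: single-word terms, ignoreCase=true behaviour assumed.
    """
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: each source term is replaced by the targets.
            left, right = line.split("=>", 1)
            sources = [s.strip().lower() for s in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent mapping: with expand=True every term maps to all terms;
            # with expand=False every term maps to the first term only.
            terms = [t.strip() for t in line.split(",")]
            sources = [t.lower() for t in terms]
            targets = terms if expand else terms[:1]
        for source in sources:
            rules.setdefault(source, []).extend(targets)
    return rules


def index_terms(text, rules):
    """Index-time chain (simplified): whitespace tokenize, synonyms, lowercase."""
    terms = set()
    for token in text.split():
        for out in rules.get(token.lower(), [token]):
            terms.add(out.lower())
    return terms


def query_terms(text):
    """Query-time chain (simplified): no synonym filter, just lowercase."""
    return {token.lower() for token in text.split()}


def matches(query, indexed):
    """True if any query term is present among the indexed terms."""
    return bool(query_terms(query) & indexed)


doc = "Reebok running shoes"  # made-up document text
explicit = index_terms(doc, parse_rules(["reebox => Reebok"]))
equivalent = index_terms(doc, parse_rules(["reebox, Reebok"]))

print(matches("reebox", explicit))    # False: "reebox" was never indexed
print(matches("reebox", equivalent))  # True: "reebox" indexed next to "reebok"
```

With the explicit rule, a document only gains "reebok" if it already contained "reebox", so a query for "reebox" finds nothing. With the equivalent rule and expand=true, documents containing "Reebok" are indexed under both terms.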

Lance

On Tue, Aug 17, 2010 at 11:56 AM, Steven A Rowe <sarowe@syr.edu> wrote:
> Hi Michael,
>
> I think the problem you're seeing is that no document contains "reebox",
> and you've used the "explicit" syntax (source=>dest) instead of the
> "equivalent" syntax (term,term,term).
>
> I'm guessing that if you convert your synonym file from:
>
>        reebox => Reebok
>
> to:
>
>        reebox, Reebok
>
> and leave expand=true, and then reindex, everything will work: your
> indexed documents containing "Reebok" will be made to include "reebox",
> so queries for "reebox" will produce hits on those documents.
>
> Steve
>
>> -----Original Message-----
>> From: mtdowling [mailto:mtdowling@gmail.com]
>> Sent: Tuesday, August 17, 2010 2:24 PM
>> To: solr-user@lucene.apache.org
>> Subject: Solr synonyms format query time vs index time
>>
>>
>> My company recently started using Solr for site search and autocomplete.
>> It's working great, but we're running into a problem with synonyms.  We
>> are generating a synonyms.txt file from a database table and using that
>> synonyms.txt file at index time on a text type field.  Here's an excerpt
>> from the synonyms file:
>>
>> reebox => Reebok
>> shinguards => Shin Guards
>> shirt => T-Shirt,Shirt
>> shmak => Shmack
>> shocks => shox
>> skateboard => Skate
>> skateboarding => Skate
>> skater => Skate
>> skates => Skate
>> skating => Skate
>> skirt => Dresses
>>
>> When we do a search for reebox, we want the term to be mapped to "Reebok"
>> through explicit mapping, but for some reason this isn't happening.  We do
>> have multi-word synonyms, and from what I've read on the mailing list,
>> those only work at index time, so we are only using the synonym filter
>> factory at index time:
>>
>> <fieldType name="search" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="0" generateNumberParts="0" catenateWords="1"
>>             catenateNumbers="1" catenateAll="0"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>             protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt"/>
>>     <filter class="solr.WordDelimiterFilterFactory"
>>             generateWordParts="0" generateNumberParts="0" catenateWords="1"
>>             catenateNumbers="1" catenateAll="0"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>             protected="protwords.txt"/>
>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> Here's more relevant schema.xml configs:
>>
>> <field name="mashup" type="search" indexed="true" stored="false" multiValued="true"/>
>> <copyField source="keywords" dest="mashup"/>
>> <copyField source="category" dest="mashup"/>
>> <copyField source="name" dest="mashup"/>
>> <copyField source="brand" dest="mashup"/>
>> <copyField source="description_overview" dest="mashup"/>
>> <copyField source="sku" dest="mashup"/>
>> <!-- other copy fields... -->
>>
>> The output of the query analyzer shows the following:
>>
>> Query Analyzer
>> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.LowerCaseFilterFactory {}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>>
>> So "reebox" is never being converted to "Reebok".  I thought that if I had
>> index time synonyms with expansion configured that I wouldn't need query
>> time synonyms.  Maybe my dynamic synonyms generation isn't formatted
>> correctly for my desired result?
>>
>> If I use the same synonyms.txt file and use the index analyzer, reebox is
>> mapped to Reebok and then indexed correctly:
>>
>> Index Analyzer
>> org.apache.solr.analysis.WhitespaceTokenizerFactory {}
>> term position         1
>> term text     reebox
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt, expand=true, ignoreCase=true}
>> term position         1
>> term text     Reebok
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt, ignoreCase=true}
>> term position         1
>> term text     Reebok
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=0, catenateWords=1, generateWordParts=0, catenateAll=0, catenateNumbers=1}
>> term position         1
>> term text     Reebok
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.LowerCaseFilterFactory {}
>> term position         1
>> term text     reebok
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.SnowballPorterFilterFactory {protected=protwords.txt, language=English}
>> term position         1
>> term text     reebok
>> term type     word
>> source start,end      0,6
>> payload
>> org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
>> term position         1
>> term text     reebok
>> term type     word
>> source start,end      0,6
>> payload
>>
>>
>> Should I use equivalent mapping instead of explicit mapping if I'm only
>> using index-time synonyms?  Or should I turn query time synonyms on for my
>> search field?
>>
>> Thanks,
>> Michael
>



-- 
Lance Norskog
goksron@gmail.com
