lucene-solr-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: synonyms.txt file updated frequently
Date Wed, 31 Dec 2008 14:07:28 GMT

On Dec 30, 2008, at 4:38 PM, Smiley, David W. wrote:

> Grant, the Solr wiki recommends doing expansion at index time and  
> gives reasons:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46
>

I personally think "recommends" is too strong a word, but the  
points are valid reasons to do index-time synonyms.  In Alexander's  
case, I think index-time is a bit more problematic, since he is  
frequently updating the synonym list, meaning he would have to reindex  
every time; otherwise his stats are going to be even more skewed.

As for multi-word expansions, the query parser can be fixed or an  
alternate one used.
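
To illustrate why the query parser matters here, a toy Python sketch (not Lucene code): if the parser splits the query on whitespace before analysis, the synonym filter only ever sees one word at a time, so a multi-word entry can never match. The "sea biscuit" entry and all function names below are hypothetical.

```python
# Hypothetical multi-word synonym entry: "sea biscuit" -> "seabiscuit".
SYNONYMS = {("sea", "biscuit"): ["seabiscuit"]}


def apply_synonyms(tokens):
    """Toy synonym filter: adds the synonym after a matching two-token run."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            out.extend(pair)
            out.extend(SYNONYMS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def whitespace_split_parser(query):
    """Analyzes each whitespace-separated chunk on its own."""
    result = []
    for chunk in query.split():
        result.extend(apply_synonyms([chunk]))
    return result


def whole_string_parser(query):
    """Hands the full token stream to the synonym filter in one pass."""
    return apply_synonyms(query.split())


split_result = whitespace_split_parser("sea biscuit")
whole_result = whole_string_parser("sea biscuit")
```

Only the second parser gives the filter a chance to see both words together, which is the kind of fix or alternate parser I mean.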


> Query-time doesn't work for multi-word expansion.  For everyone's  
> convenience, I'll quote the remainder of the problems:
>
>
> Even when you aren't worried about multi-word synonyms, idf  
> differences still make index time synonyms a good idea. Consider the  
> following scenario:
>
>    *  An index with a "text" field, which at query time uses the  
> SynonymFilter with the synonym TV, Television and expand="true"
>    *  Many thousands of documents containing the term "text:TV"
>    *  A few hundred documents containing the term "text:Television"
>
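> A query for text:TV will expand into (text:TV text:Television) and  
> the lower docFreq for text:Television will give the documents that  
> match "Television" a much higher score than docs that match "TV"  
> comparably -- which may be somewhat counterintuitive to the client.  
> Index time expansion (or reduction) will result in the same idf for  
> all documents regardless of which term the original text contained.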
> Index time expansion (or reduction) will result in the same idf for  
> all documents regardless of which term the original text contained.
>
> ~ David Smiley
>
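To put rough numbers on the scenario quoted above, here is a small Python sketch. It assumes the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)); the corpus size and document frequencies are made up for illustration, and it assumes no document contains both terms.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Illustrative numbers only, not taken from any real index.
NUM_DOCS = 100_000
DF_TV = 20_000          # "many thousands" of docs containing TV
DF_TELEVISION = 300     # "a few hundred" docs containing Television

# Query-time expansion: each term keeps its own docFreq, so the rarer
# term gets the larger idf and its documents score higher.
idf_tv = classic_idf(DF_TV, NUM_DOCS)
idf_television = classic_idf(DF_TELEVISION, NUM_DOCS)

# Index-time expansion: every matching document is indexed under both
# terms, so both terms share one docFreq and therefore one idf.
df_merged = DF_TV + DF_TELEVISION
idf_merged = classic_idf(df_merged, NUM_DOCS)

print(f"query-time: idf(TV)={idf_tv:.2f}, idf(Television)={idf_television:.2f}")
print(f"index-time: idf(either term)={idf_merged:.2f}")
```

The gap between the two query-time values is the skew the wiki describes. With index-time expansion the gap goes away, but only for documents indexed after the synonym was added.
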
> On 12/30/08 4:33 PM, "Grant Ingersoll" <gsingers@apache.org> wrote:
>
>
>
> On Dec 30, 2008, at 11:05 AM, Alexander Ramos Jardim wrote:
>
>> Hey Grant,
>>
>> Thanks for the info!
>>
>> 2008/12/30 Grant Ingersoll <gsingers@apache.org>
>>
>>> I'd probably write a new TokenFilter that was aware of the reload
>>> policy
>>> (in a generic way) such that I didn't have to go through a whole
>>> core reload
>>> every time.  Are you just using them during query time or also  
>>> during
>>> indexing?
>>>
>>
>> I am using it at indexing time.
>
> I think that is a bit more problematic.  How do you deal with new
> documents having the new synonyms while old docs don't?
>
> Any particular reason you use syns at indexing and not search?  Not
> saying there aren't reasons to do it, just query side usually works
> better for this very reason.
>
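
For what it's worth, the reload-aware idea from earlier in the thread could look roughly like the Python sketch below. It is not Solr or Lucene code; the class name, the mtime-based reload policy, and the simplified parsing (comma-separated groups only, explicit "=>" mappings skipped) are all assumptions made for illustration.

```python
import os


class ReloadingSynonyms:
    """Hypothetical helper: re-reads a synonyms file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self._table = {}

    def _load(self):
        table = {}
        with open(self.path, encoding="utf-8") as fh:
            for line in fh:
                line = line.split("#", 1)[0].strip()  # drop comments
                if not line or "=>" in line:
                    continue  # this sketch skips explicit "a => b" mappings
                group = [t.strip().lower() for t in line.split(",") if t.strip()]
                for term in group:
                    table.setdefault(term, set()).update(group)
        return table

    def expand(self, token):
        """Returns the token plus its synonyms, reloading the file if it changed."""
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:
            self._table = self._load()
            self._mtime = mtime
        key = token.lower()
        return sorted(self._table.get(key, {key}))
```

At query time a check like this avoids a full core reload. At index time it does not help with documents that were already indexed under the old list, which is the problem I raised above.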

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ










