lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: solr example synonyms file
Date Wed, 03 Nov 2010 02:37:24 GMT
On Tue, Nov 2, 2010 at 10:08 PM, Mark Miller <markrmiller@gmail.com> wrote:
> On 11/2/10 9:57 PM, Robert Muir wrote:
>> On Tue, Nov 2, 2010 at 9:50 PM, Lance Norskog <goksron@gmail.com> wrote:
>>> I just used One Fish Two Fish Red Fish Blue Fish but I think that has
>>> license problems.
>>> Also, the sample should include multi-word left-hand values because they work.
>>>
>>
>> I don't think we should do this... I suggest only using single-word
>> synonyms in the example, for performance reasons!
>>
>> It doesn't really matter how rare they are: even "the quick brown fox"
>> => something is terrible, because it's going to invoke SynonymFilter's
>> "slow path" for every single instance of "the".
>>
>> I know some insist it's just an "example" and not defaults, but this
>> isn't true, else why did this email thread even come up? It's used as
>> "defaults", and we should keep it very fast.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
> We have discussed this before - there is always a nasty compromise
> when it comes to example vs default. Good for one is often not good
> for the other. But like it or not, our example pretty much is the
> de facto default, as you say.
>
> As a reminder, in the past we have talked about doing both an example
> with all the bells and whistles, and a performance config that you
> should really start from. But we have not gotten there obviously ;) Adds
> some dev/maint overhead as well.
>
> No real points, just chiming in with that.
>
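To make the "slow path" concern quoted above concrete, here is a toy
sketch in Python. It is not SynonymFilter's actual code: the matcher,
the name apply_synonyms and the lookahead counter are all made up for
illustration. It only counts how often a token forces a look at the
tokens that follow it.

```python
def apply_synonyms(tokens, rules):
    """Toy left-to-right matcher (hypothetical, not Lucene's algorithm).

    rules maps a tuple of left-hand tokens to a list of replacement tokens.
    Returns (output_tokens, lookahead_count).
    """
    # Group rules by their first token so the common case is one dict lookup.
    by_first = {}
    for lhs, rhs in rules.items():
        by_first.setdefault(lhs[0], []).append((lhs, rhs))
    # Try longer phrases before shorter ones.
    for candidates in by_first.values():
        candidates.sort(key=lambda rule: len(rule[0]), reverse=True)

    out, lookaheads, i = [], 0, 0
    while i < len(tokens):
        matched = False
        for lhs, rhs in by_first.get(tokens[i], []):
            if len(lhs) > 1:
                lookaheads += 1  # a multi-word rule must inspect later tokens
            if tuple(tokens[i:i + len(lhs)]) == lhs:
                out.extend(rhs)
                i += len(lhs)
                matched = True
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out, lookaheads


tokens = "the cat saw the dog and the quick brown fox".split()

# One multi-word rule starting with "the": every "the" triggers a lookahead.
multi = {("the", "quick", "brown", "fox"): ["something"]}
print(apply_synonyms(tokens, multi))

# Single-word rules only: no lookahead is ever needed.
single = {("barbeque",): ["barbecue"]}
print(apply_synonyms(tokens, single))
```

In this toy model the multi-word rule costs one lookahead per "the" in
the text, matched or not, while the single-word rule costs none.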

Another idea I started on for textTight; happy to try to wrap it up and
contribute if there is interest.
This is really only applicable to 'textTight', since its stemming
etc. isn't insane like that of 'text'.
I generated the following with a mix of automatic and manual methods
from 2+2lemma.txt (http://wordlist.sourceforge.net/, public domain/BSD).
I'm sure other people must suffer through similar tuning...
Here are just some examples.

Sample synonyms for textTight, built only from variant spellings
(mostly British <-> US):
barbeque => barbecue
blonde => blond
conventionalising => conventionalizing
convertor => converter
conveyers => conveyors
...
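
A minimal sketch of how one-word "lhs => rhs" lines like these could be
read and applied to single tokens. This is plain Python with
hypothetical function names, not Solr's own parser; comma-separated and
multi-word rules are deliberately not handled.

```python
def parse_mappings(text):
    """Parse one-word 'lhs => rhs' lines into a dict.

    Blank lines, '#' comments and lines without '=>' are skipped.
    """
    mappings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        lhs, rhs = (part.strip() for part in line.split("=>", 1))
        mappings[lhs] = rhs
    return mappings


def normalize(tokens, mappings):
    """Replace each token that has a mapping; leave the rest untouched."""
    return [mappings.get(tok, tok) for tok in tokens]


SAMPLE = """
barbeque => barbecue
blonde => blond
conventionalising => conventionalizing
convertor => converter
conveyers => conveyors
"""

mappings = parse_mappings(SAMPLE)
print(normalize("blonde convertor on the barbeque".split(), mappings))
```

Because every left-hand side is a single token, applying the mappings
is one dictionary lookup per token.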

Sample stemmer corrections for textTight's plural-only stemmer (via
StemmerOverrideFilter):
errata      erratum
news        news
radii       radius
cavalrymen  cavalryman
...
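
A rough sketch of the override idea: consult the corrections table
first and fall back to the stemmer only when there is no entry. The
naive_plural_stem function below is a stand-in invented for this
example, not the stemmer textTight actually uses, and the code does not
call StemmerOverrideFilter itself.

```python
def naive_plural_stem(word):
    """Stand-in plural stripper: drop a trailing 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


# Corrections taken from the sample list above.
OVERRIDES = {
    "errata": "erratum",
    "news": "news",
    "radii": "radius",
    "cavalrymen": "cavalryman",
}


def stem(word, overrides=OVERRIDES):
    """Return the override if one exists, otherwise the naive stem."""
    if word in overrides:
        return overrides[word]
    return naive_plural_stem(word)


for w in ["news", "radii", "conveyors", "cavalrymen"]:
    print(w, "->", stem(w))
```

An identity entry such as "news news" is what stops the fallback from
turning "news" into "new".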


