From: Robert Muir
Date: Tue, 2 Nov 2010 22:37:24 -0400
Subject: Re: solr example synonyms file
To: dev@lucene.apache.org
In-Reply-To: <4CD0C41F.7030808@gmail.com>

On Tue, Nov 2, 2010 at 10:08 PM, Mark Miller wrote:
> On 11/2/10 9:57 PM, Robert Muir wrote:
>> On Tue, Nov 2, 2010 at 9:50 PM, Lance Norskog wrote:
>>> I just used One Fish Two Fish Red Fish Blue Fish but I think that has
>>> license problems.
>>> Also, the sample should include multi-word left-hand values because they work.
>>>
>>
>> I don't think we should do this... i suggest only using single word
>> synonyms in the example for performance reasons!
>>
>> it doesnt really matter how rare they are: even "the quick brown fox"
>> => something is terrible, because its going to invoke SynonymFilter's
>> "slow path" for every single instance of "the".
>>
>> i know some insist its just an "example" and not defaults, but this
>> isn't true, else why did this email thread even come up? its used as
>> "defaults", and we should keep it very fast.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
> We have discussed this before - there is always nasty compromise when it
> comes to example vs default. Good for one is often not good for the
> other. But like it or not, our example pretty much is the defacto
> default as you say.
>
> As a reminder, in the past we have talked about doing both an example
> with all the bells and whistles, and a performance config that you
> should really start from. But we have not gotten there obviously ;) Adds
> some dev/maint overhead as well.
>
> No real points, just chiming in with that.
>

Another idea I started for textTight; happy to try and wrap it up /
contribute if there is interest.

But this is really only applicable to 'textTight', since its stemming
etc. isn't insane like 'text'.

I generated the following with a mix of automatic and manual methods
from 2+2lemma.txt (http://wordlist.sourceforge.net/ public domain/BSD).

I'm sure other people must suffer through similar tuning...
here are just some examples.

Sample synonyms for textTight, built from only variant spellings
(mostly British <-> US):

barbeque => barbecue
blonde => blond
conventionalising => conventionalizing
convertor => converter
conveyers => conveyors
...

Sample stemmer corrections for textTight, the plural-only stemmer (via
StemmerOverrideFilter):

errata erratum
news news
radii radius
cavalrymen cavalryman
...
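
To make the two samples above concrete, here is a rough sketch in plain
Python (not Solr or Lucene code) of how such files could be read and
sanity-checked. Everything in it is an assumption for illustration: the
helper names are hypothetical, the "stemmer" is a crude stand-in for the
plural-only stemmer, and applying synonyms before the overrides is a
guess at the analyzer order, not a statement about the real textTight
chain. The multi_word_rules check reflects the concern raised earlier in
the thread about multi-word left-hand sides.

```python
# Illustrative only: mimics the two file formats, not Lucene's filters.

SYNONYMS = """\
barbeque => barbecue
blonde => blond
conventionalising => conventionalizing
convertor => converter
conveyers => conveyors
"""

OVERRIDES = """\
errata erratum
news news
radii radius
cavalrymen cavalryman
"""


def parse_synonyms(text):
    """Parse 'lhs => rhs' lines into a dict; skip blanks, '#' comments,
    and any line without an explicit '=>' mapping."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        lhs, rhs = (part.strip() for part in line.split("=>", 1))
        rules[lhs] = rhs
    return rules


def multi_word_rules(rules):
    """Return left-hand sides containing more than one word."""
    return [lhs for lhs in rules if len(lhs.split()) > 1]


def parse_overrides(text):
    """Parse 'word stem' pairs separated by whitespace."""
    overrides = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            overrides[parts[0]] = parts[1]
    return overrides


def naive_plural_stem(token):
    """Crude stand-in for a plural-only stemmer: drop one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(tokens, synonyms, overrides):
    """Map each token through synonyms, then an override or the stemmer."""
    out = []
    for tok in tokens:
        tok = synonyms.get(tok, tok)
        if tok in overrides:
            out.append(overrides[tok])  # override wins; stemmer is skipped
        else:
            out.append(naive_plural_stem(tok))
    return out


if __name__ == "__main__":
    syn = parse_synonyms(SYNONYMS)
    ovr = parse_overrides(OVERRIDES)
    print(multi_word_rules(syn))
    print(analyze(["conveyers", "news", "radii", "blonde"], syn, ovr))
```

Without the "news news" override, the stand-in stemmer would turn "news"
into "new", which is the kind of mistake the corrections list is meant to
prevent.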