From: Dawn Zoë Raison
Date: Mon, 28 Nov 2011 20:31:48 +0000
To: java-user@lucene.apache.org
Message-ID: <4ED3EFB4.1050805@digitorial.co.uk>
Subject: Re: Analysers for newspaper pages...
In-Reply-To: <6C78E97C707B5B4C8CC61D44F8754586072F8C@SUEX10-mbx-03.ad.syr.edu>

Hi Steve,

On 28/11/2011 19:43, Steven A Rowe wrote:
> I assume that when you refer to "the impact of stop words," you're concerned about query-time performance? You should consider the possibility that performance without removing stop words is good enough that you won't have to take any steps to address the issue.

Not too fussed about query-time performance; certainly no one has complained so far. It's more the sheer number of junk pages we get when searching on phrases that contain stop words. It can lead to hundreds of thousands of results, and the pedants among our userbase insist on paging through the lot :-| I'd much rather contain the stop words using a *gram-based approach and offer a less populous but more accurate result set.

> That said, there are two filters in Solr 3.X[1] that would do the equivalent of what you have outlined: CommonGramsFilter and CommonGramsQueryFilter.

We use Lucene directly, but I'll take a look. Thanks.

> You can use these filters with a Lucene 3.X application by including the (same-versioned) solr-core jar as a dependency.
>
> Steve

--
Rgds.
*Dawn Raison*
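
For anyone following the thread, the common-grams idea mentioned above can be sketched roughly as below. This is a stand-alone illustration in plain Java, not the Solr CommonGramsFilter code: the class name `CommonGramsSketch`, the method `commonGrams`, the sample word list, and the "_" separator are all assumptions made for the example. It shows the index-time view, where single tokens are kept and a joined bigram is added whenever one word of an adjacent pair is a common word, so a phrase query can match on the rarer bigram instead of the stop word alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: NOT the Solr CommonGramsFilter implementation.
public class CommonGramsSketch {

    // Keep every single token, and add a joined bigram whenever either
    // word of an adjacent pair is in the common (stop) word set.
    public static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current);
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    // The "_" separator is an assumption made for this sketch.
                    out.add(current + "_" + next);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample common-word list, chosen only for the example.
        Set<String> common = new HashSet<String>(Arrays.asList("the", "of", "a"));
        List<String> tokens = Arrays.asList("the", "times", "of", "london");
        // prints [the, the_times, times, times_of, of, of_london, london]
        System.out.println(commonGrams(tokens, common));
    }
}
```

A query-side counterpart would presumably drop the single tokens that are already covered by a bigram, which is what keeps phrase matches on stop words from ballooning the result set.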