From: Ian Lea
Date: Mon, 28 Nov 2011 20:51:13 +0000
Subject: Re: Analysers for newspaper pages...
To: java-user@lucene.apache.org

You can easily use just the CommonGrams stuff from Solr in your pure
Lucene project. There are a couple of useful docs on stop words and
common grams et al at

http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

--
Ian.

On Mon, Nov 28, 2011 at 8:31 PM, Dawn Zoë Raison wrote:
> Hi Steve,
>
> On 28/11/2011 19:43, Steven A Rowe wrote:
>>
>> I assume that when you refer to "the impact of stop words," you're
>> concerned about query-time performance? You should consider the possibility
>> that performance without removing stop words is good enough that you won't
>> have to take any steps to address the issue.
>
> Not too fussed about query-time performance; certainly no-one has complained
> so far. It's more the sheer number of junk pages we get searching on phrases
> that contain stop words - it can lead to hundreds of thousands of results,
> and the pedants among our userbase insist on paging through the lot :-|
>
> I'd much rather contain the stop words using a *gram based approach and
> offer a less populous but more accurate resultset.
>
>>
>> That said, there are two filters in Solr 3.X[1] that would do the
>> equivalent of what you have outlined:
>> CommonGramsFilter
>> and
>> CommonGramsQueryFilter.
>
> We use lucene directly, but I'll take a look - Thanks.
>
>> You can use these filters with a Lucene 3.X application by including the
>> (same-versioned) solr-core jar as a dependency.
>>
>> Steve
>
> --
>
> Rgds.
> *Dawn Raison*
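
For anyone following the thread who wants to see what the common-grams idea amounts to, here is a minimal plain-Java sketch. It does not use the Lucene or Solr classes: the class name CommonGramsSketch, the two method names and the "_" separator are made up for illustration. It only mimics the behaviour described for CommonGramsFilter (index every word, plus a bigram wherever a word sits next to a common word) and CommonGramsQueryFilter (at query time, keep a single word only when no bigram covers it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: a hypothetical stand-in for the idea, not the Solr/Lucene API.
final class CommonGramsSketch {

    // True when tokens i and i+1 form a pair that contains a common word.
    private static boolean pairAt(List<String> tokens, Set<String> common, int i) {
        return i >= 0 && i + 1 < tokens.size()
                && (common.contains(tokens.get(i)) || common.contains(tokens.get(i + 1)));
    }

    // Index-time view: every unigram, plus a bigram for each pair with a common word.
    public static List<String> indexTokens(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (pairAt(tokens, common, i)) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Query-time view: emit bigrams where they exist; keep a unigram only
    // when neither the preceding nor the following bigram covers it.
    public static List<String> queryTokens(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (pairAt(tokens, common, i)) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
            } else if (!pairAt(tokens, common, i - 1)) {
                out.add(tokens.get(i));
            }
        }
        return out;
    }
}
```

With the common words "the" and "in", the phrase "the rain in spain falls mainly" is indexed as the, the_rain, rain, rain_in, in, in_spain, spain, falls, mainly, and queried as the_rain, rain_in, in_spain, falls, mainly. A phrase query then matches on the much rarer bigrams rather than on the stop words alone, which is the effect Dawn is after for cutting down junk results.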