From: Erik Hatcher
To: "Lucene Users List"
Subject: Re: Stopwords in phrases
Date: Tue, 21 Dec 2004 11:18:48 -0500
In-Reply-To: <4568BE33B520DD43B7B24DE0B609B215523792@htexc.hq.htinc.com>

On Dec 21, 2004, at 10:41 AM, Ravi wrote:

> I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams. Look at net.nutch.analysis.CommonGrams).
> So if "to","be","or" and "not" are stop words, for the string "to be
> or not to be", the analyzer produces the following tokens
>
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch is using. It creates bigrams, where
you've expanded it to many more than that. Are you also using the
position increment of 0 for the "gram" tokens like Nutch does?

> But I'm having a problem with the search.
> when I do a search on "not to be" the analyzer is converting my search
> into content:"not-to not-to-be to-be" because the analyzer produces the
> tokens "not-to","not-to-be","to-be"
>
> I'm getting 0 results on this as there is no token "not-to not-to-be
> to-be" in the index.
>
> I want just "not-to-be" from the analyzer during the search so when I
> search on "not to be" I will get the document which has "not-to-be" as
> a token.
>
> How can I use the same analyzer to get different results in indexing
> and searching?

Nutch does some different stuff between indexing and parsing queries...

    [java] 1: [the:] [the-quick:gram]
    [java] 2: [quick:]
    [java] 3: [brown:]
    [java] 4: [fox:]
    [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox". The
last line is the resultant Lucene query for "the quick brown".
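To make the index-side behaviour in that dump concrete, here is a rough sketch in plain Java. It is not Nutch's CommonGrams code and uses no Lucene classes. The class name `CommonGramsSketch`, the `Token` holder, and the rule "emit a bigram whenever a common word is followed by another word" are assumptions chosen only to reproduce the four analysis lines shown above. The point it illustrates is that the gram token carries a position increment of 0, so it sits at the same position as the plain word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the real net.nutch.analysis.CommonGrams.
final class CommonGramsSketch {

    static final class Token {
        final String term;
        final boolean gram;           // true for a generated bigram
        final int positionIncrement;  // 0 = same position as the previous token

        Token(String term, boolean gram, int positionIncrement) {
            this.term = term;
            this.gram = gram;
            this.positionIncrement = positionIncrement;
        }
    }

    // Assumed rule: a common word followed by another word also yields a bigram,
    // stacked on the common word's position (increment 0).
    static List<Token> analyze(String text, Set<String> common) {
        List<Token> out = new ArrayList<>();
        String trimmed = text.toLowerCase().trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            out.add(new Token(words[i], false, 1));
            if (common.contains(words[i]) && i + 1 < words.length) {
                out.add(new Token(words[i] + "-" + words[i + 1], true, 0));
            }
        }
        return out;
    }

    // Groups tokens by position, in the "[term:type]" style of the dump above.
    static List<String> dump(List<Token> tokens) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = null;
        int position = 0;
        for (Token t : tokens) {
            if (t.positionIncrement > 0 || line == null) {
                if (line != null) {
                    lines.add(line.toString());
                }
                position += Math.max(t.positionIncrement, 1);
                line = new StringBuilder(position + ":");
            }
            line.append(" [").append(t.term).append(':')
                .append(t.gram ? "gram" : "").append(']');
        }
        if (line != null) {
            lines.add(line.toString());
        }
        return lines;
    }
}
```

Calling `dump(analyze("the quick brown fox", common))` with "the" as the only common word gives one line per position, with "the" and "the-quick" sharing position 1. Whether the real filter also builds grams when the second word is the common one is something to check in the Nutch source.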
Notice that only the "content" field gets analyzed specially. Also, in
that field, wherever a position has a "gram" token, only the gram is
used in the phrase and the plain token at that position is dropped.

Does this help with your situation?

	Erik
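For the query side, the rule described above (use the gram at a position when there is one, otherwise the plain word) could look roughly like this. Again this is only a sketch under assumptions. `GramPhraseSketch`, `phraseTerms` and `contentClause` are made-up names, the bigram rule is the same simplified one (common word followed by another word), and the real Nutch query optimization may treat the end of a phrase differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Nutch's query-parsing code.
final class GramPhraseSketch {

    // One term per phrase position: the bigram if the word is common and has
    // a following word, otherwise the word itself.
    static List<String> phraseTerms(String phrase, Set<String> common) {
        List<String> terms = new ArrayList<>();
        String trimmed = phrase.toLowerCase().trim();
        if (trimmed.isEmpty()) {
            return terms;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            boolean hasGram = common.contains(words[i]) && i + 1 < words.length;
            terms.add(hasGram ? words[i] + "-" + words[i + 1] : words[i]);
        }
        return terms;
    }

    // Renders the terms as a phrase clause on the "content" field.
    static String contentClause(String phrase, Set<String> common) {
        return "content:\"" + String.join(" ", phraseTerms(phrase, common)) + "\"";
    }
}
```

With "the" as the only common word, `contentClause("the quick brown", common)` yields `content:"the-quick quick brown"`, the same clause as in the query above. With "to", "be", "or" and "not" all common, "not to be" becomes `content:"not-to to-be be"` under this simplified rule. Each term occupies its own position, so the phrase can line up with bigrams indexed at increment 0.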