From: poeta simbolista
To: general@lucene.apache.org
Date: Mon, 26 Oct 2009 07:29:34 -0700 (PDT)
Subject: Solution for unwanted ngrams

Hi,

Imagine you have a text: "Apartment not for sale", and another: "Sale! Apartment for rent".

Search query: "Apartment for sale".
The above search query will return both of the texts above with high scores. I would like to know how I could tackle this issue better with Lucene.

My ideas:

- Recognise certain phrases such as "not for sale" as different from "for sale". That is, invalidate "for sale" if it is preceded by "not". How could I do this?
- Recognise "sale" only if it is preceded by "for", since the second meaning (a bargain vs. something for sale) is tricky.
- Rewrite "sale" as "for sale", grouped in the query (producing "-sale +(for sale)"). Wouldn't that query exclude the documents containing the "sale" term?

How else could I achieve this with Lucene? Should this be tackled only by preprocessing the data before it reaches the index? Ideally I would like to preserve the original text in the index.

Thanks a lot in advance,
Diego
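To make the preprocessing idea concrete, here is a minimal sketch in plain Python rather than Lucene itself. It assumes a small hand-written phrase table (PHRASES, hypothetical) and merges "not for sale" and "for sale" into single tokens before matching, so that a bare "sale" no longer matches "for sale". The names analyze, matches and PHRASES are made up for illustration. In Lucene the same merging would presumably live in the analysis chain, with the original text kept in a stored field.

```python
import re

# Hypothetical phrase table. Longer phrases come first so that
# "not for sale" is merged before "for sale" gets a chance to match.
PHRASES = [
    ("not for sale", "not_for_sale"),
    ("for sale", "for_sale"),
]


def analyze(text):
    """Lowercase the text, split it into words, and merge known phrases
    into single tokens. The original text is not modified."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        for phrase, merged in PHRASES:
            parts = phrase.split()
            if words[i:i + len(parts)] == parts:
                tokens.append(merged)
                i += len(parts)
                break
        else:
            # No phrase starts at this position: keep the single word.
            tokens.append(words[i])
            i += 1
    return tokens


def matches(query, document):
    """True if every token of the analyzed query occurs in the analyzed document."""
    return set(analyze(query)) <= set(analyze(document))


if __name__ == "__main__":
    docs = [
        "Apartment not for sale",
        "Sale! Apartment for rent",
        "Nice apartment for sale",
    ]
    for doc in docs:
        # The original text stays available next to its tokens.
        print(doc, analyze(doc), matches("Apartment for sale", doc))
```

Because the query goes through the same analyze step as the documents, "Apartment for sale" becomes the tokens "apartment" and "for_sale", which neither of the two problem texts contains. The price is that the phrase table has to be maintained by hand.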