From: John Byrne
Date: Mon, 03 Mar 2008 10:40:32 +0000
To: java-user@lucene.apache.org
Subject: bigram analysis

Hi,

I need to use stop-word bigrams, like the Nutch analyzer, as described in LIA 4.8 (Nutch Analysis). What I don't understand is why it keeps the original stop word intact. I can see a great advantage in being able to search for a combination of stop word + real word, but I don't see the point of keeping the stop word as a token on its own. Searches with just that word would be as pointless as ever.

Is the idea to allow searching on all stop words, even on their own, with the bigrams just an optimization that improves things 90% of the time? Or is it just a side effect of the bigram analyzer that it produces a token from the stop word, in which case it could be filtered out by a stop word filter afterwards, leaving only the bigram and the original (non-stop) word?

I'm sure either way would work for me - just wondering what is normally done, and if I'm missing something important here...

Thanks!
-John
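
P.S. To make the two options concrete, here is a rough sketch in plain Java. It does not use the Nutch or Lucene API; the class name, the stop list, and the "-" separator are all assumptions for illustration, and it ignores token positions entirely. With keepStopUnigrams set to true it emits the stop word plus the bigram (what the analyzer seems to do); with false it emits only the bigram and the non-stop words (the "filter afterwards" variant).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the Nutch analyzer.
final class StopWordBigrams {

    // Assumed stop list, chosen just for this example.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "to", "in");

    // Splits on whitespace and emits each word in order. Whenever a word or
    // its right-hand neighbour is a stop word, a bigram "w1-w2" is emitted
    // right after the word. Stop words on their own are kept only if
    // keepStopUnigrams is true.
    static List<String> analyze(String text, boolean keepStopUnigrams) {
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            boolean stop = STOP_WORDS.contains(words[i]);
            if (!stop || keepStopUnigrams) {
                out.add(words[i]);
            }
            boolean hasNext = i + 1 < words.length;
            if (hasNext && (stop || STOP_WORDS.contains(words[i + 1]))) {
                out.add(words[i] + "-" + words[i + 1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Option 1: keep the stop word as its own token.
        System.out.println(analyze("the quick brown fox", true));
        // prints [the, the-quick, quick, brown, fox]

        // Option 2: drop the lone stop word, keep only the bigram.
        System.out.println(analyze("the quick brown fox", false));
        // prints [the-quick, quick, brown, fox]
    }
}
```

In the second variant a query for "the" alone matches nothing, while a query for "the quick" can still be answered through the the-quick token.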