From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: bigram analysis
Date: Mon, 3 Mar 2008 10:35:17 -0500

On Mar 3, 2008, at 5:40 AM, John Byrne wrote:

> Hi,
>
> I need to use stop-word bigrams, like the Nutch
> analyzer, as described in LIA 4.8 (Nutch Analysis). What I don't
> understand is, why does it keep the original stop word intact? I can
> see great advantage to being able to search for a combination of stop
> word + real word, but I don't see the point of keeping the stop word
> as a token on its own. Searches with just that word would be as
> pointless as ever.

I don't know, Google allows for stopword searches. Just try "the" as a
query (although it is kind of funny what the results are: "The Onion"
is the top use of "the" in the world? And it is even more curious that
people actually bought ads for the word "the", but that is a
digression). I don't exactly know Nutch's analyzer, but it could be
that keeping the stop word helps with phrases. I suppose one would have
to look at Nutch's query parser as well to get a sense of how the
tokens are used.

> Is the idea to allow searching on all stop words, even on their own,
> and the bigrams are just an optimization that will improve things
> 90% of the time? Or is it just a side effect of the bigram analyzer
> that it produces a token from the stop word, and therefore it could
> just be filtered out by a stop word filter afterwards, leaving only
> the bigram and the original (non-stop) word?

Not sure, you might want to ask on Nutch. From a strict language
standpoint, the notion of a stopword is a bit dubious in my mind. If
the word really has no meaning, then why does the language have it to
begin with? In a search context, stopwords were treated as being of
minimal use in the early days, mostly because of space and memory
considerations. Nowadays, as we get more sophisticated in our search
capabilities, I think they can be useful for doing better phrase
matching, etc., as well as more advanced NLP search. Now it seems like
the general response is: disk is cheap, why throw away information?

> I'm sure either way would work for me - just wondering what is
> normally done, and if I'm missing something important here...

"It depends".
I think most Lucene users just go with the generally held assumption
that you should remove stopwords, but I am not sure. At a minimum, I
think the answer is that it depends on the application. If you want to
do what you describe above, I would keep them. In the end, the IDF
factor should handle their commonality quite nicely, so that any use of
them as a general term (and not as part of a phrase) will not affect
relevance all that much.

-Grant

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
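
P.S. To make the IDF remark above a bit more concrete, here is a rough
sketch in Python (not Lucene code). It assumes the classic formula
idf = 1 + ln(numDocs / (docFreq + 1)), which is what Lucene's
DefaultSimilarity uses as far as I know. The collection size, the
document frequencies, and the "the_onion" bigram token name are all
made up for illustration; they are not taken from Nutch or from any
real index.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """Classic inverse document frequency: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical collection size (assumption, for illustration only).
NUM_DOCS = 1_000_000

# Hypothetical document frequencies: a stop word, a content word, and a
# stop-word bigram token joined with "_" (the token name is made up).
doc_freqs = {
    "the": 950_000,
    "onion": 1_200,
    "the_onion": 300,
}

# Compute one weight per term from its document frequency.
weights = {term: idf(df, NUM_DOCS) for term, df in doc_freqs.items()}

# Print terms from lowest to highest weight.
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term:10s} df={doc_freqs[term]:>7d} idf={weight:.2f}")
```

With numbers like these, the bare stop word ends up with a weight close
to 1, while the content word and the bigram score several times higher.
So a stop word left in the index as a plain term contributes little to
the score, and the bigram can still carry the phrase match.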