Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
Message-Id: <F5AE183A-7F52-49F3-8AA9-D014E3A9E53F@apache.org>
From: Grant Ingersoll <gsingers@apache.org>
To: java-user@lucene.apache.org
In-Reply-To: <47CBD5A0.7000907@propylon.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v919.2)
Subject: Re: bigram analysis
Date: Mon, 3 Mar 2008 10:35:17 -0500
References: <47CBD5A0.7000907@propylon.com>


On Mar 3, 2008, at 5:40 AM, John Byrne wrote:

> Hi,
>
> I need to use stop-word bigrams, liike the Nutch analyzer, as  
> described in LIA 4.8 (Nutch Analysis). What I don't understand is,  
> why does it keep the original stop word intact? I can see great  
> advantage to being able to search for a combination of stop word +  
> real word, but I don't see the point of keeping the stop word as a  
> token on it's own. Searches with just that word would be as  
> pointless as ever.

I don't know, Google allows for stopword searches.  Just try "the" as  
a query (although it is kind of funny what the results are: "The  
Onion" is the top use of the in the world?  And it is even more  
curious that people actually bought ads for the word "the", but that  
is a digression).  I don't exactly know Nutch's analyzer, but it could  
be that it helps with phrases.  I suppose one would have to look at  
Nutch's query parser as well to get a sense of how they are used.

>
>
> Is the idea to allow searching on all stop words, even on their own,  
> and the bigrams are just an optimization that will improve things  
> 90% of the time? Or is it just a side effect of the bigram analyzer  
> that it produces a token from the stop word, and therefore it could  
> just be filtered out by a stop word filter afterwards, leaving only  
> the bigram and the original (non-stop) word?

Not sure, you might want to ask on Nutch.  From a strict language  
standpoint, the notion of a stopword in my mind is a bit dubious.  If  
the word really has no meaning, then why does the language have it to  
begin with?  In a search context, it has been treated as of minimal  
use in the early days mostly because of space and memory  
considerations.  Now a days, as we get more sophisticated in our  
search capabilities, I think it can be useful for doing better phrase  
matching, etc. as well as more advanced NLP search.  Now it seems like  
the general response is disk is cheap, why throw away information?


>
>
> I'm sure either way would work fr me - just wondering what is  
> normally done, and if I'm missing something important here...

"It depends".  I think most Lucene users just use the generally held  
assumption that you should remove stopwords, but I am not sure.  At a  
minimum, I think the answer is it depends on the application.  If you  
want to do what you describe above, I would keep them.  In the end,  
the IDF factor should handle the commonality of them quite nicely so  
as any use of them as a general term (and not part of a phrase) will  
not affect relevance all that much.

-Grant


--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org