lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: Stopwords in phrases
Date Tue, 21 Dec 2004 16:18:48 GMT
On Dec 21, 2004, at 10:41 AM, Ravi wrote:
>  I want to be able to use stopwords in exact phrase searches. I have
> looked at Nutch and used the same approach (replace common words with
> n-grams. Look at net.nutch.analysis.CommonGrams).
>   So if "to","be","or" and "not" are stop words, for the string "to be
> or not to be", the analyzer produces the following tokens
> [to-be, to-be-or, to-be-or-not, to-be-or-not-to, to-be-or-not-to-be,
> be-or, be-or-not, be-or-not-to, be-or-not-to-be, or-not, or-not-to,
> or-not-to-be, not-to, not-to-be, to-be]

You've gone a bit beyond what Nutch does.  It creates only bigrams,
whereas you've expanded each phrase into much longer n-grams as well.

Are you also using the position increment of 0 for the "gram" tokens 
like Nutch does?
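
To make the bigram and position-increment idea concrete, here is a minimal sketch in plain Java. It is not Nutch's CommonGrams code and does not use the Lucene API; the class BigramSketch, the Tok type, and the rule "pair each common word with the word that follows it" are assumptions chosen only to be consistent with the analysis output shown further down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: NOT Nutch's CommonGrams and NOT a Lucene TokenStream.
class BigramSketch {

    // A token with a type and a position increment relative to the previous token.
    static final class Tok {
        final String term;
        final String type;
        final int posInc;

        Tok(String term, String type, int posInc) {
            this.term = term;
            this.type = type;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return "[" + term + ":" + type + "]";
        }
    }

    // Emits every word as a <WORD> token (increment 1). When a word is common
    // and another word follows it, also emits a bigram "gram" token with
    // increment 0, so the gram shares the position of its first word.
    static List<Tok> analyze(String text, Set<String> common) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(new Tok(words[i], "<WORD>", 1));
            if (common.contains(words[i]) && i + 1 < words.length) {
                out.add(new Tok(words[i] + "-" + words[i + 1], "gram", 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        int pos = 0;
        for (Tok t : analyze("the quick brown fox", common)) {
            pos += t.posInc; // an increment of 0 keeps the gram at the same position
            System.out.println(pos + ": " + t);
        }
    }
}
```

With "to", "be", "or" and "not" as the common words, this sketch yields only two-word grams for "to be or not to be" (to-be, be-or, or-not, not-to, to-be), each stacked on the position of its first word.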

>   But I'm having a problem with the search.
>  when I do a search on "not to be" the analyzer is converting my search
> into
>   content:"not-to not-to-be to-be" because the analyzer produces the
> tokens "not-to","not-to-be","to-be"
>   I'm getting 0 results on this as there is no token "not-to not-to-be
> to-be" in the index.
>   I want just "not-to-be" from the analyzer during the search so when I
> search on "not to be" I will get the document which has "not-to-be" as
> a token.
>    How can I use the same analyzer to get different results in indexing
> and searching?

Nutch analyzes text differently when indexing than when parsing
queries:

      [java] 1: [the:<WORD>] [the-quick:gram]
      [java] 2: [quick:<WORD>]
      [java] 3: [brown:<WORD>]
      [java] 4: [fox:<WORD>]
      [java] query = (+url:"the quick brown"^4.0) (+anchor:"the quick brown"^2.0) (+content:"the-quick quick brown")

The first four lines show the analysis of "the quick brown fox".  The
last line is the resulting Lucene query for "the quick brown".  Notice
that only the "content" field is analyzed specially, and that in that
field, wherever a position has both a "gram" and a <WORD> token, only
the "gram" is used in the query.
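
The query-side rule described above can be sketched as follows. This is an illustration in plain Java under stated assumptions, not Nutch's query parser: the class QueryGramSketch and its phraseTerms method are hypothetical names, and grams are assumed to be formed from a common word plus the word that follows it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of query-time term selection; not Nutch's implementation.
class QueryGramSketch {

    // For each position in the phrase, keeps exactly one term: the bigram if
    // the word is common and has a successor, otherwise the plain word.
    static List<String> phraseTerms(String phrase, Set<String> common) {
        String[] words = phrase.toLowerCase().trim().split("\\s+");
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (common.contains(words[i]) && i + 1 < words.length) {
                terms.add(words[i] + "-" + words[i + 1]); // gram replaces the word
            } else {
                terms.add(words[i]);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> common = new HashSet<>(Arrays.asList("the"));
        List<String> terms = phraseTerms("the quick brown", common);
        // Build a phrase query string for the "content" field.
        System.out.println("content:\"" + String.join(" ", terms) + "\"");
    }
}
```

Because each position contributes exactly one term, the phrase lines up with an index in which the grams were stored with a position increment of 0. Under the same assumptions, with "to", "be", "or" and "not" as common words, the phrase "not to be" becomes "not-to to-be be".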

Does this help with your situation?

