lucene-java-user mailing list archives

From Victor Hadianto <vict...@nuix.com.au>
Subject Re: interesting phrase query issue
Date Thu, 17 Jul 2003 23:23:34 GMT
> One of these documents has the line "access, the
> manager".  When searching for the phrase "access manager", this document is
> being returned.  I understand why (at least I think I do): "the" is a stop
> word and the "," is removed by the tokenizer.  My question is, is there
> any way I can avoid having this returned in the results?

I don't think you can without reindexing the documents and changing 
QueryParser a bit. The reason is that even if you introduce a new 
tokenizer/analyzer, the original documents have already been indexed with 
those stop words removed.

You have to create an analyzer that doesn't drop your stop words and start the 
reindexing again.
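
To make the cause concrete, here is a minimal sketch in plain Java. It does not use Lucene; tokenize and containsPhrase are hypothetical stand-ins for an analyzer and a phrase match. It assumes, as the behaviour in the question suggests, that a removed stop word leaves no position gap, so "access" and "manager" end up adjacent in the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordPhraseDemo {

    // Hypothetical stand-in for an analyzer: lower-case the text, split on
    // anything that is not a letter or digit, and drop tokens in stopWords.
    // No position gap is recorded for a dropped token (an assumption).
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // True if the phrase tokens occur consecutively in the document tokens.
    static boolean containsPhrase(List<String> doc, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        Set<String> none = Collections.emptySet();
        String doc = "access, the manager";
        String phrase = "access manager";

        // Stop words dropped: the document is reduced to [access, manager].
        System.out.println(containsPhrase(tokenize(doc, stop), tokenize(phrase, stop)));
        // Stop words kept: "the" sits between the two terms.
        System.out.println(containsPhrase(tokenize(doc, none), tokenize(phrase, none)));
    }
}
```

With stop words dropped the phrase matches; with stop words kept it does not, which is why the documents have to be reindexed with the new analyzer before the change has any effect.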

However, you must be careful when using your custom analyser for query 
parsing, because sometimes you may want to drop the stop words in a non-quoted 
query, so 

hello and world    ---> +hello +world

but

"hello and world"  ---> +"hello and world"

One solution I can think of is to pass two analysers to QueryParser: 
one is the "standard" analyser and the other is the "phrase query" 
analyser. In QueryParser.jj, around this area, do something like this:

     | term=<QUOTED>
       [ slop=<SLOP> ]
       [ <CARAT> boost=<NUMBER> ]
       {
         if (phraseAnalyzer != null) {
           // use the custom phrase query analyser that doesn't drop stop words
         } else {
           // otherwise fall back to the normal analyzer
         }
         // ... rest of the existing action ...
       }

This may work; as a matter of fact, I think it should.
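
As a rough illustration of the two-analyser idea, here is a self-contained sketch in plain Java. It is not QueryParser code: analyze and rewrite are hypothetical helpers, and the stop-word list is made up for the example. Quoted input is tokenized without dropping stop words, and everything else goes through the stop-word-dropping path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TwoAnalyzerSketch {

    // Illustrative stop-word list, not Lucene's default set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("and", "the"));

    // Hypothetical stand-in for an analyzer: lower-case, split on
    // non-alphanumerics, and optionally drop stop words.
    static List<String> analyze(String text, boolean dropStopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^a-z0-9]+")) {
            if (raw.isEmpty()) continue;
            if (dropStopWords && STOP_WORDS.contains(raw)) continue;
            tokens.add(raw);
        }
        return tokens;
    }

    // Quoted input uses the "phrase" path (keeps stop words);
    // unquoted input uses the "standard" path (drops them).
    static String rewrite(String query) {
        String q = query.trim();
        boolean quoted = q.length() >= 2 && q.startsWith("\"") && q.endsWith("\"");
        if (quoted) {
            List<String> tokens = analyze(q.substring(1, q.length() - 1), false);
            return "+\"" + String.join(" ", tokens) + "\"";
        }
        StringBuilder out = new StringBuilder();
        for (String token : analyze(q, true)) {
            if (out.length() > 0) out.append(' ');
            out.append('+').append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("hello and world"));      // +hello +world
        System.out.println(rewrite("\"hello and world\""));  // +"hello and world"
    }
}
```

In the real parser the same decision would be made inside the grammar action for quoted terms rather than by inspecting the raw string, but the dispatch is the same.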

HTH

victor

