lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Dash Confusion in QueryParser - Bug? Feature?
Date Wed, 15 Oct 2003 00:01:13 GMT
On Saturday, October 11, 2003, at 09:44  AM, Michael Giles wrote:
> He is probably using the StandardAnalyzer.  I was about to write the  
> exact same email (but using Wal-Mart as an example on this page -  
> http://www.benchmark.com/cgi-bin/suid/~bcmlp/ 
> newsletter.cgi?mode=show&year=2003&date=2003-10-07). I index and  
> search with the same analyzer (Standard), but when I search for  
> Wal-Mart, I don't find a match.  I DO find a match if I search for  
> "Wal-Mart" or Wal Mart (no hyphen).  This seems like a bug.

Sorry for the delay.  I've been meaning to reply to this.

When you index using StandardAnalyzer, you are indexing it to two terms  
"wal" and "mart" (without the quotes).  QueryParser does its own  
(weird?) stuff to
strings passed to it.  Here's how it breaks down:

     String[] queries = {"Wal-Mart", "\"Wal-Mart\"", "Wal Mart"};
     for (int i = 0; i < queries.length; i++) {
       String query = queries[i];
       Query q = QueryParser.parse(query, "contents", new  
StandardAnalyzer());
       System.out.println(query + " = " + q);
     }

Wal-Mart = contents:wal -contents:mart
"Wal-Mart" = contents:"wal mart"
Wal Mart = contents:wal contents:mart

Notice all three are completely different queries.  The Wal-Mart one is  
excluding "mart" making it miss documents you expect.  The second one  
is a phrase query, which is basically what you're after.  The third one  
is matching any documents with "wal" or "mart" in them regardless of  
whether they are side-by-side.

Is this a bug?  Nah... just the nature of the QueryParser beast.  It  
would be a non-backwards-compatible change to change how QueryParser  
deals with a dash. That is the main issue here with it interpreting it  
as a NOT operator.  But it seems logical to me that it shouldn't do so  
when its mashed against a word like this and leave it to the analyzer  
to deal with.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message