Subject: RE: Newbie Phrase Query question
Date: Thu, 5 Feb 2004 18:19:22 -0700
From: "Scott Smith"
To: "Lucene Users List"

Actually, I found your "QueryParser Rules" article the most useful. It explained a number of things that I had puzzled about. Query.toString() helped also.

So, obvious in hindsight, an exact phrase match still goes through the tokenizer. If there are stop words, or you're stemming, etc., you need to tokenize the phrase before trying to get an exact match. Clearly, that has implications for what "exact phrase match" means.
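To make the point about phrases passing through the tokenizer concrete, here is a minimal sketch in plain Java. It is not Lucene code: the analyze() helper, its stop-word list, and the class name are hypothetical stand-ins for what a stop-word analyzer roughly does (lowercase, split on anything that is not a letter or digit, drop stop words). It only illustrates that a quoted phrase is compared token by token, not character by character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PhraseAnalysisSketch {
    // Hypothetical stop-word list; a real analyzer ships its own.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "a", "the"));

    // Stand-in for an analyzer: lowercase, split on anything that is not a
    // letter or digit (so hyphens and quotes vanish), then drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // A phrase "matches" when its analyzed tokens appear consecutively,
    // in order, among the analyzed tokens of the document.
    static boolean phraseMatches(String document, String phrase) {
        List<String> doc = analyze(document);
        List<String> query = analyze(phrase);
        return !query.isEmpty() && Collections.indexOfSubList(doc, query) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is Bill-Fred"));                        // [bill, fred]
        System.out.println(phraseMatches("this is Bill-Fred", "\"Bill Fred\"")); // true
        System.out.println(phraseMatches("this is Bill-Fred", "\"Fred Bill\"")); // false
    }
}
```

Under these assumptions "Bill-Fred" and "Bill Fred" are the same phrase once analyzed, which is the sense in which "exact" depends on the analyzer.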
The toString() told me that the quotes are handled by the QueryParser. The WebLucene CJK tokenizer works just fine with it and I didn't make any changes to it. The "bad" news is that after going through all of this, the code just started to work as expected. I'm not sure what I did to fix it.

There is a minor issue I found that I think works as documented, but I wonder why it's that way. If you enter a search string that's a hyphenated word such as "fred-bill" (w/o the quotes), the QueryParser generates a query to find all documents with fred but w/o bill. I believe this is expected behavior based on the javadocs. The effect of this is that a hyphenated word gives unexpected results unless surrounded by quotes. Perhaps the syntax should have been "fred -bill" (space before the hyphen required) to indicate that you didn't want bill and that it's not a hyphenated word. Seems a tad more general.

It's an issue for me because my application deals with hyphenated words a lot and I don't think my users would ever understand when quotes should be used and when they should not (most of them won't figure out how to use the "not" syntax). I can solve it by requiring the user to enter a space before the hyphen if they mean "not" and then have the search code automatically add the quotes for hyphenated words. It's just a little painful.

Just a thought for 1.4. ;-)

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
Sent: Tuesday, February 03, 2004 8:26 PM
To: Lucene Users List
Subject: Re: Newbie Phrase Query question

The best suggestion I have is to look at the code in my first java.net article (Intro Lucene) and borrow the Analyzer utility code to see what happens to a sample string as it is analyzed. Then pass that same string to QueryParser (along with the same analyzer) and see what the Query.toString() returns. This should shed light on the issue more clearly.
Erik

On Feb 3, 2004, at 10:01 PM, Scott Smith wrote:

> I'm having problems searching for an exact match with a phrase.
> Essentially, I think my problem is that the tokenizer is tossing the
> double quotes around the phrase, tokenizing each word and so I end up
> with the document hit I want plus several more I don't (the latter
> having some of the words, but not exact matches). Here are the
> specifics.
>
> First, I'm using the CJKTokenizer from WebLucene which I believe is a
> modified version of the stopword tokenizer enhanced to handle Asian
> characters (that's according to the header; I don't think the Asian
> characters have anything to do with my problem).
>
> The documents I need to search, for reasons related to the
> application, often end up with hyphenated words in critical places.
> For example, the original text to be indexed might be something like
> "this is Bill-Fred".
>
> When this is tokenized initially, I end up with two tokens "bill" and
> "fred" (the tokenizer converts to lower case; "this" and "is" are
> removed as stop words; the hyphen is removed by the tokenizer). So
> far so good.
>
> I pass the phrase I want an exact match on to a QueryParser in quotes
> (so "Bill-Fred" is the search string; quotes included). I watched the
> output of the tokenizer from the query parser and it is clearly
> tossing the double quotes and tokenizing each word separately. It
> passes the words "bill" and "fred" as separate entities back to the
> QueryParser. Looking at the tokenizer code, I understand why.
> Obviously, that's why I end up with documents that contain the words
> even if they are not exact matches.
>
> Here's the question. I can modify the CJKTokenizer so that when it
> sees "Fred-Bill" it creates a single token that looks like "fred bill".
> Would this now work? Is this the right thing to do?
> I realize this means that I'd hit on "Fred-Bill" and "Fred Bill",
> but I can probably live with that.
>
> However, it also seems like I now have a problem if the original text
> contains a quotation from someone that happens to be part of the
> document (i.e., the original text has double quotes in it). It seems
> like I need to ignore quotes for the initial index, but use them to
> build phrases when I'm tokenizing a search string in the QueryParser.
> Do I need two tokenizers?
>
> Does any of this make any sense? I'm not quite sure what the
> QueryParser wants to see to properly do a phrase match. Is
> QueryParser the wrong thing to be using here? Suggestions or
> comments?
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
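
The workaround mentioned in the reply at the top of this message (treat a hyphen as "not" only when a space precedes it, and have the search code add quotes around hyphenated words) could be sketched roughly as below. This is plain Java string preprocessing, not part of Lucene; the class and method names are hypothetical, and it assumes the returned string is then handed to QueryParser unchanged. Field prefixes such as title:fred-bill and other query syntax are deliberately not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenQuoter {
    // Either an already-quoted phrase or a run of non-whitespace characters.
    private static final Pattern TERM = Pattern.compile("\"[^\"]*\"|\\S+");

    // Optional leading + or - operator, then words joined by interior hyphens.
    private static final Pattern HYPHENATED =
            Pattern.compile("([+-]?)([\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)+)");

    // Wraps unquoted hyphenated words in double quotes so they are treated
    // as phrases; terms such as "-bill" (hyphen at the start) are left alone.
    static String quoteHyphenatedWords(String userInput) {
        StringBuilder out = new StringBuilder();
        Matcher terms = TERM.matcher(userInput);
        while (terms.find()) {
            String term = terms.group();
            Matcher m = HYPHENATED.matcher(term);
            if (m.matches()) {
                // Keep any leading operator outside the quotes.
                term = m.group(1) + "\"" + m.group(2) + "\"";
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteHyphenatedWords("fred-bill"));            // "fred-bill"
        System.out.println(quoteHyphenatedWords("fred -bill"));           // fred -bill
        System.out.println(quoteHyphenatedWords("\"bill-fred\" report")); // "bill-fred" report
    }
}
```

With this in front of the parser, users would type hyphenated words as they appear in the documents and would only need the space-then-hyphen form when they really mean exclusion.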