Subject: RE: Newbie Phrase Query question
Date: Thu, 5 Feb 2004 18:19:22 -0700
From: "Scott Smith" <ssmith@mainstreamdata.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>

Actually, I found your "QueryParser Rules" article the most useful.  It
explained a number of things that I had puzzled about.  Query.toString()
helped also.

So, obvious in hindsight: an exact phrase match still goes through the
tokenizer.  If there are stop words, or you're stemming, etc., the phrase
has to be tokenized the same way before you can get an exact match.
Clearly, that has implications for what "exact phrase match" means.
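
To illustrate the point, here is a minimal sketch using a toy analyzer.
It is not Lucene's API and not the WebLucene CJKTokenizer; the class name
ToyAnalyzer, its stop-word list, and its splitting rule are assumptions
made up for this example.  It only shows that the indexed text and the
quoted query phrase reduce to the same tokens once both pass through the
same analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an analyzer (hypothetical; not a Lucene class).
class ToyAnalyzer {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "a", "the"));

    // Lower-case the text, split on anything that is not a letter or
    // digit (so hyphens and quotes vanish), then drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Text as it might be indexed.
        System.out.println(analyze("this is Bill-Fred"));   // prints [bill, fred]
        // The quoted phrase as typed into the search box.
        System.out.println(analyze("\"Bill-Fred\""));        // prints [bill, fred]
    }
}
```

Under these assumptions, an "exact" match on "Bill-Fred" is really a
match on the token sequence bill, fred, so "Bill Fred" would match too.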

The toString() output told me that the quotes are handled by the
QueryParser.  The WebLucene CJK tokenizer works just fine with it, and I
didn't make any changes to it.

The "bad" news is that after going through all of this, the code just
started to work as expected.  I'm not sure what I did to fix it.

There is a minor issue I found that I think works as documented, but I
wonder why it's that way.  If you enter a search string that's a
hyphenated word such as "fred-bill" (without the quotes), the QueryParser
generates a query to find all documents with fred but without bill.
I believe this is expected behavior based on the javadocs.  The effect
is that a hyphenated word gives unexpected results unless it is
surrounded by quotes.  Perhaps the syntax should have been "fred -bill"
(space before the hyphen required) to indicate that you don't want bill
and that it's not a hyphenated word.  That seems a tad more general.

It's an issue for me because my application deals with hyphenated words
a lot, and I don't think my users would ever understand when quotes
should be used and when they should not (most of them won't figure out
how to use the "not" syntax).  I can solve it by requiring the user to
enter a space before the hyphen if they mean "not" and then having the
search code automatically add the quotes for hyphenated words.  It's
just a little painful.  Just a thought for 1.4. ;-)
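
A minimal sketch of that workaround, as plain string preprocessing done
before the text is handed to the query parser.  The class name
HyphenQuoter and the method quoteHyphenatedWords are hypothetical, and
the sketch assumes the user has not typed any quoted phrases that
contain spaces (a hyphenated word inside such a phrase would get nested
quotes).

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of Lucene.
class HyphenQuoter {
    // A whole whitespace-delimited word made of letter/digit runs joined
    // by hyphens, e.g. fred-bill.  A leading hyphen ("-bill") or an
    // adjacent quote character prevents a match.
    private static final Pattern HYPHENATED =
            Pattern.compile("(?<!\\S)([\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)+)(?!\\S)");

    // Wrap each hyphenated word in double quotes; leave everything else,
    // including " -word" exclusions, untouched.
    static String quoteHyphenatedWords(String userInput) {
        return HYPHENATED.matcher(userInput).replaceAll("\"$1\"");
    }

    public static void main(String[] args) {
        System.out.println(quoteHyphenatedWords("fred-bill"));     // prints "fred-bill"
        System.out.println(quoteHyphenatedWords("fred -bill"));    // prints fred -bill
        System.out.println(quoteHyphenatedWords("\"fred-bill\"")); // prints "fred-bill"
    }
}
```

With this in place, only a hyphen preceded by a space reaches the query
parser as a "not"; a hyphen inside a word arrives already quoted.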

-----Original Message-----
From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
Sent: Tuesday, February 03, 2004 8:26 PM
To: Lucene Users List
Subject: Re: Newbie Phrase Query question


The best suggestion I have is to look at the code in my first java.net
article (Intro Lucene) and borrow the Analyzer utility code to see what
happens to a sample string as it is analyzed.  Then pass that same
string to QueryParser (along with the same analyzer) and see what the
Query.toString(<default field name>) returns.  This should shed light
on the issue more clearly.

	Erik


On Feb 3, 2004, at 10:01 PM, Scott Smith wrote:

> I'm having problems searching for an exact match with a phrase.
> Essentially, I think my problem is that the tokenizer is tossing the
> double quotes around the phrase, tokenizing each word and so I end up
> with the document hit I want plus several more I don't (the latter
> having some of the words, but not exact matches).  Here's the
> specifics.
>
>
> First, I'm using the CJKTokenizer from WebLucene which I believe is a
> modified version of the stopword tokenizer enhanced to handle asian
> characters (that's according to the header; I don't think the asian
> characters have anything to do with my problem).
>
> The documents I need to search, for reasons related to the
> application, often end up with hyphenated words in critical places.
> For example, the original text to be indexed might be something like
> "this is Bill-Fred".
>
> When this is tokenized initially, I end up with two tokens "bill" and
> "fred" (the tokenizer converts to lower case;  "this" and "is" are
> removed as stop words; the hyphen is removed by the tokenizer).  So
> far so good.
>
> I pass the phrase I want an exact match on to a QueryParser in quotes
> (so "Bill-Fred" is the search string; quotes included).  I watched the
> output of the tokenizer from the query parser and it is clearly
> tossing the double quotes and tokenizing each word separately.  It
> passes the words "bill" and "fred" as separate entities back to the
> QueryParser. Looking at the tokenizer code, I understand why.
> Obviously, that's why I end up with documents that contain the words
> even if they are not exact matches.
>
> Here's the question.  I can modify the CJKTokenizer so that when it
> sees "Fred-Bill" it creates a single token that looks like "fred bill".
> Would this now work?  Is this the right thing to do?  I realize this
> means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably
> live with that.
>
> However, it also seems like I now have a problem if the original text
> contains a quotation from someone that happens to be part of the
> document (i.e., the original text has double quotes in it).  It seems
> like I need to ignore quotes for the initial index, but use them to
> build phrases when I'm tokenizing a search string in the QueryParser.
> Do I need two tokenizers?
>
> Does any of this make any sense?  I'm not quite sure what the
> QueryParser wants to see to properly do a phrase match.  Is
> QueryParser the wrong thing to be using here?  Suggestions or
> comments?
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org

