lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Newbie Phrase Query question
Date Wed, 04 Feb 2004 03:25:57 GMT
The best suggestion I have is to look at the code in my first java.net 
article (Intro Lucene) and borrow the Analyzer utility code to see what 
happens to a sample string as it is analyzed.  Then pass that same 
string to QueryParser (along with the same analyzer) and see what the 
Query.toString(<default field name>) returns.  This should shed light 
on the issue more clearly.

	Erik
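
A rough sketch of that debugging step, written in plain Java so it runs
without Lucene on the classpath. AnalysisDemo, its two methods, and the
stop-word list are hypothetical stand-ins. They are not Lucene's Analyzer or
QueryParser API, nor the utility code from the article, and the real
CJKTokenizer and QueryParser grammar may behave differently. The point is only
to print the tokens produced for the indexed text and for the query text so
the two can be compared side by side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AnalysisDemo {
    // Assumed stop list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("this", "is", "a", "the"));

    // Lowercase, split on anything that is not a letter or digit,
    // and drop stop words. Quotes and hyphens vanish at this stage.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Toy stand-in for "parse, then print the query": the surrounding
    // quotes are noticed BEFORE analysis, and the analyzed tokens are
    // then shown as a phrase (quoted) or as separate terms (unquoted).
    static String describeQuery(String queryText) {
        String trimmed = queryText.trim();
        boolean quoted = trimmed.length() >= 2
                && trimmed.startsWith("\"") && trimmed.endsWith("\"");
        List<String> tokens = analyze(trimmed);
        String joined = String.join(" ", tokens);
        return (quoted && tokens.size() > 1) ? "\"" + joined + "\"" : joined;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is Bill-Fred"));     // [bill, fred]
        System.out.println(describeQuery("\"Bill-Fred\""));   // "bill fred"
        System.out.println(describeQuery("Bill Fred"));       // bill fred
    }
}
```

If the tokens printed for the document and for the query line up, the
analyzer is probably not the culprit, and the printed form of the parsed
query is the next thing to inspect.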


On Feb 3, 2004, at 10:01 PM, Scott Smith wrote:

> I'm having problems searching for an exact match with a phrase.
> Essentially, I think my problem is that the tokenizer is tossing the
> double quotes around the phrase, tokenizing each word and so I end up
> with the document hit I want plus several more I don't (the latter
> having some of the words, but not exact matches).  Here are the
> specifics.
>
>
> First, I'm using the CJKTokenizer from WebLucene which I believe is a
> modified version of the stopword tokenizer enhanced to handle Asian
> characters (that's according to the header; I don't think the Asian
> characters have anything to do with my problem).
>
> The documents I need to search, for reasons related to the application,
> often end up with hyphenated words in critical places.  For example, the
> original text to be indexed might be something like "this is Bill-Fred".
>
> When this is tokenized initially, I end up with two tokens "bill" and
> "fred" (the tokenizer converts to lower case;  "this" and "is" are
> removed as stop words; the hyphen is removed by the tokenizer).  So far
> so good.
>
> I pass the phrase I want an exact match on to a QueryParser in quotes
> (so "Bill-Fred" is the search string; quotes included).  I watched the
> output of the tokenizer from the query parser and it is clearly tossing
> the double quotes and tokenizing each word separately.  It passes the
> words "bill" and "fred" as separate entities back to the QueryParser.
> Looking at the tokenizer code, I understand why.  Obviously, that's why
> I end up with documents that contain the words even if they are not
> exact matches.
>
> Here's the question.  I can modify the CJKTokenizer so that when it sees
> "Fred-Bill" it creates a single token that looks like "fred bill".
> Would this now work?  Is this the right thing to do?  I realize this
> means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably
> live with that.
>
> However, it also seems like I now have a problem if the original text
> contains a quotation from someone that happens to be part of the
> document (i.e., the original text has double quotes in it).  It seems
> like I need to ignore quotes for the initial index, but use them to
> build phrases when I'm tokenizing a search string in the QueryParser.
> Do I need two tokenizers?
>
> Does any of this make any sense?  I'm not quite sure what the
> QueryParser wants to see to properly do a phrase match.  Is QueryParser
> the wrong thing to be using here?  Suggestions or comments?
>
> Scott
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
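
To make the symptom in the quoted message concrete, here is a small sketch of
the difference between matching any of the terms and matching them as a
phrase (the same tokens, adjacent and in order). It is plain Java with
hypothetical names (PhraseMatchDemo, tokenize, matchesAnyTerm, matchesPhrase)
and does not reproduce how Lucene scores or stores positions. It only
illustrates why "bill" and "fred" treated as independent terms bring back
extra documents, while a phrase over the same two tokens would not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PhraseMatchDemo {
    // Lowercase and split on non-alphanumerics; quotes and hyphens are dropped.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Term-style match: any query token appearing anywhere is a hit.
    static boolean matchesAnyTerm(List<String> doc, List<String> query) {
        for (String term : query) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Phrase-style match: the query tokens must appear contiguously, in order.
    static boolean matchesPhrase(List<String> doc, List<String> phrase) {
        return !phrase.isEmpty() && Collections.indexOfSubList(doc, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> query = tokenize("\"Bill-Fred\"");
        String[] docs = {"this is Bill-Fred", "Bill Fred was here", "Fred called Bill"};
        for (String d : docs) {
            List<String> tokens = tokenize(d);
            System.out.println(d + " -> anyTerm=" + matchesAnyTerm(tokens, query)
                    + ", phrase=" + matchesPhrase(tokens, query));
        }
    }
}
```

In this toy model the third document matches on individual terms but not as a
phrase, and "Bill-Fred" and "Bill Fred" are indistinguishable once tokenized,
which is the trade-off the quoted message says it can live with.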



