lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Smith" <>
Subject Newbie Phrase Query question
Date Wed, 04 Feb 2004 03:01:31 GMT
I'm having problems searching for an exact match with a phrase.
Essentially, I think my problem is that the tokenizer is tossing the
double quotes around the phrase, tokenizing each word and so I end up
with the document hit I want plus several more I don't (the latter
having some of the words, but not exact matches).  Here's the specifics.

First, I'm using the CJKTokenizer from WebLucene which I believe is a
modified version of the stopword tokenizer enhanced to handle asian
characters (that's according to the header; I don't think the asian
characters have anything to do with my problem).  

The documents I need to search, for reasons related to the application,
often end up with hyphenated words in critical places.  For example, the
original text to be indexed might be something like "this is Bill-Fred".

When this is tokenized initially, I end up with two tokens "bill" and
"fred" (the tokenizer converts to lower case;  "this" and "is" are
removed as stop words; the hyphen is removed by the tokenizer).  So far
so good.

I pass the phrase I want an exact match on to a QueryParser in quotes
(so "Bill-Fred" is the search string; quotes included).  I watched the
output of the tokenizer from the query parser and it is clearly tossing
the double quotes and tokenizing each word separately.  It passes the
words "bill" and "fred" as separate entities back to the QueryParser.
Looking at the tokenizer code, I understand why.  Obviously, that's why
I end up with documents that contain the words even if they are not
exact matches.

Here's the question.  I can modify the CJKTokenizer so that when it sees
"Fred-Bill" it creates a single token that looks like "fred bill".
Would this now work?  Is this the right thing to do?  I realize this
means that I'd hit on "Fred-Bill" and "Fred Bill", but I can probably
live with that.  

However, it also seems like I now have a problem if the original text
contains a quotation from someone that happens to be part of the
document (i.e., the original text has double quotes in it).  It seems
like I need to ignore quotes for the initial index, but use them to
build phrases when I'm tokenizing a search string in the QueryParser.
Do I need two tokenizers?

Does any of this make any sense?  I'm not quite sure what the
QueryParser wants to see to properly do a phrase match.  Is QueryParser
the wrong thing to be using here?  Suggestions or comments?


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message