lucene-java-user mailing list archives

From Chris Hostetter <>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Sun, 03 Sep 2006 23:53:53 GMT

I haven't really been following this thread, but it's gotten so long
i got interested.

from what i can tell skimming the discussion so far, it seems like the
biggest confusion is about the definition of a "phrase" and what analyzers
do with "quote" characters and what the QueryParser does with "quote"
characters -- when ultimately you don't seem to really care about "phrases"
in a textual searching sense; nor do you seem to care about any of the
"features" of the QueryParser.

it seems that what you care about is:

 1) making documents, and adding a list of "text chunks" to those
    documents (what you've been calling phrases)
 2) you then want to be able to search for "almost-exact" matches on those
    "text chunks" ... these matches should be "exactish" because you don't
    want partial matches based on white spaces, or splitting on hyphens,
    but they shouldn't be truly exact because you want some simple
    normalization:

: actually would like to "normalize" a phrase (spaces) or a hyphenated word or
: an underscored word to the same value -- e.g. MS-WORD or ms_WORd or "MS
: Word" --> ms_word.

in which case, you should:

 a) write yourself an analyzer which does no "tokenizing" (ie: each input
    Field value generates a single token) but does the normalization you
    want.
 b) use this Analyzer when you add the fields to your documents, even
    though you don't want *real* tokenization, and make the field type
    TOKENIZED so your analyzer gets used.
 c) when you get some text input to search on, pass it to the same
    Analyzer, take the Token you get back and manually construct a
    TermQuery out of it for the necessary field.

...that's it.  that's all she wrote -- don't even look in QueryParser's
general direction, at all.
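
for what it's worth, the normalization step in (a) might look roughly like
the sketch below. it is plain Java with no Lucene dependency: the class name
ChunkNormalizer and the method normalize are made up for illustration, and
the rule itself (trim, lowercase, then collapse any run of whitespace,
hyphens or underscores into a single underscore) is only an assumption based
on the MS-WORD example quoted above.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene: turns one "text chunk" into the
// single normalized value that would be indexed and searched as one term.
final class ChunkNormalizer {
    private ChunkNormalizer() {}

    static String normalize(String chunk) {
        return chunk.trim()
                // Locale.ROOT keeps lowercasing independent of the JVM's default locale
                .toLowerCase(Locale.ROOT)
                // any run of whitespace, underscores or hyphens becomes one underscore
                .replaceAll("[\\s_-]+", "_");
    }
}
```

the idea would be to call this from inside the single-token analyzer at index
time, and again on the raw search input before building the TermQuery, so
both sides always see the same value.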

