lucene-java-user mailing list archives

From "Mark Miller" <markrmil...@gmail.com>
Subject Re: Phrase search using quotes -- special Tokenizer
Date Tue, 05 Sep 2006 18:36:07 GMT
Some info to help you on your journey :)

1. If you add a field as untokenized then it will not be analyzed when added
to the index. However, QueryParser will not know that this happened and will
tokenize queries on that field.

2. The solution that Hoss has explained to you is to leave the default quote
handling in place. The default quote handling is this:

On indexing: the analyzers ditch all quotes. As far as the index is concerned
they are of no value...position increments are used instead.

Searching with QueryParser: when the QueryParser detects something in
quotes, it takes what's between the quotes and passes that to
getFieldQuery(). getFieldQuery() then analyzes the quoted chunk sans the
quotes. Stop words are removed, stemming is performed, etc., depending on
your analyzer. getFieldQuery() sees that multiple tokens came out of the
analyzer and that the positions between tokens indicate that you are going
for a phrase search, so a phrase query is generated. A phrase search with
stop words removed has interesting sloppy matching. A phrase search can also
match out of order given enough slop. This is normally fine behavior for
most applications I can think of. You need to consider whether it is fine
behavior for you. You first mentioned that you only want exact matches to be
made on quoted searches...that you want no stop words removed, etc. If there
is some reason you really need this (I don't see it myself) then use the
method I gave you. I would think you should be fine with the normal
behavior, but then I don't know why you asked about this to begin with.

3. If you are mixing quoted data with non-quoted data, a per-field analyzer
won't be of much help. The quoted and unquoted data will be in the same
field, I assume. Are you separating the quoted stuff from the non-quoted and
putting them in separate fields?
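
To make point 1 concrete, here is a rough sketch of the mismatch. It is a toy
model in plain Java, not Lucene code: UntokenizedMismatch, analyze(),
indexUntokenized() and anyQueryTermMatches() are made-up names, and the
"analyzer" is assumed to do nothing more than lowercase and split on
whitespace.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: this is not Lucene code, just the shape of the mismatch.
final class UntokenizedMismatch {

    // Stand-in for an analyzer (assumption): lowercase and split on whitespace.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Untokenized field: the whole value is stored as a single term, unchanged.
    static Set<String> indexUntokenized(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value);
        return terms;
    }

    // The query side still analyzes, so it looks up each analyzed token.
    static boolean anyQueryTermMatches(Set<String> indexedTerms, String query) {
        for (String token : analyze(query)) {
            if (indexedTerms.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = indexUntokenized("New York");
        // The analyzed query yields "new" and "york"; the index only holds "New York".
        System.out.println(anyQueryTermMatches(index, "New York"));
        // Looking up the exact, unanalyzed value does find the term.
        System.out.println(index.contains("New York"));
    }
}
```

In this sketch the analyzed query never finds the untokenized value, while a
lookup of the exact stored string does. The query side has to skip analysis
for that field for the two to line up.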
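
The default quote handling from point 2 can be sketched the same way. Again
this is a toy model and not Lucene's implementation: QuotedPhraseSketch, its
Token class and its stop list are hypothetical, the analyzer is assumed to
drop stop words while leaving a position gap where they stood, and slop is
not modeled at all (only slop 0).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these names are Lucene classes.
final class QuotedPhraseSketch {

    // Hypothetical stop list for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "and", "of", "the"));

    // A term plus the position it held in the original text.
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Lowercase, split on non-alphanumerics (so quotes vanish), drop stop words.
    // A dropped word still consumes a position, which leaves a gap.
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue;
            }
            if (!STOP_WORDS.contains(word)) {
                tokens.add(new Token(word, position));
            }
            position++;
        }
        return tokens;
    }

    static boolean hasTermAt(List<Token> tokens, String term, int position) {
        for (Token t : tokens) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    // Phrase match with slop 0: same terms at the same relative positions.
    static boolean phraseMatches(String document, String quotedQuery) {
        List<Token> doc = analyze(document);
        List<Token> phrase = analyze(quotedQuery);
        if (phrase.isEmpty()) {
            return false;
        }
        Token first = phrase.get(0);
        for (Token start : doc) {
            if (!start.term.equals(first.term)) {
                continue;
            }
            int offset = start.position - first.position;
            boolean allFound = true;
            for (Token p : phrase) {
                if (!hasTermAt(doc, p.term, p.position + offset)) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, the quoted query "fear of flying" analyzes to two
tokens with a gap between them, so it matches "fear of flying" and also
"fear and flying", but not "flying fear". That is the kind of loose-but-usually-fine
matching a quoted search gets once stop words are removed.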


- Mark
