lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: UN_TOKENIZED and StandardAnalyzer
Date Fri, 06 Apr 2007 14:25:43 GMT
Really, really, really get a copy of Luke. Really. Use it to open
your index and run experimental queries, especially to see
how they get rewritten (but be sure to pick the appropriate
analyzer).

Google "lucene luke". Really, really get a copy. It'll help you
make MUCH faster progress than waiting for someone to
get around to answering posts here <G>....

Then use query.toString() to see how things pars in your programs
(or use Luke's search tab to see similar).

If you do, you'll see that parsing the string "winter sport" becomes
CATEGORY:winter CATEGORY:sport

using StandardAnalyzer with the default operator OR will not
match an untokenized field of "winter category".
AND won't work either, it's an UN_TOKENIZED field

Using KeywordAnzlyer gives
CATEGORY:winter CATEGORY:sport
This does NOT match the document
I indexed UN_TOKENIZED 'winter sport'. What the heck?

But using KeywordAnalyzer with
"\"winter sport\""

That is, where the double quotes are port of the query and using
KeywordAnalyzer gives
CATEGORY:winter sport
and produces the correct match on my UN_TOKENIZED field. What's
happening here is the double quotes cause the parser to pass both
words as a single token rather than two tokens.

This has been a source of some confusion on the list, there
is a very long thread you'll find if you search the archives
for KeywordAnalyzer.


Best
Erick

On 4/6/07, Roberto Fonti <roberto.fonti@buongiorno.com> wrote:
>
> Hi All,
> I'm indexing categories with this code:
>
> for (Category category : item.getCategories()) {
>     lucene_doc.add(new Field(
>         "CATEGORY",
>         category.getName(),
>         Field.Store.NO,
>         Field.Index.UN_TOKENIZED));
> }
>
> And searching using the query:
>
> String query = "CATEGORY:("+category.getName()+")";
>
> I've configured to use the StandardAnalyzer both in the IndexWriter for
> the QueryParser.
>
> Everything goes fine BUT with categories that contains whitespaces (or
> other chars that get tokenized).
>
> * If category is "sport" - ok, I get the result from the search
> * If category is "winter sport" - I get no result from search
>
> I've tried with a number of search syntax:
> +CATEGORY:"winter sport"
> +CATEGORY:winter +CATEGORY:sport
> +CATEGORY:(winter sport)
> and other...
> but none of them work.
>
> What's wrong with that?
> By the way, using the KeywordAnalyzer it works, but it is not the
> correct analyzer for my application.
> Shouldn't the Analyzer be ignored for a Field.Index.UN_TOKENIZED field?
>
> Thanks,
> Roberto
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message