lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Lucene Seaches VS. Relational database Queries
Date Thu, 13 Apr 2006 17:30:27 GMT
Warning: I'm quite new to Lucene, so this may not be very accurate...

What analyzers are you using for indexing and searching? StandardAnalyzer
(like in most of the examples)? Because it looks like you're having a
tokenizer problem. That is, when you index "Assistant producer", you
actually index two tokens, "assistant" and "producer". This is the default
behavior for most of the stock analyzers; see page 119 of "Lucene in Action".
So, when you search your index for "producer", you also hit on the
"producer" token in "assistant producer".
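
To make the tokenizing point concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: letterLowercaseTokens is a hypothetical stand-in that only roughly mimics a letter-splitting, lowercasing analyzer, and the "index" is just a map from token to document ids. The two sample titles are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TokenSplitSketch {
    // Hypothetical stand-in for a stock analyzer (NOT a Lucene class):
    // split on anything that is not a letter, then lowercase each piece.
    public static List<String> letterLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Toy inverted index: token -> ids of the documents that contain it.
    public static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> inverted = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : letterLowercaseTokens(docs.get(id))) {
                inverted.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> inverted =
                index(List.of("Producer", "Assistant producer"));
        // Both documents match, because doc 1 contributed a "producer" token too.
        System.out.println(inverted.get("producer")); // prints [0, 1]
    }
}
```

The second title matches a search for "producer" only because it was broken into two separate tokens at index time.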

A quick test of this would be to munge your input and search terms to
include an underscore for terms like "assistant producer" (that is, index
and search for "assistant_producer"), AND use WhitespaceAnalyzer for BOTH
indexing and searching. This should (if I understand your problem
correctly), fix the example above. If you use any of the other stock
analyzers (Simple, Stop or Standard), they'll split your terms at the
underscore. And if you use one analyzer for indexing and a different one for
searching, the results are, er, interesting and confusing.
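
The underscore trick and the analyzer mismatch can be sketched the same way. Again this is plain Java, not Lucene: whitespaceTokens and letterLowercaseTokens are hypothetical stand-ins that only approximate whitespace-only splitting and letter-based splitting with lowercasing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class UnderscoreSketch {
    // Stand-in for whitespace-only tokenizing: split on whitespace, keep case.
    public static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
    }

    // Stand-in for letter-based tokenizing: split on non-letters, lowercase.
    public static List<String> letterLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "assistant_producer";
        System.out.println(whitespaceTokens(title));      // prints [assistant_producer]
        System.out.println(letterLowercaseTokens(title)); // prints [assistant, producer]

        // Mismatch: index with letter splitting, query with whitespace splitting.
        Set<String> indexed = new HashSet<>(letterLowercaseTokens(title));
        String queryTerm = whitespaceTokens(title).get(0);
        System.out.println(indexed.contains(queryTerm));  // prints false
    }
}
```

With whitespace splitting on both sides the munged term survives as one token; mixing the two styles means the query term never appears in the index.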

Now, your query select count(distinct Crew_ID) from Crew_TItles where
Title="Producer" should produce the same results as searching your index for
producer and the Hits object should contain 3 docs.

P.S. Watch capitalization. StandardAnalyzer, SimpleAnalyzer and StopAnalyzer
all lowercase automatically, but WhitespaceAnalyzer does NOT.

If that works, then you will probably want to create your own Analyzer that
recognizes your special terms and deals with them as you see fit. This works
well with PerFieldAnalyzerWrapper to allow you to use your own special
analyzer for one field in your index and other analyzers for other fields,
as appropriate.
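
The per-field idea can be sketched without Lucene as a default analysis plus a map of per-field overrides. This only illustrates the dispatch pattern and is not the PerFieldAnalyzerWrapper API; the field names "title" and "notes", the class PerFieldSketch, and the wholeValueToken analysis are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    // Hypothetical custom analysis: the whole field value is one lowercased token.
    public static List<String> wholeValueToken(String text) {
        return List.of(text.trim().toLowerCase(Locale.ROOT));
    }

    // Stand-in for a stock analyzer: split on non-letters, lowercase.
    public static List<String> letterLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    private final Function<String, List<String>> defaultAnalysis;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    public PerFieldSketch(Function<String, List<String>> defaultAnalysis) {
        this.defaultAnalysis = defaultAnalysis;
    }

    // Register a special analysis for one field; other fields use the default.
    public void addAnalysis(String field, Function<String, List<String>> analysis) {
        perField.put(field, analysis);
    }

    public List<String> tokens(String field, String text) {
        return perField.getOrDefault(field, defaultAnalysis).apply(text);
    }

    public static void main(String[] args) {
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::letterLowercaseTokens);
        wrapper.addAnalysis("title", PerFieldSketch::wholeValueToken);
        System.out.println(wrapper.tokens("title", "Assistant Producer")); // prints [assistant producer]
        System.out.println(wrapper.tokens("notes", "Assistant Producer")); // prints [assistant, producer]
    }
}
```

Whatever analysis is chosen for a field, the same one has to be applied at both index time and search time.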

Like I say, I'm new here, but this is a possibility I thought I'd mention.

Erick
