lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Kalvir Sandhu" <k...@kalv.co.uk>
Subject Re: Query Analyzer Issue
Date Fri, 31 Aug 2007 19:06:35 GMT
You should test your Analyzer to confirm what tokens are being produced. You
can do this by using a helper class, to save time there is one written in
the Lucene in Action book called AnalyzerUtils, you should be able to get it
out of the download of sourcecode from the book here:
http://www.lucenebook.com/LuceneInAction.zip

It's very easy to use. You give it an analyzer and some text and out will
pop the tokens.

You can also use Luke:
http://www.getopt.org/luke/ - To diagnose your index. Very useful tool. You
can see what terms you have stored against what document.

On 8/31/07, Harini Raghavan <harini.raghavan@insideview.com> wrote:
>
> Hi Everyone,
>
> I am facing some strange behaviour with Analyzers. I am using
> SimpleAnalyzer
> for some fields in my Compass entity, but I also wrote a custom Analyzer
> that is slightly different from the SimpleAnalyzer as I wanted to allow
> even
> letters and digits in company name column.
> So custom analyzer CompanyNameAnalyzer uses the following tokenizer.
>
> public class CompanyNameTokenizer extends CharTokenizer {
>   /** Construct a new CompanyNameTokenizer. */
>   public CompanyNameTokenizer(Reader in) {
>     super(in);
>   }
>
>   protected char normalize(char c) {
>     return Character.toLowerCase(c);
>   }
>
>   /** Collects only characters which are numbers or letters */
>   protected boolean isTokenChar(char c) {
>     Character ch = new Character(c);
>     // Exclude @ special character while tokenizing
>     if(Character.isLetterOrDigit(c) || ch.equals('@'))
>         return true;
>     else
>         return false;
>   }
> }
>
> But for some reason when I search for +companyName:good +companyName:will,
> the word 'will' is ignored in the search, I get results that match only
> good. I guess this means that 'will' is being stripped off as it is a stop
> word. I don't understand why this should happen because the custom
> Analyzer
> I am using does not use the StopFilter. So why should it filter the stop
> words?
>
> I tried looking at the Lucene source too, but no luck. Any suggestions
> would
> be appreciated.
>
> Thanks,
> Harini
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message