lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From AHMET ARSLAN <iori...@yahoo.com>
Subject Re: Lucene Analyzer that can handle C++ vs C#
Date Fri, 11 Dec 2009 18:04:59 GMT

> Can someone please point me in the right direction.
> 
> We are creating an application that needs to beable to
> search on C++ and get
> back doc's that have C++ in it.  The StandardAnalyzer
> does not seem to index
> the "+", so a search for "C++" will bring back docs that
> contain, C++, C,
> C#, etc.....  The WhiteSpaceAnalyzer will index the
> "+", but if we have the
> term "C++." that is, if C++ is at the end of a sentence, it
> will index
> "C++." so a search for "C++" will not return the doc. 
> I have heard of maybe
> a CustomAnalyzer; however, it seems like there would
> actually need to be a
> CustomFilter/CustomTokenizer, I looked at:
>      - StandardAnalyzer.java
>      - StandardFilter.java
>      - StandardTokenizer.java
>      - StandardTokenizerImpl.java
>      - StandardTokenizerImpl.jflex
> 
> I would guess that the StandardTokenizer is where the
> changes would need to
> be made to allow the "+" character, but I am unclear as to
> how.
> 
> Any and all help is greatly appreciated.

One option is to modify StandardTokenizerImpl.jflex and generate CustomTokenizerImpl.java
so that it will recognize C++ and C# as one token. You need to write a new Tokenizer that
uses that CustomTokenizerImpl.java.

Other option can be to extend CharTokenizer. Modify the source code of LetterTokenizer : 

 @Override
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c) || c=='+' || c=='#';
  }

Hope this helps.


      

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message