lucene-java-user mailing list archives

Subject Re: Lucene Analyzer that can handle C++ vs C#
Date Fri, 11 Dec 2009 18:04:59 GMT

> Can someone please point me in the right direction?
> We are creating an application that needs to be able to search on C++ and
> get back docs that have C++ in them. The StandardAnalyzer does not seem to
> index the "+", so a search for "C++" will bring back docs that contain C++,
> C, C#, etc. The WhitespaceAnalyzer will index the "+", but if we have the
> term "C++." (that is, if C++ is at the end of a sentence), it will index
> "C++." so a search for "C++" will not return the doc. I have heard of maybe
> a CustomAnalyzer; however, it seems like there would actually need to be a
> CustomFilter/CustomTokenizer. I looked at:
>      -
>      -
>      -
>      -
>      - StandardTokenizerImpl.jflex
> I would guess that the StandardTokenizer is where the changes would need to
> be made to allow the "+" character, but I am unclear as to how.
> Any and all help is greatly appreciated.

One option is to modify StandardTokenizerImpl.jflex and regenerate the
tokenizer implementation from it, so that it recognizes C++ and C# each as a
single token. You would then need to write a new Tokenizer that uses the
generated implementation.
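
To give a rough idea of what such a rule would do, here is a small sketch
that imitates it with java.util.regex instead of JFlex. It is not JFlex
syntax and not the real StandardTokenizer grammar; the class name and the
pattern are made up for illustration, and the rule (a run of letters or
digits, optionally followed by "++" or "#") is only an assumption about what
you might want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a grammar rule; does not use Lucene or JFlex.
class LanguageNameRuleSketch {

    // A run of letters or digits, optionally followed by "++" or "#".
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{Nd}]+(?:\\+\\+|#)?");

    // Return every match of the rule, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [I, like, C++, and, C#, not, a, b]
        System.out.println(tokenize("I like C++. and C#, not a+b"));
    }
}
```

With a rule like this, the trailing "." in "C++." is dropped and a lone "+"
(as in "a+b") still splits tokens.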

Another option is to extend CharTokenizer. Start from the source code of
LetterTokenizer and modify isTokenChar:

  // Treat '+' and '#' as token characters in addition to letters.
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c) || c == '+' || c == '#';
  }
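
To see what this predicate does to the text from the question, here is a
standalone sketch that does not depend on Lucene. It only imitates the way a
CharTokenizer groups consecutive token characters into one token; the class
name and the tokenize helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of CharTokenizer-style splitting; not Lucene code.
class PlusHashTokenizerSketch {

    // Same predicate as the isTokenChar override above.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || c == '+' || c == '#';
    }

    // Emit each maximal run of token characters as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [We, use, C++, Also, C#, and, C, but, a+b, too]
        System.out.println(tokenize("We use C++. Also C# and C, but a+b too."));
    }
}
```

Note that "C++." becomes "C++", but "a+b" also stays together as one token,
and digits are dropped because only letters, '+' and '#' are kept. Whether
that is acceptable depends on your data.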

Hope this helps.

