lucene-java-user mailing list archives

From maxSchlein <m_schl...@hotmail.com>
Subject Re: Lucene Analyzer that can handle C++ vs C#
Date Thu, 24 Dec 2009 17:05:09 GMT

Here is the solution.  I used a CustomAnalyzer that wraps the token stream
in a CustomFilter, which strips the unwanted punctuation from each term.

Easy enough, but if I want to move to the current version of Lucene, 3.0,
the methods this relies on, TokenStream.next() and TokenStream.next(Token),
are no longer there.  In 2.9.0 they are deprecated, as are
Token.setTermText() and Token.termText().  The newer versions say to use
incrementToken() and the AttributeSource APIs instead, but I cannot find
much help on using them this way.  Any help is again appreciated.

Merry Christmas too.

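// Imports for Lucene 2.9.x (both classes below).
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class CustomAnalyzer extends Analyzer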
{
    @Override
    public TokenStream tokenStream(final String fieldName, final Reader reader)
    {
        TokenStream ts = new WhitespaceTokenizer(reader);
        ts = new StopFilter(false, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        ts = new LowerCaseFilter(ts);
        ts = new CustomFilter(ts);

        return ts;
    }

}

public class CustomFilter extends TokenFilter
{
    protected CustomFilter(TokenStream tokenStream)
    {
        super(tokenStream);
    }
    @Override
    public Token next(final Token reusableToken) throws IOException
    {
        Token nextToken = input.next(reusableToken);
        
        if (nextToken != null)
        {
            // Strip punctuation that the whitespace tokenizer leaves attached to a term.
            nextToken.setTermText(
                nextToken.termText().replaceAll(":|,|\\(|\\)|“|~|;|&|\\.", ""));
        }
        return nextToken;
    }
}
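
For comparison, here is a minimal sketch of the shape the same filter takes
under an incrementToken()-style API.  It is deliberately plain Java with no
Lucene dependency: TermSlot, Stream, WhitespaceSource and StripFilter are
hypothetical stand-ins invented for this example, not Lucene classes.  In
Lucene 3.0 the corresponding pieces would presumably be TokenFilter, its
protected "input" field, and a TermAttribute obtained with
addAttribute(TermAttribute.class); that mapping is an assumption to check
against the 3.0 javadocs.  The stop-word stage is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class AttributeFilterSketch {

    // Same character set as the replaceAll() call in CustomFilter above.
    // Assumption: the quote character there is U+201C, written here as an escape.
    private static final Pattern STRIP =
        Pattern.compile(":|,|\\(|\\)|\u201C|~|;|&|\\.");

    /** Removes the unwanted punctuation from a single term. */
    public static String strip(String term) {
        return STRIP.matcher(term).replaceAll("");
    }

    /** Hypothetical stand-in for a term attribute: one mutable slot shared by the chain. */
    static final class TermSlot {
        String term;
    }

    /** Hypothetical stand-in for a token stream: advance first, then read the slot. */
    interface Stream {
        boolean incrementToken();
        TermSlot slot();
    }

    /** Source stage: splits on whitespace and lower-cases each word. */
    static final class WhitespaceSource implements Stream {
        private final Iterator<String> words;
        private final TermSlot slot = new TermSlot();

        WhitespaceSource(String text) {
            List<String> list = new ArrayList<String>();
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                for (String w : trimmed.split("\\s+")) {
                    list.add(w.toLowerCase(Locale.ROOT));
                }
            }
            words = list.iterator();
        }

        public boolean incrementToken() {
            if (!words.hasNext()) {
                return false;          // end of stream
            }
            slot.term = words.next();  // overwrite the shared slot in place
            return true;
        }

        public TermSlot slot() {
            return slot;
        }
    }

    /** Filter stage: instead of returning a Token, it edits the shared slot. */
    static final class StripFilter implements Stream {
        private final Stream input;

        StripFilter(Stream input) {
            this.input = input;
        }

        public boolean incrementToken() {
            if (!input.incrementToken()) {
                return false;          // nothing left upstream
            }
            TermSlot s = input.slot();
            s.term = strip(s.term);    // rewrite the current term
            return true;
        }

        public TermSlot slot() {
            return input.slot();
        }
    }

    /** Runs the two-stage chain and collects the resulting terms. */
    public static List<String> analyze(String text) {
        Stream ts = new StripFilter(new WhitespaceSource(text));
        List<String> out = new ArrayList<String>();
        while (ts.incrementToken()) {
            out.add(ts.slot().term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("We use C++. Also C#, and (Java)."));
    }
}
```

The point of the sketch is the control flow: the filter returns a boolean
rather than a Token, and the term is changed in a slot shared with the
upstream stage.  As in the 2.9 version, a token made up only of stripped
characters would come through as an empty term.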



maxSchlein wrote:
> 
> Can someone please point me in the right direction?
> 
> We are creating an application that needs to be able to search on C++ and
> get back docs that have C++ in them.  The StandardAnalyzer does not seem to
> index the "+", so a search for "C++" will bring back docs that contain C++,
> C, C#, etc.  The WhitespaceAnalyzer will index the "+", but if we have the
> term "C++." (that is, if C++ is at the end of a sentence), it will index
> "C++." so a search for "C++" will not return the doc.  I have heard that a
> CustomAnalyzer might help; however, it seems like there would actually need
> to be a CustomFilter/CustomTokenizer.  I looked at:
>      - StandardAnalyzer.java
>      - StandardFilter.java
>      - StandardTokenizer.java
>      - StandardTokenizerImpl.java
>      - StandardTokenizerImpl.jflex
> 
> I would guess that the StandardTokenizer is where the changes would need to
> be made to allow the "+" character, but I am unclear as to how.
> 
> Any and all help is greatly appreciated.
> 
> Going through all the documents and replacing "+" with the word "plus" is
> not really an option for us.
> 
