lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From heikki <tropic...@gmail.com>
Subject Question about custom Analyzer
Date Thu, 04 Nov 2010 09:07:23 GMT
hello Lucene list,

I have a question about a custom Analyzer we're trying to write. The
intention is that it tokenizes on whitespace, and abstracts over
upper/lowercase and accented characters. It is used both when indexing
documents, and before creating lucene queries from search terms.

I have 2 implementations. The first one seems to work only correctly if the
index is rebuilt after we add something to the index. If we do not rebuild,
the newly added document is not found when you search for it. I've no idea
what could cause this behaviour. I'm posting its code below, called "Attempt
A".

The second implementation seems to work better. Using it, newly indexed
documents are immediately findable, without first rebuilding the index. It
also seems to abstract over upper/lowercase, and in my colleague's tests
(but not in mine) seems to abstract over accented characters. I'm posting
its code below, called "Attempt B".

We do not understand why our first implementation "Attempt A" behaves like
it does. We also do not understand why the second implementation "Attempt B"
improves on that, and whether that implementation actually fulfills our
goals (seeing the different test results we got).

So I'd very much appreciate it if someone could help us understand this, and
tell us if we're taking the right approach here to achieve this seemingly
simple goal.


Kind regards
Heikki Doeleman

===============================================
Attempt A :

public final class GeoNetworkAnalyzer extends Analyzer {

         @Override
         public TokenStream tokenStream(String fieldName, Reader reader) {
             TokenStream ts = new WhitespaceTokenizer(reader);
             ts = new LowerCaseFilter(ts);
             ts = new ASCIIFoldingFilter(ts);
             return ts;
         }

         @Override
         public TokenStream reusableTokenStream(String fieldName, Reader
reader) throws IOException {
           TokenStream ts = (TokenStream) getPreviousTokenStream();
           if (ts == null) {
             ts = tokenStream(null, reader);
             setPreviousTokenStream(ts);
           }
           else {
             ts.reset();
           }
           return ts;
         }
}

=================================================
Attempt B :

public final class GeoNetworkAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new ASCIIFoldingFilter(new LowerCaseFilter(new
WhitespaceTokenizer(reader)));
    }

    @Override
    public TokenStream reusableTokenStream(String fieldName, Reader reader)
throws IOException {
        Tokenizer tokenizer = (Tokenizer) getPreviousTokenStream();
        if (tokenizer == null) {
            tokenizer = new WhitespaceTokenizer(reader);
            setPreviousTokenStream(tokenizer);
        } else
            tokenizer.reset(reader);
        return tokenizer;
    }
}

=================================================

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message