lucene-java-user mailing list archives

From Enrico Detoma <>
Subject Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"
Date Thu, 08 Oct 2009 13:42:22 GMT
Hi all,

I'm trying to implement a "stop phrases filter" with the new TokenStream API.

I would like to peek N tokens ahead and check whether the current token
plus the N subsequent tokens match a "stop phrase" (the set of stop phrases
is stored in a HashSet). If they match, all of these tokens should be
discarded; if they don't, all of them should be kept.
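The lookahead-and-match step described above can be sketched without any Lucene classes. This is only an illustration of the buffering logic, under these assumptions: tokens are modelled as plain strings from an Iterator, phrases are space-joined, the longest matching phrase wins, and position increments are ignored. The class and method names (StopPhraseSketch, filter, longestMatch) are hypothetical and not part of Lucene.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical, Lucene-free model of a stop phrase filter.
final class StopPhraseSketch {

    // Returns the tokens that survive once every stop phrase is dropped.
    // maxWords is the length, in tokens, of the longest stop phrase.
    static List<String> filter(Iterator<String> input, Set<String> stopPhrases, int maxWords) {
        int window = Math.max(1, maxWords); // always look at one token at least
        Deque<String> lookahead = new ArrayDeque<String>();
        List<String> output = new ArrayList<String>();
        while (true) {
            // Top up the lookahead buffer to at most `window` tokens.
            while (lookahead.size() < window && input.hasNext()) {
                lookahead.addLast(input.next());
            }
            if (lookahead.isEmpty()) {
                return output; // input exhausted and nothing left to emit
            }
            int matched = longestMatch(lookahead, stopPhrases);
            if (matched > 0) {
                // Stop phrase found: discard all of its tokens.
                for (int i = 0; i < matched; i++) {
                    lookahead.removeFirst();
                }
            } else {
                // No phrase starts here: emit one token, keep the rest buffered.
                output.add(lookahead.removeFirst());
            }
        }
    }

    // Number of leading buffered tokens forming the longest stop phrase, or 0.
    static int longestMatch(Deque<String> lookahead, Set<String> stopPhrases) {
        StringBuilder candidate = new StringBuilder();
        int words = 0;
        int best = 0;
        for (String token : lookahead) {
            if (words > 0) {
                candidate.append(' ');
            }
            candidate.append(token);
            words++;
            if (stopPhrases.contains(candidate.toString())) {
                best = words;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<String>(Arrays.asList("as well as", "in order to"));
        List<String> tokens = Arrays.asList("we", "index", "in", "order", "to", "search", "as", "well");
        System.out.println(filter(tokens.iterator(), stop, 3)); // prints [we, index, search, as, well]
    }
}
```

The point of the sketch is that tokens which turn out not to start a stop phrase stay in the buffer and are emitted one at a time; nothing is ever "un-read" from the input.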

For this purpose I would like to use captureState() and then restoreState()
to get back to the starting point of the stream.
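One point worth keeping in mind here: as far as the 2.9 API is documented, captureState() takes a snapshot of the current attribute values and restoreState() copies such a snapshot back into the attributes; neither call moves the upstream input back. Tokens that were read ahead therefore have to be kept in a queue of snapshots and replayed later. The toy model below illustrates that idea in plain Java. It does not use Lucene; Attrs, peek() and next() are hypothetical stand-ins for the attribute state, a lookahead read, and incrementToken().

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical toy model: snapshots are queued and replayed, the input never rewinds.
final class SnapshotReplaySketch {

    // Stand-in for the attribute values of the current token.
    static final class Attrs {
        String term;
        int posIncr = 1;

        Attrs copy() { // plays the role of captureState()
            Attrs a = new Attrs();
            a.term = term;
            a.posIncr = posIncr;
            return a;
        }

        void restore(Attrs saved) { // plays the role of restoreState()
            term = saved.term;
            posIncr = saved.posIncr;
        }
    }

    final Attrs current = new Attrs();
    private final Iterator<String> upstream;
    private final Deque<Attrs> pending = new ArrayDeque<Attrs>();

    SnapshotReplaySketch(Iterator<String> upstream) {
        this.upstream = upstream;
    }

    // Reads one token ahead and queues a snapshot of it; false at end of input.
    boolean peek() {
        if (!upstream.hasNext()) {
            return false;
        }
        current.term = upstream.next();
        current.posIncr = 1;
        pending.addLast(current.copy());
        return true;
    }

    // Emits the next token: queued snapshots first, then fresh input.
    boolean next() {
        if (!pending.isEmpty()) {
            current.restore(pending.removeFirst());
            return true;
        }
        if (!upstream.hasNext()) {
            return false;
        }
        current.term = upstream.next();
        current.posIncr = 1;
        return true;
    }

    public static void main(String[] args) {
        SnapshotReplaySketch s = new SnapshotReplaySketch(Arrays.asList("a", "b", "c").iterator());
        s.peek();
        s.peek(); // "a" and "b" are consumed upstream; current.term is now "b"
        while (s.next()) {
            System.out.println(s.current.term); // prints a, b, c on separate lines
        }
    }
}
```

In a real filter the queue would hold the State objects returned by captureState(), and the position increment of the first emitted token would need to absorb the increments of any discarded tokens.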

I tried many combinations of these APIs. My latest attempt is the code
below, which doesn't work.

    static private HashSet<String> m_stop_phrases = new HashSet<String>();
    static private int m_max_stop_phrase_length = 0;
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        Stack<State> stateStack = new Stack<State>();
        StringBuilder match_string_builder = new StringBuilder();
        int skippedPositions = 0;
        boolean is_next_token = true;
        while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) {
            if (match_string_builder.length() > 0)
                match_string_builder.append(" ");
            skippedPositions += posIncrAtt.getPositionIncrement();
            is_next_token = input.incrementToken();
            if (m_stop_phrases.contains(match_string_builder.toString())) {
              // Stop phrase is found: skip the number of tokens
              // without restoring the state
              posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() +
              return is_next_token;
        // No stop phrase found: restore the stream
        while (!stateStack.empty())
        return true;

What is the right direction to look in to implement my "stop phrases"
filter?

Thank you
