lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken / captureState / restoreState), cannot implement a "stop phrases filter"
Date Thu, 08 Oct 2009 13:52:04 GMT
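restoreState only restores the token contents (the attribute values saved by
captureState), not the complete stream. So you cannot roll back the token
stream (this was not possible with the old API either). The while loop at the
end of your code does not work as you expect because of this: each
restoreState call overwrites the attributes, so after the loop only the first
captured token is visible and the others are lost. You may use
CachingTokenFilter, which can be reset and consumed again, as a source for
further work.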
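
One way to get the look-ahead behaviour without rewinding the stream is to
buffer the tokens that were read ahead and hand them out one per call. The
following is only a minimal sketch in plain Java, not Lucene code: the class
name StopPhraseSketch, the use of Iterator<String> in place of a TokenStream,
and counting the maximum phrase length in tokens are all assumptions made for
illustration. In a real TokenFilter the queue would hold the State objects
returned by captureState(), incrementToken() would call restoreState() on one
queued state per call, and the position increments of dropped tokens would
have to be added to the next emitted token.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

/** Plain-Java model of a look-ahead "stop phrases" filter (hypothetical, not Lucene API). */
final class StopPhraseSketch implements Iterator<String> {
    private final Iterator<String> input;   // stands in for the wrapped TokenStream
    private final Set<String> stopPhrases;  // phrases with tokens joined by a single space
    private final int maxPhraseTokens;      // longest stop phrase, counted in tokens
    // Tokens already read from the input but not yet emitted.
    private final Deque<String> pending = new ArrayDeque<String>();

    StopPhraseSketch(Iterator<String> input, Set<String> stopPhrases, int maxPhraseTokens) {
        this.input = input;
        this.stopPhrases = stopPhrases;
        this.maxPhraseTokens = Math.max(1, maxPhraseTokens); // always buffer at least one token
    }

    public boolean hasNext() {
        dropLeadingStopPhrases();
        return !pending.isEmpty();
    }

    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return pending.removeFirst(); // emit exactly one buffered token per call
    }

    /** Refills the look-ahead window and discards stop phrases that start at its head. */
    private void dropLeadingStopPhrases() {
        while (true) {
            while (pending.size() < maxPhraseTokens && input.hasNext()) {
                pending.addLast(input.next());
            }
            int matched = longestMatchAtHead();
            if (matched == 0) {
                return; // head token is not the start of a stop phrase: keep it
            }
            for (int i = 0; i < matched; i++) {
                pending.removeFirst(); // discard the whole phrase
            }
        }
    }

    /** Returns the token count of the longest stop phrase starting at the head, or 0. */
    private int longestMatchAtHead() {
        StringBuilder phrase = new StringBuilder();
        int count = 0;
        int best = 0;
        for (String token : pending) {
            if (count > 0) {
                phrase.append(' ');
            }
            phrase.append(token);
            count++;
            if (stopPhrases.contains(phrase.toString())) {
                best = count;
            }
        }
        return best;
    }
}
```

The point of the design is that nothing is ever "un-read" from the input:
tokens that turn out not to start a stop phrase simply stay in the queue and
are emitted by later calls, which is the part restoreState alone cannot do.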

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Enrico Detoma [mailto:enrico.detoma@gmail.com]
> Sent: Thursday, October 08, 2009 4:42 PM
> To: java-user@lucene.apache.org
> Subject: Lucene 2.9.0 [PROBLEM] : TokenStream API (incrementToken /
> captureState / restoreState), cannot implement a "stop phrases filter"
> 
> Hi all,
> 
> I'm trying to implement a "stop phrases filter" with the new TokenStream
> API.
> 
> I would like to be able to peek N tokens ahead, see if the current
> token + N subsequent tokens match a "stop phrase" (the set of stop phrases
> is stored in a HashSet), then discard all these tokens when they match a
> stop phrase, or keep them all if they don't match.
> 
> For this purpose I would like to use captureState() and then
> restoreState()
> to get back to the starting point of the stream.
> 
> I tried many combinations of these APIs. My last attempt is in the code
> below, which doesn't work.
> 
> 
> 
>     static private HashSet<String> m_stop_phrases = new HashSet<String>();
>     static private int m_max_stop_phrase_length = 0;
> ...
>     public final boolean incrementToken() throws IOException {
>         if (!input.incrementToken())
>             return false;
>         Stack<State> stateStack = new Stack<State>();
>         StringBuilder match_string_builder = new StringBuilder();
>         int skippedPositions = 0;
>         boolean is_next_token = true;
>         while (is_next_token && match_string_builder.length() < m_max_stop_phrase_length) {
>             if (match_string_builder.length() > 0)
>                 match_string_builder.append(" ");
>             match_string_builder.append(termAtt.term());
>             skippedPositions += posIncrAtt.getPositionIncrement();
>             stateStack.push(captureState());
>             is_next_token = input.incrementToken();
>             if (m_stop_phrases.contains(match_string_builder.toString())) {
>               // Stop phrase is found: skip the number of tokens
>               // without restoring the state
>               posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + skippedPositions);
>               return is_next_token;
>             }
>         }
>         // No stop phrase found: restore the stream
>         while (!stateStack.empty())
>             restoreState(stateStack.pop());
>         return true;
>     }
> 
> 
> Which is the correct direction I should look into to implement my "stop
> phrases" filter?
> 
> Thank you
> Regards
> Enrico



