lucene-java-user mailing list archives

From Daniel Shane <sha...@LEXUM.UMontreal.CA>
Subject Re: Lucene 2.9.0-rc2 [PROBLEM] : TokenStream API (incrementToken / AttributeSource), cannot implement a LookaheadTokenFilter.
Date Thu, 03 Sep 2009 15:55:13 GMT
Uwe Schindler wrote:
> There may be a problem that you may not want to restore the peek token into
> the TokenFilter's attributes itself. It looks like you want to have a Token
> instance returned from peek, but the current Stream should not reset to this
> Token (you only want to "look" into the next Token and then possibly do
> something special with the current Token). To achieve this, there is a method
> cloneAttributes() in TokenStream, that creates a new AttributeSource with
> same attribute types, which is independent from the cloned one. You can then
> use clone.getAttribute(TermAttribute.class).term() or similar to look into
> the next token. But creating this new clone is costly, so you may also create
> it once and reuse. In the peek method, you simply copy the state of this to
> the cloned attributesource.
>
> It's a bit complicated but should work somehow. Tell me if you need more
> help. Maybe you should provide us with some code, what you want to do with
> the TokenFilter.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>   
Hmm... I looked at captureState() and restoreState() and it doesn't seem 
like it would work in my scenario.

I'd like the LookAheadFilter to be able to peek() several tokens forward 
and they can have different attributes, so I don't think I should assume 
I can restoreState() safely.

Here is an application for the filter: let's say I want to recognize 
abbreviations (like S.C.R.) at the token level. I'd need to be able to 
peek() a few tokens forward to make sure S.C.R. is an abbreviation and 
not simply the end of a sentence.

So the user should be able to peek() a number of tokens forward before 
returning to the usual behavior.
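
The peek-then-consume behavior described above can be sketched without 
Lucene at all. The class below is a hypothetical helper (LookaheadBuffer 
is not a Lucene class) that buffers items from a plain Iterator; it only 
illustrates the bookkeeping of a peek cursor over a list of buffered 
tokens, and it ignores attributes entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: lets callers look ahead without consuming tokens. */
class LookaheadBuffer<T> {
    private final Iterator<T> input;
    /** Tokens that were peeked but not yet returned by next(). */
    private final List<T> peeked = new ArrayList<T>();
    /** Index in 'peeked' of the token the next peek() call will return. */
    private int peekPosition = 0;

    LookaheadBuffer(Iterator<T> input) {
        this.input = input;
    }

    /** Returns the next token after the peek cursor, or null at end of input. */
    T peek() {
        if (peekPosition >= peeked.size()) {
            if (!input.hasNext()) {
                return null;
            }
            peeked.add(input.next());
        }
        return peeked.get(peekPosition++);
    }

    /** Consumes one token (buffered tokens first) and rewinds the peek cursor. */
    T next() {
        peekPosition = 0;
        if (!peeked.isEmpty()) {
            return peeked.remove(0);
        }
        return input.hasNext() ? input.next() : null;
    }
}
```

With the input "S", ".", "C", ".", two calls to peek() return "S" and 
"." while the following next() still returns "S", so the caller can 
inspect upcoming tokens before deciding what to do with the current one.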

Here is the implementation I had in mind (still untested because of a 
StackOverflowError):

public class LookaheadTokenFilter extends TokenFilter {
    /** List of tokens that were peeked but not returned with next. */
    LinkedList<AttributeSource> peekedTokens = new LinkedList<AttributeSource>();

    /** The position in peekedTokens of the next token that peek() will return. */
    int peekPosition = 0;

    public LookaheadTokenFilter(TokenStream input) {
        super(input);
    }
 
    public boolean peekIncrementToken() throws IOException {
        if (this.peekPosition >= this.peekedTokens.size()) {
            if (this.input.incrementToken() == false) {
                return false;
            }
           
            this.peekedTokens.add(cloneAttributes());           
            this.peekPosition = this.peekedTokens.size();
            return true;
        }
        
        this.peekPosition++;       
        return true;
    }
   
    @Override
    public boolean incrementToken() throws IOException {
        reset();
       
        if (this.peekedTokens.isEmpty() == false) {
            this.peekedTokens.removeFirst();
        }
       
        if (this.peekedTokens.isEmpty() == false) {
            return true;
        }
       
        return super.incrementToken();
    }
       
    @Override
    public void reset() {
        this.peekPosition = 0;
    }   
   

    // Overridden methods...
   
    public Attribute getAttribute(Class attClass) {
        if (this.peekedTokens.size() > 0) {
            return this.peekedTokens.get(this.peekPosition).getAttribute(attClass);
        }
        return super.getAttribute(attClass);
    }
   
    // Override all of these just like getAttribute() ...
    public Iterator<?> getAttributeClassesIterator() ...
    public AttributeFactory getAttributeFactory() ...
    public Iterator getAttributeImplsIterator() ...
    public Attribute addAttribute(Class attClass) ...
    public void addAttributeImpl(AttributeImpl att) ...
    public State captureState() ...
    public void clearAttributes() ...
    public AttributeSource cloneAttributes() ...
    public boolean hasAttribute(Class attClass) ...
    public boolean hasAttributes() ...
    public void restoreState(State state) ...                     
}


Now the problem I have is that the code above triggers a 
StackOverflowError: I'm overriding incrementToken() and calling 
super.incrementToken(), which loops back because of this Lucene code:

public boolean incrementToken() throws IOException {
    assert tokenWrapper != null;

    final Token token;
    if (supportedMethods.hasReusableNext) {
        token = next(tokenWrapper.delegate);
    } else {
        assert supportedMethods.hasNext;
        token = next(); // <-- Lucene calls next()
    }
    if (token == null) return false;
    tokenWrapper.delegate = token;
    return true;
}

which then calls :

public Token next() throws IOException {
    if (tokenWrapper == null)
        throw new UnsupportedOperationException(
            "This TokenStream only supports the new Attributes API.");

    if (supportedMethods.hasIncrementToken) {
        // <-- incrementToken() gets called
        return incrementToken() ? ((Token) tokenWrapper.delegate.clone()) : null;
    } else {
        assert supportedMethods.hasReusableNext;
        final Token token = next(tokenWrapper.delegate);
        if (token == null) return null;
        tokenWrapper.delegate = token;
        return (Token) token.clone();
    }
}

and hasIncrementToken is true because I overrode incrementToken():

MethodSupport(Class clazz) {
    hasIncrementToken = isMethodOverridden(clazz, "incrementToken", METHOD_NO_PARAMS);
    hasReusableNext = isMethodOverridden(clazz, "next", METHOD_TOKEN_PARAM);
    hasNext = isMethodOverridden(clazz, "next", METHOD_NO_PARAMS);
}
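
For readers unfamiliar with this kind of check, here is a minimal sketch 
of how an "is this method overridden?" test can be written with plain 
Java reflection. This is not Lucene's isMethodOverridden() (its body is 
not shown above); BaseStream, OverridingStream and OverrideCheck are 
made-up names, and only no-argument public methods are handled.

```java
import java.lang.reflect.Method;

/** Stand-in for a base class offering both the old and the new API. */
class BaseStream {
    public boolean incrementToken() { return false; }
    public Object next() { return null; }
}

/** A subclass that overrides only incrementToken(). */
class OverridingStream extends BaseStream {
    @Override
    public boolean incrementToken() { return true; }
}

final class OverrideCheck {
    /**
     * Returns true if the public no-arg method 'name', as seen from clazz,
     * is declared somewhere other than in base (i.e. it was overridden).
     */
    static boolean isOverridden(Class<?> clazz, Class<?> base, String name) {
        try {
            Method m = clazz.getMethod(name);
            return m.getDeclaringClass() != base;
        } catch (NoSuchMethodException e) {
            return false; // no such public method at all
        }
    }
}
```

For OverridingStream the check is true for "incrementToken" and false 
for "next", which corresponds to the situation where hasIncrementToken 
is set and hasNext is not.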

This seems like a catch-22. From what I understand, if I override 
incrementToken(), I must not call super.incrementToken(). Is that right?
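
To make the cycle concrete, here is a toy model of it in plain Java. 
None of these classes are Lucene code: ToyStream only imitates the two 
quoted methods, and ToySource, BrokenFilter and DelegatingFilter are 
hypothetical names. The sketch assumes that a filter is meant to pull 
tokens from its wrapped input rather than from super, which is how 
peekIncrementToken() above already uses this.input.

```java
/** Toy model of the backwards-compatibility layer quoted above. */
class ToyStream {
    /** Stands in for supportedMethods.hasIncrementToken. */
    private final boolean hasIncrementToken;

    ToyStream() {
        boolean overridden;
        try {
            overridden = getClass().getMethod("incrementToken")
                    .getDeclaringClass() != ToyStream.class;
        } catch (NoSuchMethodException e) {
            overridden = false;
        }
        this.hasIncrementToken = overridden;
    }

    /** Default implementation bridges to the old API. */
    public boolean incrementToken() {
        return next() != null;
    }

    /** Old API: bridges back to incrementToken() when a subclass overrides it. */
    public String next() {
        if (hasIncrementToken) {
            // Virtual call: dispatches to the subclass override.
            return incrementToken() ? "token" : null;
        }
        return null;
    }
}

/** A source that produces a fixed number of tokens. */
class ToySource extends ToyStream {
    private int remaining;
    ToySource(int count) { this.remaining = count; }
    @Override
    public boolean incrementToken() { return remaining-- > 0; }
}

/** Calls super.incrementToken(): super -> next() -> this override -> super ... */
class BrokenFilter extends ToyStream {
    @Override
    public boolean incrementToken() { return super.incrementToken(); }
}

/** Delegates to the wrapped input instead, so no cycle is formed. */
class DelegatingFilter extends ToyStream {
    private final ToyStream input;
    DelegatingFilter(ToyStream input) { this.input = input; }
    @Override
    public boolean incrementToken() { return input.incrementToken(); }
}
```

In this model, calling incrementToken() on BrokenFilter recurses until 
a StackOverflowError is thrown, while DelegatingFilter wrapped around 
a ToySource simply forwards each call to its input.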

Daniel S.
