lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: TokenStream and Token APIs
Date Mon, 13 Oct 2008 09:06:16 GMT

This looks good!

One question on back compatibility: currently, TokenStream.next(Token)
takes a Token argument and returns a Token, such that the method is
encouraged, but not required, to reuse the passed-in Token instead of
allocating a new one.

You are adding a boolean nextToken() method, which then forces the
reuse (which I think is good), but you need to ensure older TokenStream
implementations still work.  I guess this amounts to a default
implementation of boolean nextToken() in the base TokenStream class.
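
A minimal sketch of what such a default implementation might look like,
assuming the old method is next(Token) and that the base class can lazily
create its own prototype. SketchToken, SketchTokenStream, LegacyWordStream
and copyFrom are hypothetical stand-ins made up for illustration, not
Lucene classes:

```java
import java.io.IOException;

// Hypothetical stand-in for Token; only the term text is modeled.
class SketchToken {
  private String term = "";

  String term() { return term; }

  void setTerm(String term) { this.term = term; }

  // Hypothetical helper: copy another token's state into this one.
  void copyFrom(SketchToken other) { this.term = other.term; }
}

// Hypothetical base class carrying both the old and the proposed API.
abstract class SketchTokenStream {
  private SketchToken prototype;

  // Old-style API: may fill and return the passed-in token or return a
  // different instance; returns null at end of stream.
  public SketchToken next(SketchToken reusable) throws IOException {
    return null;
  }

  // Assumption: the base class lazily creates a plain prototype.
  public SketchToken prototypeToken() throws IOException {
    if (prototype == null) {
      prototype = new SketchToken();
    }
    return prototype;
  }

  // Default for the new API: delegate to the old method and make sure
  // the result ends up in the prototype the consumer is holding.
  public boolean nextToken() throws IOException {
    SketchToken proto = prototypeToken();
    SketchToken result = next(proto);
    if (result == null) {
      return false;
    }
    if (result != proto) {
      proto.copyFrom(result); // old impl did not reuse the passed-in token
    }
    return true;
  }
}

// A legacy stream that overrides only the old method and ignores the
// reusable token it is handed.
class LegacyWordStream extends SketchTokenStream {
  private final String[] words;
  private int pos = 0;

  LegacyWordStream(String... words) { this.words = words; }

  @Override
  public SketchToken next(SketchToken reusable) {
    if (pos == words.length) {
      return null;
    }
    SketchToken fresh = new SketchToken();
    fresh.setTerm(words[pos++]);
    return fresh;
  }
}
```

With this, a stream that overrides only the old method would still work
through nextToken(), at the cost of one copy per token whenever the old
implementation does not reuse the Token it was given.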

Mike

Michael Busch wrote:

> Hi,
>
> I've been thinking about making the TokenStream and Token APIs more
> flexible. For example, for fields that don't store positions, the
> Token doesn't need to have a positionIncrement or a payload. With
> flexible indexing, on the other hand, people might want to add custom
> attributes to a Token that a consumer in the indexing chain could
> then use.
>
> Of course it is possible to extend Token, because it is not final,  
> and add additional attributes to it. But then consumers of the  
> TokenStream must downcast every instance of the Token object when  
> they call next(Token).
>
> I was therefore thinking about a different TokenStream API:
>
>  public abstract class TokenStream {
>    public abstract boolean nextToken() throws IOException;
>
>    public abstract Token prototypeToken() throws IOException;
>
>    public void reset() throws IOException {}
>
>    public void close() throws IOException {}
>  }
>
> Furthermore Token itself would only keep the termBuffer logic and we  
> could introduce different interfaces, like:
>
>  public interface PayloadAttribute {
>    /**
>     * Returns this Token's payload.
>     */
>    public Payload getPayload();
>
>    /**
>     * Sets this Token's payload.
>     */
>    public void setPayload(Payload payload);
>  }
>
>  public interface PositionIncrementAttribute {
>    /** Set the position increment.  This determines the position of
>     *  this token relative to the previous Token in a
>     * {@link TokenStream}, used in phrase searching.
>     */
>    public void setPositionIncrement(int positionIncrement);
>
>    /** Returns the position increment of this Token.
>     * @see #setPositionIncrement
>     */
>    public int getPositionIncrement();
>  }
>
> A consumer, e.g. the DocumentsWriter, no longer creates a Token
> instance itself, but rather calls prototypeToken(). This
> method returns a Token subclass which implements all desired
> *Attribute interfaces.
>
> If a consumer is, e.g., only interested in the positionIncrement and
> Payload, it can consume the tokens like this:
>
>  public class Consumer {
>    public void consumeTokens(TokenStream ts) throws IOException {
>      Token token = ts.prototypeToken();
>
>      PayloadAttribute payloadSource = (PayloadAttribute) token;
>      PositionIncrementAttribute positionSource =
>                    (PositionIncrementAttribute) token;
>
>      while (ts.nextToken()) {
>        char[] term = token.termBuffer();
>        int termLength = token.termLength();
>        int positionIncrement = positionSource.getPositionIncrement();
>        Payload payload = payloadSource.getPayload();
>
>        // do something with the term, positionIncrement and payload
>      }
>    }
>  }
>
> Casting is now only done once after the prototype token was created.  
> Now if you want to add another consumer in the indexing chain and  
> realize that you want to add another attribute to the Token, then  
> you don't have to change this consumer. You only need to create  
> another Token subclass that implements the new attribute in addition  
> to the previous ones and can use it in the new consumer.
>
> I haven't tried to implement this yet and maybe there are things I
> haven't thought about (like caching TokenFilters). I'd like to get
> some feedback about these APIs first to see whether this makes sense.
>
> Btw: if we think this (or another) approach to change these APIs  
> makes sense, then it would be good to change it for 3.0 when we can  
> break backwards compatibility. And then we should also rethink the  
> Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible  
> indexing!
>
> -Michael
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
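
The quoted claim that an existing consumer does not have to change when a
new attribute is added can be sketched roughly as follows.
PositionIncrementAttribute comes from the proposal above; ConfidenceAttribute,
TermToken, ScoredToken, AttrTokenStream, ScoredWordStream and both consumer
classes are hypothetical names invented for this illustration:

```java
import java.io.IOException;

interface PositionIncrementAttribute {
  void setPositionIncrement(int positionIncrement);
  int getPositionIncrement();
}

// Hypothetical new attribute, not part of the proposal.
interface ConfidenceAttribute {
  void setConfidence(float confidence);
  float getConfidence();
}

// Stand-in for the slimmed-down Token that keeps only the term.
class TermToken {
  private String term = "";
  String term() { return term; }
  void setTerm(String term) { this.term = term; }
}

// Token subclass implementing the previous attribute plus the new one.
class ScoredToken extends TermToken
    implements PositionIncrementAttribute, ConfidenceAttribute {
  private int positionIncrement = 1;
  private float confidence = 1.0f;

  public void setPositionIncrement(int inc) { positionIncrement = inc; }
  public int getPositionIncrement() { return positionIncrement; }
  public void setConfidence(float c) { confidence = c; }
  public float getConfidence() { return confidence; }
}

abstract class AttrTokenStream {
  abstract boolean nextToken() throws IOException;
  abstract TermToken prototypeToken() throws IOException;
}

// Stream that refills one ScoredToken instance on every call.
class ScoredWordStream extends AttrTokenStream {
  private final String[] words;
  private final ScoredToken token = new ScoredToken();
  private int pos = 0;

  ScoredWordStream(String... words) { this.words = words; }

  boolean nextToken() {
    if (pos == words.length) return false;
    token.setTerm(words[pos]);
    token.setPositionIncrement(1);
    token.setConfidence(1.0f / (pos + 1)); // made-up value for the demo
    pos++;
    return true;
  }

  TermToken prototypeToken() { return token; }
}

// Existing consumer: knows only PositionIncrementAttribute.
class PositionConsumer {
  // Returns the position of the last token, or -1 for an empty stream.
  int consume(AttrTokenStream ts) throws IOException {
    PositionIncrementAttribute posAttr =
        (PositionIncrementAttribute) ts.prototypeToken(); // cast once
    int position = -1;
    while (ts.nextToken()) {
      position += posAttr.getPositionIncrement();
    }
    return position;
  }
}

// New consumer: casts once to the new attribute.
class ConfidenceConsumer {
  float minConfidence(AttrTokenStream ts) throws IOException {
    ConfidenceAttribute conf = (ConfidenceAttribute) ts.prototypeToken();
    float min = Float.MAX_VALUE;
    while (ts.nextToken()) {
      min = Math.min(min, conf.getConfidence());
    }
    return min;
  }
}
```

PositionConsumer compiles and runs unchanged against the richer
ScoredToken, because it only ever sees the attribute interface it cast to.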



