lucene-dev mailing list archives

From: Michael Busch <busch...@gmail.com>
Subject: TokenStream and Token APIs
Date: Sat, 11 Oct 2008 23:33:42 GMT
Hi,

I've been thinking about making the TokenStream and Token APIs more 
flexible. E.g., for fields that don't store positions, the Token doesn't 
need to have a positionIncrement or a payload. With flexible indexing, 
on the other hand, people might want to add custom attributes to a Token 
that a consumer in the indexing chain could then use.

Of course it is possible to extend Token, because it is not final, and 
add additional attributes to it. But then consumers of the TokenStream 
must downcast every Token instance they receive when they call 
next(Token).
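
To make the problem concrete, here is a rough sketch (PartOfSpeechToken 
and OldConsumer are made-up names, just for illustration):

   // hypothetical Token subclass, just for illustration
   public class PartOfSpeechToken extends Token {
     private String partOfSpeech;
     public String getPartOfSpeech() { return partOfSpeech; }
     public void setPartOfSpeech(String pos) { partOfSpeech = pos; }
   }

   public class OldConsumer {
     public void consumeTokens(TokenStream ts) throws IOException {
       Token token = new PartOfSpeechToken();
       while ((token = ts.next(token)) != null) {
         // the cast must be repeated for every token, and it is only
         // safe if the stream really reuses the instance we passed in
         String pos = ((PartOfSpeechToken) token).getPartOfSpeech();
         // ... do something with pos
       }
     }
   }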

I was therefore thinking about a different TokenStream API:

   public abstract class TokenStream {
     /** Advances to the next token; returns false at end of stream. */
     public abstract boolean nextToken() throws IOException;

     /** Returns a reusable Token subclass implementing all
      *  attributes this stream supports. */
     public abstract Token prototypeToken() throws IOException;

     public void reset() throws IOException {}

     public void close() throws IOException {}
   }

Furthermore, Token itself would keep only the termBuffer logic, and we 
could introduce different interfaces, like:

   public interface PayloadAttribute {
     /**
      * Returns this Token's payload.
      */
     public Payload getPayload();

     /**
      * Sets this Token's payload.
      */
     public void setPayload(Payload payload);
   }

   public interface PositionIncrementAttribute {
     /** Set the position increment.  This determines the position of
      *  this token relative to the previous Token in a
      * {@link TokenStream}, used in phrase searching.
      */
     public void setPositionIncrement(int positionIncrement);

     /** Returns the position increment of this Token.
      * @see #setPositionIncrement
      */
     public int getPositionIncrement();
   }

A consumer, e.g. the DocumentsWriter, no longer creates a Token instance 
itself, but instead calls prototypeToken(). This method returns a 
Token subclass which implements all desired *Attribute interfaces.
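
To sketch how a stream could hand out such a prototype (MyToken and 
MyTokenStream are placeholder names, not part of the proposal):

   // placeholder Token subclass implementing the attributes
   // this stream supports
   public class MyToken extends Token
       implements PayloadAttribute, PositionIncrementAttribute {
     private Payload payload;
     private int positionIncrement = 1;

     public Payload getPayload() { return payload; }
     public void setPayload(Payload payload) { this.payload = payload; }

     public int getPositionIncrement() { return positionIncrement; }
     public void setPositionIncrement(int positionIncrement) {
       this.positionIncrement = positionIncrement;
     }
   }

   public class MyTokenStream extends TokenStream {
     private final MyToken token = new MyToken();

     public Token prototypeToken() {
       return token;
     }

     public boolean nextToken() throws IOException {
       // fill the reused token's termBuffer, positionIncrement and
       // payload from the input here; this empty sketch just
       // signals end of stream
       return false;
     }
   }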

If a consumer is only interested in, e.g., the positionIncrement and 
payload, it can consume the tokens like this:

   public class Consumer {
     public void consumeTokens(TokenStream ts) throws IOException {
       Token token = ts.prototypeToken();

       // cast once, up front; nextToken() refills this same instance
       PayloadAttribute payloadSource = (PayloadAttribute) token;
       PositionIncrementAttribute positionSource =
                     (PositionIncrementAttribute) token;

       while (ts.nextToken()) {
         char[] term = token.termBuffer();
         int termLength = token.termLength();
         int positionIncrement = positionSource.getPositionIncrement();
         Payload payload = payloadSource.getPayload();

         // do something with the term, positionIncrement and payload
       }
     }
   }

Casting is now done only once, after the prototype token is created. If 
you later add another consumer to the indexing chain and realize that 
you need another attribute on the Token, you don't have to change the 
existing consumers. You only need to create another Token subclass that 
implements the new attribute in addition to the previous ones and use 
it in the new consumer, as sketched below.
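
For example, suppose a new consumer needs per-token flags. With a 
made-up FlagsAttribute, only the token subclass and the new consumer 
are involved; the existing consumers keep working unchanged:

   // made-up attribute, just for illustration
   public interface FlagsAttribute {
     public int getFlags();
     public void setFlags(int flags);
   }

   // extends the MyToken sketch from above, adding the new attribute
   public class ExtendedToken extends MyToken implements FlagsAttribute {
     private int flags;
     public int getFlags() { return flags; }
     public void setFlags(int flags) { this.flags = flags; }
   }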

I haven't tried to implement this yet, and maybe there are things I 
haven't thought about (like caching TokenFilters). I'd like to get some 
feedback on these APIs first to see whether this makes sense.

Btw: if we think this (or another) approach to changing these APIs makes 
sense, then it would be good to make the change for 3.0, when we can 
break backwards compatibility. And then we should also rethink the 
Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible indexing!

-Michael


