From: "Shai Erera"
To: java-dev@lucene.apache.org
Date: Sun, 12 Oct 2008 08:13:43 +0200
Subject: Re: TokenStream and Token APIs
Message-ID: <786fde50810112313j59932a4es2776f121a129e64c@mail.gmail.com>
In-Reply-To: <48F137D6.30808@gmail.com>

In 3.0 you plan to move to Java 1.5, right? Couldn't you use Java generics then? Have the calling application pass in the Token type it wants to use, and then the consumer does not need to cast anything ...
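
A minimal sketch of this suggestion using Java 5 generics. Token, PositionIncrementAttribute and TokenStream are simplified stand-ins for the types proposed in the quoted message below (checked I/O exceptions are left out); PositionToken, ListTokenStream and GenericConsumerDemo are hypothetical names made up for the illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for the proposed Token: it only holds the term text.
class Token {
    private char[] termBuffer = new char[0];
    private int termLength;

    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
    void setTerm(String term) {
        termBuffer = term.toCharArray();
        termLength = termBuffer.length;
    }
}

interface PositionIncrementAttribute {
    int getPositionIncrement();
    void setPositionIncrement(int positionIncrement);
}

// Hypothetical Token subclass that carries only a position increment.
class PositionToken extends Token implements PositionIncrementAttribute {
    private int positionIncrement = 1;
    public int getPositionIncrement() { return positionIncrement; }
    public void setPositionIncrement(int positionIncrement) {
        this.positionIncrement = positionIncrement;
    }
}

// Generic variant of the proposed TokenStream: T is the concrete token type.
abstract class TokenStream<T extends Token> {
    abstract boolean nextToken();
    abstract T prototypeToken();
}

// Hypothetical stream that emits fixed terms, one position apart.
class ListTokenStream extends TokenStream<PositionToken> {
    private final PositionToken token = new PositionToken();
    private final Iterator<String> terms;

    ListTokenStream(List<String> terms) { this.terms = terms.iterator(); }

    boolean nextToken() {
        if (!terms.hasNext()) return false;
        token.setTerm(terms.next());
        token.setPositionIncrement(1);
        return true;
    }

    PositionToken prototypeToken() { return token; }
}

class GenericConsumerDemo {
    // The bound on T exposes both Token and the attribute, so no cast is needed.
    static <T extends Token & PositionIncrementAttribute> int lastPosition(TokenStream<T> ts) {
        T token = ts.prototypeToken();
        int position = 0;
        while (ts.nextToken()) {
            position += token.getPositionIncrement();
        }
        return position;
    }

    public static void main(String[] args) {
        TokenStream<PositionToken> ts =
            new ListTokenStream(Arrays.asList("quick", "brown", "fox"));
        System.out.println(lastPosition(ts)); // prints 3
    }
}
```

The consumer needs no cast here, but it has to name every attribute it wants in its type bound, and any class that passes the stream along would probably have to carry the type parameter as well.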

BTW, what I didn't understand from your description is how the indexing part knows which attributes my Token supports. For example, let's say I create a Token which implements only position increments, no payload, and perhaps some other custom attribute. I generate a TokenStream returning this Token type.
How will Lucene's indexing mechanism know that my Token supports only position increments and, especially, the custom attribute? What will it do with that custom attribute?
Perhaps what you write below is linked to another thread (on flexible indexing, maybe?) which I'm not aware of, so I'd appreciate it if you could give me a reference.
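
The quoted proposal does not answer this question. One possibility, shown here purely as an assumption, is for a consumer to probe the prototype token once with instanceof before it starts reading tokens. All types are simplified stand-ins (the payload is reduced to a byte[]), and PartOfSpeechAttribute, MyToken and AttributeProbe are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the proposed types.
class Token {}

interface PayloadAttribute { byte[] getPayload(); }
interface PositionIncrementAttribute { int getPositionIncrement(); }

// Hypothetical custom attribute that this consumer knows nothing about.
interface PartOfSpeechAttribute { String getPartOfSpeech(); }

// Token with position increments and a custom attribute, but no payload.
class MyToken extends Token implements PositionIncrementAttribute, PartOfSpeechAttribute {
    public int getPositionIncrement() { return 1; }
    public String getPartOfSpeech() { return "noun"; }
}

class AttributeProbe {
    // Inspects the prototype token once, before the token loop would start.
    static List<String> supportedAttributes(Token prototype) {
        List<String> supported = new ArrayList<String>();
        if (prototype instanceof PositionIncrementAttribute) supported.add("positionIncrement");
        if (prototype instanceof PayloadAttribute) supported.add("payload");
        // PartOfSpeechAttribute is never tested for, so this consumer ignores it.
        return supported;
    }

    public static void main(String[] args) {
        System.out.println(supportedAttributes(new MyToken())); // prints [positionIncrement]
    }
}
```

In this sketch a custom attribute only matters to a consumer that was written to look for it; every other consumer skips it without noticing.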

Shai

On Sun, Oct 12, 2008 at 1:33 AM, Michael Busch <buschmic@gmail.com> wrote:
Hi,

I've been thinking about making the TokenStream and Token APIs more flexible. E.g., for fields that don't store positions, the Token doesn't need to have a positionIncrement or a payload. With flexible indexing, on the other hand, people might want to add custom attributes to a Token that a consumer in the indexing chain could then use.

Of course it is possible to extend Token, because it is not final, and add additional attributes to it. But then consumers of the TokenStream must downcast every instance of the Token object when they call next(Token).
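
To illustrate the downcasting problem, here is a small sketch in the style of next(Token). The classes are simplified stand-ins rather than Lucene's own, and WeightedToken, WeightedStream and DowncastingConsumer are hypothetical names.

```java
// Simplified stand-in for Token.
class Token {
    String term = "";
}

// Hypothetical subclass that adds a custom attribute.
class WeightedToken extends Token {
    float weight = 1.0f;
}

// Stand-in for a stream with a next(Token) method; returns null when exhausted.
class WeightedStream {
    private final String[] terms;
    private final float[] weights;
    private int next = 0;

    WeightedStream(String[] terms, float[] weights) {
        this.terms = terms;
        this.weights = weights;
    }

    Token next(Token reusableToken) {
        if (next >= terms.length) return null;
        // This producer ignores reusableToken and returns its own subclass instance.
        WeightedToken token = new WeightedToken();
        token.term = terms[next];
        token.weight = weights[next];
        next++;
        return token;
    }
}

class DowncastingConsumer {
    static float totalWeight(WeightedStream ts) {
        float total = 0f;
        Token reusable = new Token();
        for (Token t = ts.next(reusable); t != null; t = ts.next(reusable)) {
            // The static type is Token, so every returned token must be downcast.
            total += ((WeightedToken) t).weight;
        }
        return total;
    }

    public static void main(String[] args) {
        WeightedStream ts = new WeightedStream(
            new String[] {"a", "b"}, new float[] {0.5f, 1.5f});
        System.out.println(totalWeight(ts)); // prints 2.0
    }
}
```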

I was therefore thinking about a different TokenStream API:

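 import java.io.IOException;

 public abstract class TokenStream {
   /** Advances to the next token; returns false once the stream is exhausted. */
   public abstract boolean nextToken() throws IOException;

   /** Returns the Token instance whose state nextToken() updates. */
   public abstract Token prototypeToken() throws IOException;

   public void reset() throws IOException {}

   public void close() throws IOException {}
 }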

Furthermore, Token itself would keep only the termBuffer logic, and we could introduce separate interfaces, such as:

 public interface PayloadAttribute {
   /**
    * Returns this Token's payload.
    */
   public Payload getPayload();

   /**
    * Sets this Token's payload.
    */
   public void setPayload(Payload payload);
 }

 public interface PositionIncrementAttribute {
   /** Set the position increment.  This determines the position of
    *  this token relative to the previous Token in a
    * {@link TokenStream}, used in phrase searching.
    */
   public void setPositionIncrement(int positionIncrement);

   /** Returns the position increment of this Token.
    * @see #setPositionIncrement
    */
   public int getPositionIncrement();
 }

A consumer, e.g. the DocumentsWriter, no longer creates a Token instance itself, but calls prototypeToken() instead. This method returns a Token subclass which implements all desired *Attribute interfaces.
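
A minimal sketch of the producer side, assuming the stream creates one token and reuses it for every term. Payload, Token, TokenStream and the two attribute interfaces are simplified stand-ins for the proposed types (checked I/O exceptions are left out); FullToken, WhitespaceStream and PrototypeDemo are hypothetical names.

```java
// Simplified stand-ins for the proposed types.
class Payload {
    final byte[] data;
    Payload(byte[] data) { this.data = data; }
}

class Token {
    private char[] termBuffer = new char[0];
    private int termLength;
    char[] termBuffer() { return termBuffer; }
    int termLength() { return termLength; }
    void setTerm(String term) {
        termBuffer = term.toCharArray();
        termLength = termBuffer.length;
    }
}

interface PayloadAttribute {
    Payload getPayload();
    void setPayload(Payload payload);
}

interface PositionIncrementAttribute {
    void setPositionIncrement(int positionIncrement);
    int getPositionIncrement();
}

abstract class TokenStream {
    abstract boolean nextToken();
    abstract Token prototypeToken();
}

// Hypothetical Token subclass implementing both attribute interfaces.
class FullToken extends Token implements PayloadAttribute, PositionIncrementAttribute {
    private Payload payload;
    private int positionIncrement = 1;
    public Payload getPayload() { return payload; }
    public void setPayload(Payload payload) { this.payload = payload; }
    public void setPositionIncrement(int positionIncrement) {
        this.positionIncrement = positionIncrement;
    }
    public int getPositionIncrement() { return positionIncrement; }
}

// Hypothetical stream: splits on whitespace and reuses one FullToken for every term.
class WhitespaceStream extends TokenStream {
    private final FullToken token = new FullToken();
    private final String[] terms;
    private int next = 0;

    WhitespaceStream(String text) { this.terms = text.trim().split("\\s+"); }

    boolean nextToken() {
        if (next >= terms.length) return false;
        token.setTerm(terms[next++]);
        token.setPositionIncrement(1);
        token.setPayload(null); // no payloads in this sketch
        return true;
    }

    Token prototypeToken() { return token; }
}

class PrototypeDemo {
    static String describe(TokenStream ts) {
        Token token = ts.prototypeToken();
        // Cast once, up front; the same token object is updated by nextToken().
        PositionIncrementAttribute positions = (PositionIncrementAttribute) token;
        StringBuilder out = new StringBuilder();
        int position = 0;
        while (ts.nextToken()) {
            position += positions.getPositionIncrement();
            if (out.length() > 0) out.append(' ');
            out.append(new String(token.termBuffer(), 0, token.termLength()))
               .append('@').append(position);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(new WhitespaceStream("quick brown fox")));
        // prints quick@1 brown@2 fox@3
    }
}
```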

If a consumer is, e.g., only interested in the positionIncrement and the payload, it can consume the tokens like this:

 import java.io.IOException;

 public class Consumer {
   public void consumeTokens(TokenStream ts) throws IOException {
     Token token = ts.prototypeToken();

     PayloadAttribute payloadSource = (PayloadAttribute) token;
     PositionIncrementAttribute positionSource =
                   (PositionIncrementAttribute) token;

     while (ts.nextToken()) {
       char[] term = token.termBuffer();
       int termLength = token.termLength();
       int positionIncrement = positionSource.getPositionIncrement();
       Payload payload = payloadSource.getPayload();

       // do something with the term, positionIncrement and payload
     }
   }
 }

Casting is now done only once, after the prototype token has been created. If you later add another consumer to the indexing chain and realize that it needs another attribute on the Token, you don't have to change this consumer. You only need to create another Token subclass that implements the new attribute in addition to the previous ones, and use it in the new consumer.
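
The extension scenario might look roughly like the sketch below. The types are simplified stand-ins for the proposed ones (the term is a plain String and checked I/O exceptions are left out); PartOfSpeechAttribute, TaggedToken, TaggedStream and the two consumer classes are hypothetical names.

```java
// Simplified stand-ins for the proposed types.
class Token {
    private String term = "";
    String term() { return term; }
    void setTerm(String term) { this.term = term; }
}

interface PositionIncrementAttribute { int getPositionIncrement(); }

// Hypothetical attribute introduced later for a new consumer.
interface PartOfSpeechAttribute { String getPartOfSpeech(); }

class PositionToken extends Token implements PositionIncrementAttribute {
    public int getPositionIncrement() { return 1; }
}

// New subclass: keeps the previous attribute and adds the new one.
class TaggedToken extends PositionToken implements PartOfSpeechAttribute {
    private String partOfSpeech = "";
    public String getPartOfSpeech() { return partOfSpeech; }
    void setPartOfSpeech(String partOfSpeech) { this.partOfSpeech = partOfSpeech; }
}

abstract class TokenStream {
    abstract boolean nextToken();
    abstract Token prototypeToken();
}

// Hypothetical stream over {term, tag} pairs that reuses one TaggedToken.
class TaggedStream extends TokenStream {
    private final TaggedToken token = new TaggedToken();
    private final String[][] input;
    private int next = 0;

    TaggedStream(String[][] input) { this.input = input; }

    boolean nextToken() {
        if (next >= input.length) return false;
        token.setTerm(input[next][0]);
        token.setPartOfSpeech(input[next][1]);
        next++;
        return true;
    }

    Token prototypeToken() { return token; }
}

// Written before PartOfSpeechAttribute existed; it needs no change.
class ExistingConsumer {
    static int lastPosition(TokenStream ts) {
        PositionIncrementAttribute positions =
            (PositionIncrementAttribute) ts.prototypeToken();
        int position = 0;
        while (ts.nextToken()) position += positions.getPositionIncrement();
        return position;
    }
}

// Added later; the only code that knows about the new attribute.
class NewConsumer {
    static String tags(TokenStream ts) {
        Token token = ts.prototypeToken();
        PartOfSpeechAttribute pos = (PartOfSpeechAttribute) token;
        StringBuilder out = new StringBuilder();
        while (ts.nextToken()) {
            if (out.length() > 0) out.append(' ');
            out.append(token.term()).append('/').append(pos.getPartOfSpeech());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[][] input = {{"dogs", "NOUN"}, {"bark", "VERB"}};
        System.out.println(ExistingConsumer.lastPosition(new TaggedStream(input))); // prints 2
        System.out.println(tags(new TaggedStream(input))); // prints dogs/NOUN bark/VERB
    }
}
```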

I haven't tried to implement this yet, and maybe there are things I haven't thought about (like caching TokenFilters). I'd like to get some feedback about these APIs first to see if this makes sense.

Btw: if we think this (or another) approach to change these APIs makes sense, then it would be good to change it for 3.0 when we can break backwards compatibility. And then we should also rethink the Fieldable/AbstractField/Field/FieldInfos APIs for 3.0 and flexible indexing!

-Michael
