lucene-dev mailing list archives

From eks dev <>
Subject Re: new TokenStream api Question
Date Tue, 28 Apr 2009 11:11:03 GMT
Hi Michael,
Sure, interfaces are the solution to this. They define what the Lucene core expects from these
entities and give people the freedom to provide any implementation they wish. E.g. users
who do not need offset information can just provide a dummy implementation that returns constants...
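
To make the "dummy implementation" idea concrete, here is a minimal sketch. The interface below is a
hypothetical stand-in for what an OffsetAttribute interface might look like, not the actual Lucene
API, and ConstantOffsetAttribute is a made-up name.

```java
// Hypothetical stand-in for the proposed interface; not the actual Lucene API.
interface OffsetAttribute {
    int startOffset();
    int endOffset();
    void setOffset(int startOffset, int endOffset);
}

// Ordinary implementation that really stores the offsets.
final class OffsetAttributeImpl implements OffsetAttribute {
    private int start;
    private int end;

    public int startOffset() { return start; }
    public int endOffset() { return end; }
    public void setOffset(int startOffset, int endOffset) {
        this.start = startOffset;
        this.end = endOffset;
    }
}

// Dummy implementation for users who never need offsets:
// it ignores updates and always reports the same constants.
final class ConstantOffsetAttribute implements OffsetAttribute {
    public int startOffset() { return 0; }
    public int endOffset() { return 0; }
    public void setOffset(int startOffset, int endOffset) { /* intentionally ignored */ }
}
```

Code written against the interface cannot tell the two apart, which is the freedom meant above.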

The only problem with interfaces is the back-compatibility curse :)

The Offset attribute is a simple enough entity, so I do not believe there will ever be a need
to change its interface.
The same goes for Term, which is just a char[] with offset/length.

Having really simple concepts behind them (and keeping them simple) is what makes interfaces possible...
I see no danger. But as said, the concepts behind them must remain simple.

And by the way, I like the new API.  

Cheers, Eks

From: Michael Busch <>
Sent: Tuesday, 28 April, 2009 10:22:45
Subject: Re: new TokenStream api Question

Hi Eks Dev,

I actually started experimenting with changing the new API slightly to overcome one drawback:
with the variables now distributed over various Attribute classes (vs. being in a single class
Token previously), cloning a "Token" (i.e. calling captureState()) is more expensive. This
slows down the CachingTokenFilter and Tee/Sink-TokenStreams.

So I was thinking about introducing interfaces for each of the Attributes. E.g. OffsetAttribute
would then be an interface with all current methods, and OffsetAttributeImpl would be its
implementation. The user would still use the API in exactly the same way as now, e.g. by
calling addAttribute(OffsetAttribute.class), and the code takes care of instantiating
the right class. However, there would then also be an API to pass in an actual instance, and
this API would use reflection to find all interfaces that the instance implements. All of
those interfaces that extend the Attribute interface would be added to the AttributeSource
map, with the instance as the value.
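
A rough sketch of how such a reflection-based registration could work is below. All names here
(SketchAttributeSource, addAttributeImpl, SimpleToken, the two toy interfaces) are hypothetical
and only illustrate the idea; this is not the actual AttributeSource code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Marker interface, standing in for the proposed Attribute base interface.
interface Attribute {}

interface TermAttribute extends Attribute {
    String term();
    void setTerm(String term);
}

interface TypeAttribute extends Attribute {
    String type();
    void setType(String type);
}

// One class implementing several attribute interfaces, like the proposed Token.
final class SimpleToken implements TermAttribute, TypeAttribute {
    private String term = "";
    private String type = "word";

    public String term() { return term; }
    public void setTerm(String term) { this.term = term; }
    public String type() { return type; }
    public void setType(String type) { this.type = type; }
}

final class SketchAttributeSource {
    private final Map<Class<? extends Attribute>, Attribute> attributes = new LinkedHashMap<>();

    // Registers the instance under every interface it implements that extends Attribute.
    void addAttributeImpl(Attribute instance) {
        for (Class<?> c = instance.getClass(); c != null; c = c.getSuperclass()) {
            for (Class<?> iface : c.getInterfaces()) {
                register(iface, instance);
            }
        }
    }

    private void register(Class<?> iface, Attribute instance) {
        // Skip the marker interface itself and anything unrelated to Attribute.
        if (iface != Attribute.class && Attribute.class.isAssignableFrom(iface)) {
            attributes.putIfAbsent(iface.asSubclass(Attribute.class), instance);
        }
        // Also walk super-interfaces, in case an attribute interface extends another one.
        for (Class<?> parent : iface.getInterfaces()) {
            register(parent, instance);
        }
    }

    <A extends Attribute> A getAttribute(Class<A> key) {
        return key.cast(attributes.get(key));
    }

    int size() { return attributes.size(); }
}
```

After addAttributeImpl(new SimpleToken()), both getAttribute(TermAttribute.class) and
getAttribute(TypeAttribute.class) return the same object in this sketch.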

Then the Token class would implement all six attribute interfaces. An expert user could decide
to pass in a Token instance instead of calling addAttribute(TermAttribute.class),
addAttribute(PayloadAttribute.class), and so on.
Then the attribute source would only contain a single instance that needs to be cloned in
captureState(), making cloning much faster. And a (probably also expert) user could even implement
their own class that implements exactly the necessary interfaces (maybe only 3 of the 6 provided),
and make cloning faster than it is even with the old Token-based API.
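
To illustrate why a single shared instance makes capturing state cheaper, here is a small sketch.
It assumes a captureState() that copies each distinct instance once; the class and method names
are hypothetical, and the real captureState() may work differently.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

final class CaptureStateSketch {
    /** Hypothetical copyable attribute instance. */
    interface Copyable {
        Copyable copy();
    }

    /** Toy attribute holding a single value. */
    static final class Box implements Copyable {
        int value;

        public Copyable copy() {
            Box b = new Box();
            b.value = value;
            return b;
        }
    }

    /** Copies each distinct instance once, no matter how many keys map to it. */
    static List<Copyable> captureState(Map<String, Copyable> attributes) {
        Map<Copyable, Boolean> seen = new IdentityHashMap<>();
        List<Copyable> state = new ArrayList<>();
        for (Copyable instance : attributes.values()) {
            if (seen.put(instance, Boolean.TRUE) == null) {
                state.add(instance.copy());
            }
        }
        return state;
    }
}
```

With three keys pointing at three separate objects this makes three copies; with three keys
pointing at one Token-like object it makes one.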

And of course, in your case you could also just create a different implementation of such an
interface, right? I think what's nice about this change is that it doesn't make the TokenStream
API more complicated to use, and the indexing pipeline still uses it the same way too, yet
it's more extensible for expert users and makes it possible to achieve the same or even better
cloning performance.

I will open a new Jira issue for this soon. But I'd be happy to hear feedback about the proposed
changes, and especially whether you think these changes would help you with your use case.


On 4/27/09 1:49 PM, eks dev wrote:

Should I create a patch with something like this?

With "Expert" javadoc, and explanation what is this good for should be a nice addition to
Attribute cases. Practically, it would enable specialization of "hard linked" Attributes like
TermAttribute.

The only preconditions are:

- "Specialized Attribute" must extend one of the "hard linked" ones, and provide class of it
- Must implement default constructor
- should extend by not introducing state (big majority of cases) (not to break captureState())

The last one could be relaxed I guess, but I am not yet 100% familiar with this code.

Use cases for this are along the lines of my example, smaller, easier user code and performance
(token filters mainly)

----- Original Message -----
From: Uwe Schindler <>
To:
Sent: Sunday, 26 April, 2009 23:03:06
Subject: RE: new TokenStream api Question

There is one problem: if you extend TermAttribute, the class is different (which is the key in
the attributes list). So when you initialize the TokenStream and do a

    YourClass termAtt = (YourClass) addAttribute(YourClass.class)

you create a new attribute. So one possibility would be to also specify the instance and save
the attribute by class (as key), but with your instance. If you are the first one that creates
the attribute (if it is a token stream and not a filter it is ok, you will be the first, adding
the attribute in the ctor), everything is ok. Register the attribute by yourself (maybe we should
add a specialized addAttribute that can specify an instance as default?):

    YourClass termAtt = new YourClass();
    attributes.put(TermAttribute.class, termAtt);

In this case, for the indexer it is a standard TermAttribute, but you can do more with it.

Replacing TermAttribute by your own class is not possible, as the indexer will get a
ClassCastException when using the instance retrieved with getAttribute(TermAttribute.class).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
eMail:

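
The registration trick described in this message can be sketched in isolation as follows. The
TermAttribute class here is a simplified stand-in, and PrefixAwareTermAttribute and
AttributeRegistry are hypothetical names; none of this is the actual Lucene code.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the concrete TermAttribute class discussed in the thread.
class TermAttribute {
    private char[] buffer = new char[0];
    private int length;

    void setTermBuffer(String term) {
        buffer = term.toCharArray();
        length = buffer.length;
    }

    char[] termBuffer() { return buffer; }
    int termLength() { return length; }
}

// Specialization that adds behaviour but no new state, so copying state is unaffected.
class PrefixAwareTermAttribute extends TermAttribute {
    boolean startsWith(char c) {
        return termLength() > 0 && termBuffer()[0] == c;
    }
}

// Toy attribute map keyed by class, as in attributes.put(TermAttribute.class, termAtt).
final class AttributeRegistry {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    void put(Class<?> key, Object instance) { attributes.put(key, instance); }

    Object getAttribute(Class<?> key) { return attributes.get(key); }
}
```

Storing a PrefixAwareTermAttribute under TermAttribute.class lets a consumer cast the result to
TermAttribute as usual, while the filter keeps its own reference and calls startsWith('-').
Storing an unrelated object under that key makes the same cast throw ClassCastException, which
is the failure mentioned for the indexer.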
-----Original Message-----
From: eks dev []
Sent: Sunday, April 26, 2009 10:39 PM
To:
Subject: new TokenStream api Question

I am just looking into new TermAttribute usage and wonder what would be the best way
to implement PrefixFilter that would filter out some Terms that have some prefix,

something like this, where '-' represents my prefix:

    public final boolean incrementToken() throws IOException {
      // the first word we found
      while (input.incrementToken()) {
        int len = termAtt.termLength();
        if (len > 0 && termAtt.termBuffer()[0] != '-') // only length > 0 and non LFs
          return true;
        // note: else we ignore it
      }
      // reached EOS
      return false;
    }

The question would be:

can I extend TermAttribute and add boolean startsWith(char c);

The point is speed and my code gets smaller.
TermAttribute has one method called in termLength() and termBuffer() I do not understand
(back compatibility, I guess)

    public int termLength() {
      initTermBuffer(); // I'd like to avoid it...
      return termLength;
    }

I'd like to get rid of initTermBuffer(), the first option is to *extend* TermAttribute
code (but fields are private, so no help there) or can I implement my own MyTermAttribute
(will Indexer know how to deal with it?)

Must I extend TermAttribute or I can add my own?

thanks,
eks
