lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: New Token API was Re: Payloads and TrieRangeQuery
Date Mon, 15 Jun 2009 17:10:38 GMT

On Jun 14, 2009, at 8:05 PM, Michael Busch wrote:
> I'm not sure why this (currently having to implement next() too) is  
> such an issue for you. You brought it up at the Lucene meetup too.  
> No user will ever have to implement both (the new API and the old)  
> in their streams/filters. The only reason why we did it this way is  
> to not sacrifice performance for existing streams/filters when  
> people switch to Lucene 2.9. I explained this point in the jira issue:
> The only time when we'll ever have to implement both APIs is between  
> now and 2.9, only for new streams and filters that we add before 2.9  
> is released. I don't think it'd be reasonable to consider this  
> disadvantage as a show stopper.

It's an issue b/c I don't like writing dead code and who knows when  
2.9 will actually be out.

I don't think it is a show stopper either.

>> Add on top of it, that the whole point of customizing the chain is  
>> to use it in search and, frankly speaking, somehow I think that  
>> part of the patch was held back.
> I'm not sure what you're implying. Could you elaborate?

Sorry, see my response to Michael M. on this.  I didn't mean to imply  
you were doing something malicious, just that it always felt half done  
to me.  Knowing you, you don't strike me as someone who does things  
half way, so that's why I felt it was held back.  But, as Michael M  
reminded me, it is complex, so please accept my apologies.

> The search side of the API is currently being developed in  
> Lucene-1458. 1458 will not make it into 2.9. Therefore I agree that  
> it is not very advantageous to switch to the new API right now for  
> Lucene users. On the other hand, I don't think it hurts either.

I am not sure I agree here.  Forcing people to upgrade their analyzers  
can be quite involved.  Analyzers are one of the main areas where  
people do custom work.  Solr, for instance, has 11 custom TokenFilters  
right now as well as custom Tokenizers, not to mention the ones used  
during testing that aren't shipped.  Upgrading these is a lot of  
work.  I know in previous jobs, I also maintained a fair amount of  
TokenStream-related code.  This should not be underestimated.   
Furthermore, as I said back in the initial discussion, Lucene's  
Analyzer stuff is often used outside of Lucene.

In fact, I often think the Analysis piece should be a standalone jar  
(not requiring core) and that core should have a dependency on it.  In  
other words, move o.a.l.analysis (and contrib/analysis) to a standalone  
module that core depends on.  This would make it easier for others to  
consume the Analysis functionality.

>> I personally would vote for reverting until a complete patch that  
>> addresses both sides of the problem is submitted and a better  
>> solution to cloning is put forth.
> If we revert now and put a new flexible API like this into 3.x,  
> which I think is necessary to utilize flexible indexing, then we'll  
> have to wait until 4.0 before we can remove the old API.  
> Disadvantages like the one you mentioned above, will then probably  
> be present much longer.
> I mentioned in the following thread that I have started working on a  
> better way of cloning, which will actually be faster compared to the  
> old API. I'll try to get the code out asap.
> I'd be happy to discuss other API proposals that anybody brings up  
> here, that have the same advantages and are more intuitive. We could  
> also beef up the documentation and give a better example about how  
> to convert a stream/filter from the old to the new API; a  
> constructive suggestion that Uwe made at the ApacheCon.

My point here was, at the time, that if others wanted to revert, I  
probably would vote for it.  I'm not proposing we do it, as I think we  
can make do with what we have.  Given the discussion here, I would  
probably change my mind and not support it now.

I think it would help to offer some support for people upgrading.   
Perhaps an abstract class that provides the "core" Token attributes  
out of the box as a base class that they can then extend?  That being  
said, forcing people to upgrade could at least help them think about  
the fact that they have no use for the Type attribute or the Offsets  
attributes.  And, testing the cloning stuff would help.  I think the  
current approach underestimates the number of people who need to  
buffer tokens in memory before handing them out.  Sure, it's not as  
many as the main use case, but it's not zero either.
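
To make the base-class idea concrete, here is a minimal sketch.  It is  
self-contained and does not use the real Lucene classes: SketchStream,  
CoreAttributeStream and WhitespaceSketchTokenizer are hypothetical  
stand-ins, and plain fields stand in for the term, offset, type and  
position increment attributes.  Treat it as an illustration of the  
shape such a helper might take, not as a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: every type here is a hypothetical stand-in, not a Lucene class. */
public class CoreAttributesSketch {

    /** Stand-in for an attribute-based stream: advance, then read the current values. */
    static abstract class SketchStream {
        /** Moves to the next token; returns false when the input is exhausted. */
        abstract boolean incrementToken();
    }

    /** Base class that supplies the "core" token attributes out of the box. */
    static abstract class CoreAttributeStream extends SketchStream {
        protected String term = "";
        protected int startOffset;
        protected int endOffset;
        protected String type = "word";
        protected int positionIncrement = 1;

        /** Resets the core attributes to defaults before a new token is produced. */
        protected void clearCoreAttributes() {
            term = "";
            startOffset = 0;
            endOffset = 0;
            type = "word";
            positionIncrement = 1;
        }
    }

    /** A subclass only fills in the attributes it cares about (here: term and offsets). */
    static class WhitespaceSketchTokenizer extends CoreAttributeStream {
        private final String input;
        private int pos = 0;

        WhitespaceSketchTokenizer(String input) {
            this.input = input;
        }

        @Override
        boolean incrementToken() {
            clearCoreAttributes();
            int n = input.length();
            // Skip leading whitespace.
            while (pos < n && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= n) {
                return false;
            }
            int start = pos;
            while (pos < n && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            term = input.substring(start, pos);
            startOffset = start;
            endOffset = pos;
            return true;
        }
    }

    /**
     * Buffers all tokens in memory. Because the stream reuses its attribute
     * fields, each token's values are copied out before advancing again.
     */
    public static List<String> collect(String text) {
        WhitespaceSketchTokenizer stream = new WhitespaceSketchTokenizer(text);
        List<String> buffered = new ArrayList<>();
        while (stream.incrementToken()) {
            buffered.add(stream.term + "[" + stream.startOffset + "," + stream.endOffset + "]");
        }
        return buffered;
    }

    public static void main(String[] args) {
        System.out.println(collect("new token api"));
    }
}
```

The collect method is also where the buffering concern shows up: since  
the stream reuses its state from one token to the next, anything that  
holds tokens in memory has to copy the values out first.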
