lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Deprecation of Token constructors in 2.4
Date Tue, 02 Dec 2008 01:31:30 GMT
This was done under LUCENE-1333, in moving Token to the "reuse" API.

The general idea is to deprecate the String based APIs (whose
performance got worse because of the switch to char[]), in favor of
the char[] APIs.

That said, I do agree it's still useful to have String based variants,
at the expense of performance, for those cases that want convenience
and don't care much about performance.

However, we can't silently cause an existing API to have worse
performance (it's a break to back-compatability).  So I think we
should instead deprecate the old method and introduce a new one with a
clear warning about the performance penalty.  EG, we did this with
termText (now deprecated in favor of term()).  This way on upgrading,
people will have to confront the fact that its performance is now
worse than before.

I'm not sure what ctor we could add that could take String, and
doesn't already exist.

Further... the changes for LUCENE-1422 (committed to 2.9-dev) make
this question moot since the whole Token class is now deprecated.


Shai Erera wrote:

> Hi
> I moved to use Lucene 2.4 and noticed that some of the Token  
> constructors were marked deprecated. Specifically, I'm talking about  
> the Token(String, int, int), where the String is the word to  
> populate the token with, and the two ints are startOffset and  
> endOffset respectively.
> That was an amazingly convenient constructor. I understand that it  
> is discouraged to use it, and that the one that accepts a char[] is  
> better, but there are cases, during indexing, where all you have at  
> hand is a String and not a char[].
> For example, suppose that you want to add a certain token to a  
> document. You can do this by adding a Field with a TokenStream,  
> where the TS will create a new Token(), populate it with the value  
> to add and return it.
> Before 2.4, the code was simply:
> return new Token(word, start, end);
> After Lucene 2.4, the code looks like this:
> Token t = new Token();
> t.setTermBuffer(word, 0, word.length());
> t.setStartOffset(start);
> t.setEndOffset(end);
> return t;
> Instead of a one liner, I now have to write 5 (!) lines of code  
> whenever I want to do something like this. And ... the fact that I  
> can call setTermBuffer(String, int, int) (like I do in the 2nd line  
> of the code) does not prevent me from using Strings at all. The only  
> thing that the deprecation of the constructor achieves is  
> complicating the code developers need to write.
> IMO, there is a huge difference between removing a convenient, yet  
> inefficient, method, than simply document that it's discouraged to  
> use it. After all, if I choose to use it at my own expense, I can do  
> it and face whatever consequences there will be.
> In fact, when I moved to use setTermBuffer, I actually introduced a  
> bug in my code. The reason is that setTermBuffer accepts to integers  
> which specify the offset and length of the internal char[] of Token,  
> rather than my start and end offset I used to use when I had the  
> Token(String, int, int) constructor. That's confusing.
> And if I raise the deprecation issue, the method termText() which  
> returns a String was also a convenient (eventhough inefficient)  
> method, for all kind of purposes amongst which are debugging or  
> printing. But not only that - Java makes use of String in so many  
> places, it's really hard to stay with a char[] for long, as soon as  
> you start involving Lucene code with other Java data structures. So  
> instead of calling termText() (and knowing it's inefficient, and  
> even document it), I now have to write new String(t.termBuffer(), 0,  
> t.termLength()).
> I would like to ask the developers community - what is the strategy  
> of deprecating methods? Again, we should document when certain  
> methods are inefficient, rather than deprecating them, and thus  
> forcing people to write more cumbersome code.
> A good example for a justified deprecation is IndexWriter.docCount()  
> method, which recommends to use maxDoc() or numDocs() (if you want  
> to take into account deletions). There are two reasons it's justified:
> 1. Replacing docCount() with maxDoc() does not complicate my code.
> 2. docCount() is confusing (is it maxDoc() or numDocs()?).
> On that notion, the deprecation of simply forces  
> people who want to store the Tokens output TS to call  
> Token()). Not a huge inconvenience, but IMO an  
> unnecessary deprecation.
> I'm not sure this is new stuff to you, and perhaps if was even  
> raised on this list before. I'm also aware of the efforts to  
> completely change the tokenization process in Lucene. However, I  
> think that if we could undeprecate some of the deprecated methods,  
> we'd do a great service to the developers community.
> Shai

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message