lucene-dev mailing list archives

From "Shai Erera" <ser...@gmail.com>
Subject Deprecation of Token constructors in 2.4
Date Wed, 26 Nov 2008 21:36:53 GMT
Hi

I moved to use Lucene 2.4 and noticed that some of the Token constructors
were marked deprecated. Specifically, I'm talking about the Token(String,
int, int), where the String is the word to populate the token with, and the
two ints are startOffset and endOffset respectively.
That was an amazingly convenient constructor. I understand that it is
discouraged to use it, and that the one that accepts a char[] is better, but
there are cases, during indexing, where all you have at hand is a String and
not a char[].

For example, suppose that you want to add a certain token to a document. You
can do this by adding a Field with a TokenStream, where the TS will create a
new Token(), populate it with the value to add and return it.

Before 2.4, the code was simply:
return new Token(word, start, end);

After Lucene 2.4, the code looks like this:
Token t = new Token();
t.setTermBuffer(word, 0, word.length());
t.setStartOffset(start);
t.setEndOffset(end);
return t;

Instead of a one liner, I now have to write 5 (!) lines of code whenever I
want to do something like this. And ... the fact that I can call
setTermBuffer(String, int, int) (like I do in the 2nd line of the code) does
not prevent me from using Strings at all. The only thing that the
deprecation of the constructor achieves is complicating the code developers
need to write.
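
One way to get the one-liner back is a small static factory that wraps the setter calls shown above. The following is only a sketch: SketchToken is a stand-in written for this example (it is not Lucene's Token class and mimics only the three setters used in this message), and Tokens.newToken is a hypothetical helper name.

```java
// Stand-in for Lucene's Token, limited to the setters used in this message.
// It is NOT the real org.apache.lucene.analysis.Token class.
class SketchToken {
    private char[] termBuffer = new char[0];
    private int termLength;
    private int startOffset;
    private int endOffset;

    // Copies `length` chars of `buffer`, starting at `offset`, into the term.
    void setTermBuffer(String buffer, int offset, int length) {
        termBuffer = new char[length];
        buffer.getChars(offset, offset + length, termBuffer, 0);
        termLength = length;
    }

    void setStartOffset(int offset) { startOffset = offset; }
    void setEndOffset(int offset) { endOffset = offset; }

    String term() { return new String(termBuffer, 0, termLength); }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
}

// Hypothetical helper, not part of Lucene: restores the
// (word, start, end) one-liner on top of the setter-based API.
class Tokens {
    static SketchToken newToken(String word, int start, int end) {
        SketchToken t = new SketchToken();
        t.setTermBuffer(word, 0, word.length()); // offset/length into word
        t.setStartOffset(start);                 // offsets into the document
        t.setEndOffset(end);
        return t;
    }
}
```

With such a helper the calling code shrinks back to return Tokens.newToken(word, start, end); but every project has to write and maintain it separately.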

IMO, there is a huge difference between removing a convenient, yet
inefficient, method and simply documenting that its use is discouraged.
After all, if I choose to use it at my own expense, I can do it and face
whatever consequences there will be.

In fact, when I moved to use setTermBuffer, I actually introduced a bug in
my code. The reason is that setTermBuffer accepts two integers which specify
the offset and length of the internal char[] of Token, rather than my start
and end offset I used to use when I had the Token(String, int, int)
constructor. That's confusing.
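
The mix-up can be reproduced without Lucene. The sketch below assumes that the two integers select a range of the String that is passed in and that the copy behaves like String.getChars; copyTerm and failsWith are hypothetical names used only for this illustration.

```java
class TermBufferArgs {
    // Stand-in for the copy step of setTermBuffer(String, int, int).
    // Assumption: (offset, length) index into the String argument.
    static String copyTerm(String buffer, int offset, int length) {
        char[] term = new char[length];
        buffer.getChars(offset, offset + length, term, 0);
        return new String(term);
    }

    // Returns true if the copy fails because the range lies outside the word.
    static boolean failsWith(String word, int offset, int length) {
        try {
            copyTerm(word, offset, length);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String word = "quick"; // occupies offsets 4..9 in "the quick fox"

        // Correct: offset and length are relative to the word itself.
        System.out.println(copyTerm(word, 0, word.length()));

        // Bug: passing the document offsets (start=4, end=9) instead.
        System.out.println(failsWith(word, 4, 9));

        // The same bug goes unnoticed for a token at the start of the text,
        // where start=0 and end=5 happen to match offset=0 and length=5.
        System.out.println(failsWith(word, 0, 5));
    }
}
```

Under these assumptions the mistake stays hidden for the first token of a text and only surfaces for later ones, which makes it easy to miss in a quick test.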

And if I raise the deprecation issue, the method termText() which returns a
String was also a convenient (even though inefficient) method, for all kinds
of purposes amongst which are debugging or printing. But not only that -
Java makes use of String in so many places, it's really hard to stay with a
char[] for long, as soon as you start involving Lucene code with other Java
data structures. So instead of calling termText() (and knowing it's
inefficient, and even document it), I now have to write new
String(t.termBuffer(), 0, t.termLength()).
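
A tiny helper can hide that expression. This is a sketch only: TermTexts.termText is a hypothetical name, it takes the buffer and length directly so that it compiles without Lucene, and it assumes the buffer may be longer than the term it holds (which is what the explicit termLength() argument suggests).

```java
class TermTexts {
    // Hypothetical replacement for the deprecated termText():
    // builds a String from the first termLength chars of the buffer.
    static String termText(char[] termBuffer, int termLength) {
        return new String(termBuffer, 0, termLength);
    }

    public static void main(String[] args) {
        // A buffer with spare capacity beyond the 3-char term "fox".
        char[] buffer = {'f', 'o', 'x', '\0', '\0'};
        System.out.println(termText(buffer, 3));
        // new String(buffer) would also include the unused trailing slots.
        System.out.println(new String(buffer).length());
    }
}
```

With a real Token the call would presumably be termText(t.termBuffer(), t.termLength()).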

I would like to ask the developers community - what is the strategy of
deprecating methods? Again, we should document when certain methods are
inefficient, rather than deprecating them, and thus forcing people to write
more cumbersome code.

A good example of a justified deprecation is the IndexWriter.docCount() method,
which recommends to use maxDoc() or numDocs() (if you want to take into
account deletions). There are two reasons it's justified:
1. Replacing docCount() with maxDoc() does not complicate my code.
2. docCount() is confusing (is it maxDoc() or numDocs()?).

On that note, the deprecation of TokenStream.next() simply forces people
who want to store the Tokens that a TS outputs to call TokenStream.next(new
Token()). Not a huge inconvenience, but IMO an unnecessary deprecation.

I'm not sure this is new stuff to you, and perhaps it was even raised on
this list before. I'm also aware of the efforts to completely change the
tokenization process in Lucene. However, I think that if we could
undeprecate some of the deprecated methods, we'd do a great service to the
developers community.

Shai
