lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Tokenizer.reset()
Date Sun, 13 May 2012 08:32:22 GMT
Hi

Someone asked why the following tiny test does not work:

  String text = "Hello world1. Hello world2";
  Tokenizer tokenizer = WhitespaceTokenizer(Version.LUCENE_36, new
StringReader(text));
  int count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4,count);

  // expecting reset() to do what it states:
  tokenizer.reset();

  count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4,count);  // HERE IT FAILS

The answer was easy -- WhitespaceTokenizer (or Tokenizer) do not implement
reset() in any special way. But then when we reviewed the javadocs of
reset() in 3.6 and trunk, I became confused myself:

In 3.6, it's mentioned that the method resets the stream to the beginning,
however it is optional and sub-classes may or may not implement this.

In trunk it doesn't say 'optionally' but rather "Resets this stream to the
beginning".

I'm not getting into text semantics, but rather want to ask why wouldn't
Tokenizer override reset() to call input.reset()? Reader has a reset()
method and some Readers, like StringReader, even implement it properly. We
can still say in the jdocs that reset() depends on the Reader.reset()
implementation and it may or may not reset the stream.

I don't know if it's a bug that Tokenizer.reset() doesn't do that, or not
-- I'm sure someone like Robert or Uwe know the answer :).

Alternatively, wouldn't it be better if reset() threw
UnsupportedOperationException in TokenStream and any other Tokenizer that
cannot support it? Otherwise, you have no way telling whether reset() is
supported or not, besides reading the Tokenizer's code. Maybe in trunk we
should just make TokenStream.reset() abstract?

BTW, even reading the code does not always help -- WikipediaTokenizer does
implement reset() by calling super.reset() + scanner.reset(), but neither
call input.reset() ... so I'm not sure if WikiTokenizer is buggy or not (so
maybe it does help to read the code :)).

Shai

Mime
View raw message