lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Shai Erera <ser...@gmail.com>
Subject Re: Tokenizer.reset()
Date Sun, 13 May 2012 08:36:12 GMT
I found that there's an open issue LUCENE-2191 about renaming reset(Reader)
to setReader(), due to confusing APIs. I didn't see any buggy behavior
mentioned there, but I sense that reset() and reset(Reader) confused other
people too :). So perhaps this fix (if indeed others will agree it needs to
be fixed) can happen under that issue.

Shai

On Sun, May 13, 2012 at 11:32 AM, Shai Erera <serera@gmail.com> wrote:

> Hi
>
> Someone asked why the following tiny test does not work:
>
>   String text = "Hello world1. Hello world2";
>   Tokenizer tokenizer = WhitespaceTokenizer(Version.LUCENE_36, new
> StringReader(text));
>   int count = 0;
>   while (tokenizer.incrementToken()) {
>     count++;
>   }
>   assertEquals(4,count);
>
>   // expecting reset() to do what it states:
>   tokenizer.reset();
>
>   count = 0;
>   while (tokenizer.incrementToken()) {
>     count++;
>   }
>   assertEquals(4,count);  // HERE IT FAILS
>
> The answer was easy -- WhitespaceTokenizer (or Tokenizer) do not implement
> reset() in any special way. But then when we reviewed the javadocs of
> reset() in 3.6 and trunk, I became confused myself:
>
> In 3.6, it's mentioned that the method resets the stream to the beginning,
> however it is optional and sub-classes may or may not implement this.
>
> In trunk it doesn't say 'optionally' but rather "Resets this stream to the
> beginning".
>
> I'm not getting into text semantics, but rather want to ask why wouldn't
> Tokenizer override reset() to call input.reset()? Reader has a reset()
> method and some Readers, like StringReader, even implement it properly. We
> can still say in the jdocs that reset() depends on the Reader.reset()
> implementation and it may or may not reset the stream.
>
> I don't know if it's a bug that Tokenizer.reset() doesn't do that, or not
> -- I'm sure someone like Robert or Uwe know the answer :).
>
> Alternatively, wouldn't it be better if reset() threw
> UnsupportedOperationException in TokenStream and any other Tokenizer that
> cannot support it? Otherwise, you have no way telling whether reset() is
> supported or not, besides reading the Tokenizer's code. Maybe in trunk we
> should just make TokenStream.reset() abstract?
>
> BTW, even reading the code does not always help -- WikipediaTokenizer does
> implement reset() by calling super.reset() + scanner.reset(), but neither
> call input.reset() ... so I'm not sure if WikiTokenizer is buggy or not (so
> maybe it does help to read the code :)).
>
> Shai
>

Mime
View raw message