Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of serera@gmail.com designates
 209.85.214.176 as permitted sender)
MIME-Version: 1.0
Date: Sun, 13 May 2012 11:32:22 +0300
Message-ID: 
 <CALfq-2RqS+4qLrBzm9cp6jAO6xg2d4tgDdQwuut9FfZPA1FKKg@mail.gmail.com>
Subject: Tokenizer.reset()
From: Shai Erera <serera@gmail.com>
To: dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=14dae9399833f0acc904bfe6ce49

--14dae9399833f0acc904bfe6ce49
Content-Type: text/plain; charset=ISO-8859-1

Hi

Someone asked why the following tiny test does not work:

  String text = "Hello world1. Hello world2";
  Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36,
      new StringReader(text));
  int count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4,count);

  // expecting reset() to do what it states:
  tokenizer.reset();

  count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4,count);  // HERE IT FAILS

The answer was easy -- neither WhitespaceTokenizer nor Tokenizer implements
reset() in any special way. But when we reviewed the javadocs of reset() in
3.6 and trunk, I became confused myself:

In 3.6, the javadocs say that the method resets the stream to the beginning,
but that it is optional and subclasses may or may not implement it.

In trunk, the javadocs no longer say 'optional'; they say only "Resets this
stream to the beginning".

I'm not getting into text semantics, but rather want to ask why Tokenizer
doesn't override reset() to call input.reset(). Reader has a reset() method
and some Readers, like StringReader, even implement it properly. We can
still say in the javadocs that reset() depends on the Reader.reset()
implementation and that it may or may not reset the stream.
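
To make the idea concrete, here is a rough sketch of what I mean. It is
plain java.io only and does not use the Lucene classes: the
ToyWhitespaceTokenizer class and its countTokens() helper are made up for
illustration, and only Reader.reset() / StringReader are real APIs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for a Tokenizer; this is NOT the Lucene class.
class ToyWhitespaceTokenizer {
    private final Reader input;
    private final StringBuilder term = new StringBuilder();

    ToyWhitespaceTokenizer(Reader input) {
        this.input = input;
    }

    /** Advances to the next whitespace-delimited token; false at end of input. */
    boolean incrementToken() throws IOException {
        term.setLength(0);
        int c;
        while ((c = input.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (term.length() > 0) {
                    return true; // whitespace closes the current token
                }
                // otherwise skip leading/repeated whitespace
            } else {
                term.append((char) c);
            }
        }
        return term.length() > 0; // last token may end at end of input
    }

    /** Text of the token produced by the last successful incrementToken(). */
    String term() {
        return term.toString();
    }

    /** Delegates to Reader.reset(); whether it rewinds depends on the Reader. */
    void reset() throws IOException {
        input.reset();
    }

    /** Consumes the remaining tokens and returns how many there were. */
    static int countTokens(ToyWhitespaceTokenizer t) throws IOException {
        int count = 0;
        while (t.incrementToken()) {
            count++;
        }
        return count;
    }
}
```

With a StringReader that was never marked, reset() goes back to the start
of the string, so counting, calling reset() and counting again gives the
same number both times. With a Reader that does not support reset(), the
call throws IOException instead.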

I don't know whether it's a bug that Tokenizer.reset() doesn't do that
-- I'm sure someone like Robert or Uwe knows the answer :).

Alternatively, wouldn't it be better if reset() threw
UnsupportedOperationException in TokenStream and in any Tokenizer that
cannot support it? Otherwise, you have no way of telling whether reset() is
supported, short of reading the Tokenizer's code. Maybe in trunk we
should just make TokenStream.reset() abstract?
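
As a rough sketch of the "fail loudly" option, again with plain java.io
only: StrictReset and resetOrThrow() are hypothetical names, not anything
that exists in Lucene. The only real behavior relied on is that
Reader.reset() throws IOException when the Reader cannot be reset.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper, not part of Lucene.
final class StrictReset {
    private StrictReset() {}

    /**
     * Rewinds the reader, or throws UnsupportedOperationException if the
     * reader refuses. Assumption: any IOException from reset() is treated
     * as "not supported", which also covers genuine I/O failures.
     */
    static void resetOrThrow(Reader input) {
        try {
            input.reset();
        } catch (IOException e) {
            throw new UnsupportedOperationException(
                "reset() not supported by " + input.getClass().getName(), e);
        }
    }
}
```

A StringReader passes through silently, while something like an
InputStreamReader (which inherits the default Reader.reset()) ends up in
the UnsupportedOperationException branch, so the caller finds out at the
call site rather than by getting zero tokens on the second pass.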

BTW, even reading the code does not always help -- WikipediaTokenizer does
implement reset() by calling super.reset() + scanner.reset(), but neither
calls input.reset() ... so I'm not sure if WikipediaTokenizer is buggy or
not (so maybe it does help to read the code :)).

Shai

--14dae9399833f0acc904bfe6ce49--