Date: Sun, 13 May 2012 11:32:22 +0300
Subject: Tokenizer.reset()
From: Shai Erera
To: dev@lucene.apache.org

Hi

Someone asked why the following tiny test does not work:

  String text = "Hello world1. Hello world2";
  Tokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_36, new StringReader(text));
  int count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4, count);

  // expecting reset() to do what it states:
  tokenizer.reset();

  count = 0;
  while (tokenizer.incrementToken()) {
    count++;
  }
  assertEquals(4, count);  // HERE IT FAILS

The answer was easy -- WhitespaceTokenizer (or Tokenizer) does not implement reset() in any special way. But then, when we reviewed the javadocs of reset() in 3.6 and trunk, I became confused myself:

In 3.6, the javadoc says that the method resets the stream to the beginning, but that it is optional and subclasses may or may not implement it.

In trunk it doesn't say 'optionally', but rather "Resets this stream to the beginning".

I'm not getting into text semantics, but rather want to ask: why wouldn't Tokenizer override reset() to call input.reset()? Reader has a reset() method, and some Readers, like StringReader, even implement it properly. We can still say in the jdocs that reset() depends on the Reader.reset() implementation, and that it may or may not reset the stream.
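
To make the Reader side of this concrete, here is a small sketch that uses only java.io and no Lucene classes. The class name ReaderResetDemo and the helpers drain() and tryReset() are made up for illustration. It shows that a StringReader with no mark set rewinds to the beginning on reset(), while a Reader that does not support reset() throws IOException:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ReaderResetDemo {
    // Reads the reader to exhaustion and returns everything that was read.
    static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if reset() succeeded, false if the reader refused it.
    static boolean tryReset(Reader reader) {
        try {
            reader.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Reader input = new StringReader("Hello world1. Hello world2");
        String first = drain(input);
        String leftover = drain(input);   // reader is exhausted here
        boolean rewound = tryReset(input); // no mark set: back to the start
        String second = drain(input);
        System.out.println("rewound=" + rewound
            + ", leftoverEmpty=" + leftover.isEmpty()
            + ", sameText=" + first.equals(second));

        // InputStreamReader inherits Reader.reset(), which throws IOException.
        Reader unsupported = new InputStreamReader(new ByteArrayInputStream(new byte[0]));
        System.out.println("unsupported reset ok? " + tryReset(unsupported));
    }
}
```

So a Tokenizer.reset() that delegated to input.reset() would work for the test above, but would have to deal with the IOException case for other Readers.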

I don't know if it's a bug that Tokenizer.reset() doesn't do that, or not -- I'm sure someone like Robert or Uwe knows the answer :).

Alternatively, wouldn't it be better if reset() threw UnsupportedOperationException in TokenStream and in any Tokenizer that cannot support it? Otherwise, you have no way of telling whether reset() is supported or not, besides reading the Tokenizer's code. Maybe in trunk we should just make TokenStream.reset() abstract?
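
A rough sketch of what the "throw by default" idea could look like. These are hypothetical stand-in classes (SketchStream, RewindableStream, OneShotStream), not the Lucene TokenStream API, and the names are invented for this mail:

```java
import java.util.Arrays;
import java.util.List;

public class ResetContractSketch {
    // Hypothetical stand-in for TokenStream; not the Lucene class.
    static abstract class SketchStream {
        abstract boolean incrementToken();

        // Default: fail loudly instead of silently doing nothing.
        void reset() {
            throw new UnsupportedOperationException(
                getClass().getSimpleName() + " does not support reset()");
        }
    }

    // Can rewind because it holds its tokens in memory.
    static final class RewindableStream extends SketchStream {
        private final List<String> tokens;
        private int pos = 0;

        RewindableStream(String... tokens) {
            this.tokens = Arrays.asList(tokens);
        }

        @Override
        boolean incrementToken() {
            if (pos < tokens.size()) {
                pos++;
                return true;
            }
            return false;
        }

        @Override
        void reset() {
            pos = 0;
        }
    }

    // Cannot rewind; inherits the throwing default.
    static final class OneShotStream extends SketchStream {
        private int remaining;

        OneShotStream(int tokenCount) {
            this.remaining = tokenCount;
        }

        @Override
        boolean incrementToken() {
            if (remaining > 0) {
                remaining--;
                return true;
            }
            return false;
        }
    }

    // Counts the tokens left in the stream.
    static int count(SketchStream stream) {
        int n = 0;
        while (stream.incrementToken()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        SketchStream ok = new RewindableStream("Hello", "world1.", "Hello", "world2");
        System.out.println(count(ok));
        ok.reset();
        System.out.println(count(ok));

        try {
            new OneShotStream(4).reset();
        } catch (UnsupportedOperationException e) {
            System.out.println("caller finds out immediately: " + e.getMessage());
        }
    }
}
```

With this shape the failing test would blow up at the reset() call itself rather than at the second assertEquals, which is the point of the proposal. Making reset() abstract would instead force every subclass to decide at compile time.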

BTW, even reading the code does not always help -- WikipediaTokenizer does implement reset() by calling super.reset() + scanner.reset(), but neither calls input.reset() ... so I'm not sure if WikiTokenizer is buggy or not (so maybe it does help to read the code :)).
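
To illustrate the concern, here is a toy whitespace tokenizer. It is not the WikipediaTokenizer code; ToyTokenizer and its two reset methods are invented for this example. It shows what happens when reset() clears internal state but leaves the underlying Reader at end of input:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class StateOnlyResetSketch {
    // Hypothetical tokenizer; stands in for any Tokenizer with internal state.
    static final class ToyTokenizer {
        private final Reader input;
        private int tokensSeen = 0; // internal state, loosely analogous to scanner state

        ToyTokenizer(Reader input) {
            this.input = input;
        }

        // Consumes one whitespace-delimited token; false at end of input.
        boolean incrementToken() {
            try {
                int c = input.read();
                while (c != -1 && Character.isWhitespace(c)) {
                    c = input.read();
                }
                if (c == -1) {
                    return false;
                }
                while (c != -1 && !Character.isWhitespace(c)) {
                    c = input.read();
                }
                tokensSeen++;
                return true;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Clears internal state only; the reader stays exhausted.
        void resetStateOnly() {
            tokensSeen = 0;
        }

        // Clears internal state and rewinds the reader, if it supports that.
        void resetStateAndInput() {
            tokensSeen = 0;
            try {
                input.reset();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Counts the tokens left in the tokenizer.
    static int count(ToyTokenizer tokenizer) {
        int n = 0;
        while (tokenizer.incrementToken()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        ToyTokenizer t = new ToyTokenizer(new StringReader("Hello world1. Hello world2"));
        System.out.println(count(t)); // first pass over the text
        t.resetStateOnly();
        System.out.println(count(t)); // input is still exhausted
        t.resetStateAndInput();
        System.out.println(count(t)); // input rewound, tokens come back
    }
}
```

Whether the state-only behaviour is a bug depends on what contract reset() is supposed to have, which is exactly the open question here.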

Shai