From: "Uwe Schindler" <uwe@thetaphi.de>
To: java-dev@lucene.apache.org
Subject: RE: Bug in StandardAnalyzer + StopAnalyzer?
Date: Sun, 15 Nov 2009 18:45:50 +0100

Yes, but on the other hand it does not hurt to automatically reset in the analyzer... *krr* I do not know how to proceed. I think we should keep it as it has been since the beginning of Lucene (the call to reset() inside the analyzer and in QueryParser) and document it correctly.

You are right: at the beginning, BaseTokenStreamTestCase and many other hardcoded tests did not call reset(). Now the test also calls end() and close().

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Sunday, November 15, 2009 6:40 PM
To: java-dev@lucene.apache.org
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?

 

ok, at one point I do not think BaseTokenStreamTestCase did.

If this is the case, then it's the consumer's responsibility to call reset(), and we should remove the extra reset() calls inside reusableTokenStream() from the analyzers that have them... and probably improve the docs of this contract.
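
For reference, a minimal sketch of a consumer that follows this contract against the 2.9-era reuse API; the field name, analyzer choice, and printing are only illustrative, not taken from any existing consumer:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class ReuseContractSketch {

  // Consume one piece of text, honoring the contract as discussed here:
  // reset() before the first incrementToken(), then end() and close().
  static void consume(Analyzer analyzer, String field, String text) throws IOException {
    TokenStream ts = analyzer.reusableTokenStream(field, new StringReader(text));
    TermAttribute term = (TermAttribute) ts.addAttribute(TermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.term());
    }
    ts.end();
    ts.close();
  }

  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new WhitespaceAnalyzer();
    // Reusing the same analyzer for two texts exercises reusableTokenStream().
    consume(analyzer, "body", "First document");
    consume(analyzer, "body", "Second document");
  }
}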

On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

Even QueryParser calls reset() as its first call. BaseTokenStreamTestCase does it as well.

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


From: Robert Muir [mailto:rcmuir@gmail.com]
Sent: Sunday, November 15, 2009 6:14 PM


To: java-dev@lucene.apache.org
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?

 

Uwe, I'm not so sure it doesn't need to be there; what about other consumers such as QueryParser?

On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

I checked again: reset() on the top filter does not need to be there, as the indexer calls it automatically as the first call after reusableTokenStream(). For reuse, only reset(Reader) must be called. It's a little bit strange that both methods have the same name; the reset(Reader) one has a completely different meaning.
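
To spell out how the two same-named methods differ, a paraphrased sketch of a Tokenizer-like class (names and fields simplified; not the actual Lucene source):

import java.io.IOException;
import java.io.Reader;

// Paraphrased sketch only; the real class is org.apache.lucene.analysis.Tokenizer.
// The point is just that the two reset methods serve unrelated purposes.
abstract class TokenizerSketch {
  protected Reader input;

  // reset(): per-stream reset, called by the consumer before iterating tokens again.
  // The base implementation is a no-op; subclasses clear their internal state here.
  public void reset() throws IOException {
  }

  // reset(Reader): reuse plumbing, called by the analyzer to hand the reused
  // Tokenizer the next document's Reader.
  public void reset(Reader input) throws IOException {
    this.input = input;
  }
}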

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


From: Uwe Schindler [mailto:uwe@thetaphi.de]
Sent: Sunday, November 15, 2009 5:56 PM
To: java-dev@lucene.apache.org
Subject: RE: Bug in StandardAnalyzer + StopAnalyzer?

 

It should be there... but it is unimplemented in the TokenFilters used by the Standard/Stop analyzers. Still, for consistency it should be there. I'll talk with Robert Muir about it.

 

Uwe

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


From: Eran Sevi [mailto:eransevi@gmail.com]
Sent: Sunday, November 15, 2009 5:51 PM
To: java-dev@lucene.apache.org
Subject: Re: Bug in StandardAnalyzer + StopAnalyzer?

 

Good point. I missed that part :) Since only the tokenizer uses the reader, we must call it directly.

So was the reset() on the filteredTokenStream omitted on purpose because there's no underlying implementation, or is it really missing?

On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

It must call both reset() on the top-level TokenStream and reset(Reader) on the Tokenizer. If the latter is not done, how should the TokenStream get its new Reader?
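
A minimal sketch of an analyzer built around this pattern may help; the class and field names here are invented for illustration, following the 2.9-era SavedStreams idiom rather than the actual StandardAnalyzer source:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical minimal analyzer, only to show where the two reset calls live.
public class SimpleReusableAnalyzer extends Analyzer {
  private static final class SavedStreams {
    Tokenizer source;
    TokenStream result;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(new WhitespaceTokenizer(reader));
  }

  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    SavedStreams streams = (SavedStreams) getPreviousTokenStream();
    if (streams == null) {
      streams = new SavedStreams();
      streams.source = new WhitespaceTokenizer(reader);
      streams.result = new LowerCaseFilter(streams.source);
      setPreviousTokenStream(streams);
    } else {
      // reset(Reader): hand the reused Tokenizer its new input
      streams.source.reset(reader);
    }
    // the no-arg reset() on streams.result is left to the consumer
    // (indexer, QueryParser, tests), per the contract discussed above
    return streams.result;
  }
}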

 

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


From: Eran Sevi [mailto:eransevi@gmail.com]
Sent: Sunday, November 15, 2009 5:19 PM
To: java-dev@lucene.apache.org
Subject: Bug in StandardAnalyzer + StopAnalyzer?

 

Hi,
When changing my code to support the not-so-new reusableTokenStream I noticed that, in the cases where a SavedStreams class is used in an analyzer (Standard, Stop, and maybe others as well), the reset() method is called on the tokenizer instead of on the filter.

The filter implementation of reset() calls the inner filters' and input's reset() methods, but the tokenizer's reset() method can't do that.
I think this bug hasn't caused any errors so far, since none of the filters used in these analyzers overrides the reset() method, but it might cause problems if the implementation changes in the future.
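
To illustrate what could break, here is a hypothetical stateful filter (invented for this example, not one of the stock filters). Because TokenFilter.reset() delegates to input.reset(), calling reset() on the outermost filter clears the whole chain, while calling it only on the tokenizer would leave this filter's state behind:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical stateful filter, only to show why reset() must reach every filter.
public final class CountingFilter extends TokenFilter {
  private int seen = 0;   // per-stream state that must be cleared on reuse

  public CountingFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    seen++;
    return true;
  }

  public void reset() throws IOException {
    super.reset();        // propagates to the wrapped input
    seen = 0;             // forget state from the previous document
  }
}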

The fix is very simple. For example (in StandardAnalyzer):

if (streams == null) {
      streams = new SavedStreams();
      setPreviousTokenStream(streams);
      streams.tokenStream = new StandardTokenizer(matchVersion, reader);
      streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
      streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
      streams.filteredTokenStream = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                                   streams.filteredTokenStream, stopSet);
    } else {
      streams.tokenStream.reset(reader);
    }

should become:

if (streams == null) {
      streams = new SavedStreams();
      setPreviousTokenStream(streams);
      streams.tokenStream = new StandardTokenizer(matchVersion, reader);
      streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
      streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
      streams.filteredTokenStream = new StopFilter(StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
                                                   streams.filteredTokenStream, stopSet);
    } else {
      streams.filteredTokenStream.reset(); // changed line
    }


What do you think?

Eran.

 




--
Robert Muir
rcmuir@gmail.com




--
Robert Muir
rcmuir@gmail.com
