lucene-dev mailing list archives

From Eran Sevi <erans...@gmail.com>
Subject Re: Bug in StandardAnalyzer + StopAnalyzer?
Date Mon, 16 Nov 2009 08:04:28 GMT
I think it's always better to do what you can inside the code and not leave
it for the clients that call the method (that's just code duplication and a
potential source of errors if it's forgotten).
So if it's not complicated, I think the call to reset() on the chain of
filters should be the responsibility of each analyzer, not of the indexer/QP
and other consumers.

Changing the method name is also a good idea in my opinion.
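
To make the contract concrete, here is a toy model (plain Java, not Lucene's
real classes) of how a TokenFilter's reset() cascades down the chain, while a
Tokenizer's reset() does not reach the filters stacked above it:

```java
// Toy model of reset() propagation in a tokenizer/filter chain.
// "Tok" stands in for a Tokenizer; "Filter" for a TokenFilter whose
// reset() forwards to its input, the way TokenFilter.reset() does.
abstract class Stream {
  int resetCount = 0;
  void reset() { resetCount++; }
}

class Tok extends Stream { }

class Filter extends Stream {
  final Stream input;
  Filter(Stream input) { this.input = input; }
  @Override void reset() { super.reset(); input.reset(); } // cascade downstream
}

public class ResetDemo {
  public static void main(String[] args) {
    Tok tok = new Tok();
    Filter chain = new Filter(new Filter(tok)); // two stacked filters

    tok.reset();                           // resetting only the tokenizer...
    System.out.println(chain.resetCount);  // 0 -- the filters never saw it

    chain.reset();                         // resetting the top filter cascades
    System.out.println(tok.resetCount);    // 2 -- once directly, once via chain
  }
}
```

Calling reset() on the top of the chain reaches every stage; calling it on
the tokenizer alone leaves the filters untouched, which is why it matters
where the call lives.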

On Sun, Nov 15, 2009 at 8:33 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Uwe, I will throw another twist at it: I do not like the name of
> Tokenizer.reset(Reader) method.
> I wish it was called .setReader() or something else, I think it is
> confusing for Tokenizer to have .reset() and .reset(Reader), when the latter
> should hardly ever be overridden.
>
> Great example of how insane this is right now is StandardTokenizer:
>
>  @Override
>   public void reset(Reader reader) throws IOException {
>     super.reset(reader);
>     reset();
>   }
>
> LOL!
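
A sketch of the split Robert is suggesting (hypothetical names; setReader is
not an actual 2.9 method) might look like:

```java
// Hypothetical sketch of the naming split being proposed: a distinctly
// named setReader(Reader) attaches new input (and should rarely, if ever,
// be overridden), while reset() is reserved for clearing per-stream state.
import java.io.Reader;

abstract class SketchTokenizer {
  protected Reader input;

  // attach a new Reader for reuse; final so subclasses cannot
  // confuse it with per-stream state reset
  public final void setReader(Reader reader) {
    this.input = reader;
  }

  // clear per-stream state; subclasses override as needed
  public void reset() { }
}
```

With two distinct names, an implementation like StandardTokenizer's
reset(Reader)-that-calls-reset() would no longer read as circular.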
>
>
>  On Sun, Nov 15, 2009 at 12:45 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
>>  Yes, but on the other hand it does not hurt to automatically reset in the
>> analyzer... **krr** I do not know how to proceed. I think we should keep
>> it as it has been since the beginning of Lucene (call reset inside the
>> analyzer, QP) and document it correctly.
>>
>>
>>
>> You are right, at the beginning, BaseTokenStreamTestCase and many other
>> hardcoded tests did not call reset. Now, the test also calls end() and
>> close().
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>   ------------------------------
>>
>> *From:* Robert Muir [mailto:rcmuir@gmail.com]
>> *Sent:* Sunday, November 15, 2009 6:40 PM
>>
>> *To:* java-dev@lucene.apache.org
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> OK, at one point I do not think BaseTokenStreamTestCase did.
>>
>> If this is the case, then it's the consumer's responsibility to call reset(),
>> and we should remove the extra reset() calls inside reusableTokenStream() from
>> analyzers that have them... and probably improve the docs of this contract.
>>
>> On Sun, Nov 15, 2009 at 12:17 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>> Even QueryParser calls reset() as first call. Also BaseTokenStreamTestCase
>> does it.
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>    ------------------------------
>>
>> *From:* Robert Muir [mailto:rcmuir@gmail.com]
>> *Sent:* Sunday, November 15, 2009 6:14 PM
>>
>>
>> *To:* java-dev@lucene.apache.org
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Uwe, I'm not so sure it doesn't need to be there; what about other consumers
>> such as QueryParser?
>>
>> On Sun, Nov 15, 2009 at 12:02 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>> I checked again, reset() on the top filter does not need to be there, as
>> the indexer calls it automatically as the first call after
>> reusableTokenStream. For reusing, only reset(Reader) must be called. It's a
>> little bit strange that both methods have the same name; the reset(Reader)
>> one has a completely different meaning.
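
The consumer-side sequence Uwe describes (reset() as the first call after
reusableTokenStream, and, per the tests, end() and close() afterwards) can be
modeled with a toy stream (plain Java, not Lucene's API):

```java
// Toy model of the consumer contract under discussion: reset() before
// iterating, incrementToken() until exhausted, then end() and close().
// Not Lucene's real TokenStream; just the call-order skeleton.
import java.util.Iterator;
import java.util.List;

class ToyTokenStream {
  private final List<String> tokens;
  private Iterator<String> it;
  boolean ended = false, closed = false;

  ToyTokenStream(List<String> tokens) { this.tokens = tokens; }

  void reset() { it = tokens.iterator(); }  // consumer's first call
  boolean incrementToken() { return it.hasNext() && it.next() != null; }
  void end() { ended = true; }              // called after the last token
  void close() { closed = true; }           // release resources
}

public class ConsumerDemo {
  public static void main(String[] args) {
    ToyTokenStream ts = new ToyTokenStream(List.of("quick", "brown", "fox"));
    ts.reset();
    int count = 0;
    while (ts.incrementToken()) count++;
    ts.end();
    ts.close();
    System.out.println(count); // 3
  }
}
```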
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>    ------------------------------
>>
>> *From:* Uwe Schindler [mailto:uwe@thetaphi.de]
>> *Sent:* Sunday, November 15, 2009 5:56 PM
>>
>>
>> *To:* java-dev@lucene.apache.org
>>
>> *Subject:* RE: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> It should be there... but it's unimplemented in the TokenFilters used by
>> Standard/Stop Analyzer. But for consistency it should be there. I'll talk
>> with Robert Muir about it.
>>
>>
>>
>> Uwe
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>   ------------------------------
>>
>> *From:* Eran Sevi [mailto:eransevi@gmail.com]
>> *Sent:* Sunday, November 15, 2009 5:51 PM
>> *To:* java-dev@lucene.apache.org
>> *Subject:* Re: Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Good point, I missed that part :) Since only the tokenizer uses the
>> reader, we must call reset(Reader) on it directly.
>>
>> So was the reset() on the filteredTokenStream omitted on purpose because
>> there's no underlying implementation, or is it really missing?
>>
>> On Sun, Nov 15, 2009 at 6:30 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>>
>> It must call both reset() on the top-level TokenStream and reset(Reader) on
>> the Tokenizer. If the latter is not done, how would the TokenStream get
>> its new Reader?
>>
>>
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>   ------------------------------
>>
>> *From:* Eran Sevi [mailto:eransevi@gmail.com]
>> *Sent:* Sunday, November 15, 2009 5:19 PM
>> *To:* java-dev@lucene.apache.org
>> *Subject:* Bug in StandardAnalyzer + StopAnalyzer?
>>
>>
>>
>> Hi,
>> When changing my code to support the not-so-new reusableTokenStream, I
>> noticed that in the cases where a SavedStreams class is used in an analyzer
>> (Standard, Stop, and maybe others as well), the reset() method is called on
>> the tokenizer instead of on the filter.
>>
>> The filter implementation of reset() calls reset() on the inner filters and
>> input, but the tokenizer's reset() method can't do that.
>> I think this bug hasn't caused any errors so far, since none of the filters
>> used in the analyzers overrides the reset() method, but it might cause
>> problems if the implementation changes in the future.
>>
>> The fix is very simple. For example (in StandardAnalyzer):
>>
>> if (streams == null) {
>>   streams = new SavedStreams();
>>   setPreviousTokenStream(streams);
>>   streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>>   streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
>>   streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
>>   streams.filteredTokenStream = new StopFilter(
>>       StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>       streams.filteredTokenStream, stopSet);
>> } else {
>>   streams.tokenStream.reset(reader);
>> }
>>
>> should become:
>>
>> if (streams == null) {
>>   streams = new SavedStreams();
>>   setPreviousTokenStream(streams);
>>   streams.tokenStream = new StandardTokenizer(matchVersion, reader);
>>   streams.filteredTokenStream = new StandardFilter(streams.tokenStream);
>>   streams.filteredTokenStream = new LowerCaseFilter(streams.filteredTokenStream);
>>   streams.filteredTokenStream = new StopFilter(
>>       StopFilter.getEnablePositionIncrementsVersionDefault(matchVersion),
>>       streams.filteredTokenStream, stopSet);
>> } else {
>>   streams.filteredTokenStream.reset(); // changed line.
>> }
>>
>>
>> What do you think?
>>
>> Eran.
>>
>>
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
