lucene-java-user mailing list archives

From Mindaugas Žakšauskas <min...@gmail.com>
Subject Re: How is incrementToken supposed to detect the lack of reset()?
Date Wed, 08 Jan 2014 10:37:39 GMT
For what it's worth, I ran into a similar problem, as have other
people [1]. In my project, I extend the Tokenizer class and use
another tokenizer (e.g. ClassicTokenizer) as a delegate.
Unfortunately, properly overriding all public/protected methods is
*not* enough, e.g.:

public void reset() throws IOException {
  super.reset();
  delegate.reset();
}

I was still getting the exception about a broken read()/close() contract.
Half a day and *lots* of debugging later, I realized that the exception is
thrown only when indexing the second document, because the delegate's reader
is internally replaced with ILLEGAL_STATE_READER after .close() is
called. My solution to this problem was to change the reset() method
to this:

public void reset() throws IOException {
  super.reset();
  delegate.setReader(input);
  delegate.reset();
}
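
To illustrate the failure mode without a Lucene dependency, here is a
minimal sketch using only the JDK. ModelTokenizer, WrappingTokenizer and
DelegateResetDemo are hypothetical stand-ins, not Lucene classes; they
assume only the behaviour described above (close() swaps the reader for
a sentinel that throws on read) and simplify everything else.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a reusable tokenizer; NOT Lucene's Tokenizer. */
class ModelTokenizer {
    /** Sentinel installed by close(); any read through it fails. */
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("reader used after close()");
        }
        @Override public void close() {}
    };

    protected Reader input;

    ModelTokenizer(Reader reader) { this.input = reader; }

    public final void setReader(Reader reader) { this.input = reader; }

    public void reset() throws IOException {}

    /** Returns the next whitespace-separated token, or null at end of input. */
    public String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = input.read()) != -1) {
            if (!Character.isWhitespace(c)) sb.append((char) c);
            else if (sb.length() > 0) break;
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    public void close() throws IOException {
        input.close();
        input = ILLEGAL_STATE_READER; // the delegate is left in this state
    }
}

/** Outer tokenizer that forwards the real work to a delegate. */
class WrappingTokenizer extends ModelTokenizer {
    private final ModelTokenizer delegate;
    private final boolean rebind; // hypothetical switch, for the demo only

    WrappingTokenizer(Reader reader, boolean rebind) {
        super(reader);
        this.delegate = new ModelTokenizer(this.input);
        this.rebind = rebind;
    }

    @Override public void reset() throws IOException {
        super.reset();
        if (rebind) delegate.setReader(input); // hand over the current reader
        delegate.reset();
    }

    @Override public String next() throws IOException { return delegate.next(); }

    @Override public void close() throws IOException {
        super.close();
        delegate.close();
    }
}

class DelegateResetDemo {
    static List<String> consume(ModelTokenizer t) throws IOException {
        List<String> tokens = new ArrayList<>();
        t.reset();
        for (String tok = t.next(); tok != null; tok = t.next()) tokens.add(tok);
        t.close();
        return tokens;
    }

    /** Reuses one tokenizer for two documents; reports the second outcome. */
    static String secondDocument(boolean rebind) {
        try {
            WrappingTokenizer t =
                new WrappingTokenizer(new StringReader("first doc"), rebind);
            consume(t);
            t.setReader(new StringReader("second doc")); // only the outer sees this
            return consume(t).toString();
        } catch (IllegalStateException e) {
            return "failed: " + e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondDocument(false)); // delegate keeps the sentinel
        System.out.println(secondDocument(true));  // delegate is rebound in reset()
    }
}
```

In this model the first document always succeeds, which is why a test
that only tokenizes once would not catch the problem.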

Another thing worth mentioning is that it's crucial to call
super.method() before delegate.method() in every overridden method.
It would be nice if all of this were documented in the Tokenizer Javadoc,
or even nicer if the base class were designed with delegation in mind
(Effective Java (2nd edition), Item 16).

Hope this helps somebody.

[1] http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673

Regards,
Mindaugas

On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies <benson@basistech.com> wrote:
> Yes I Do.
>
>
> On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
>> Benson, do you want to open an issue to fix this constructor to not
>> take Reader? (There might be one already, but let's make a new one.)
>>
>> These things are supposed to be reused, and have setReader for that
>> purpose. I think it's confusing and contributes to bugs that you have
>> to have logic in e.g. the ctor THEN ALSO in reset().
>>
>> If someone does it correctly in the ctor, but they only test "one
>> time", they might think everything is working.
>>
>> On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies <benson@basistech.com>
>> wrote:
>> > For the record of other people who implement tokenizers:
>> >
>> > Say that your tokenizer has a constructor, like:
>> >
>> >     public MyTokenizer(Reader reader, ....) {
>> >       super(reader);
>> >       myWrappedInputDevice = new MyWrappedInputDevice(reader);
>> >     }
>> >
>> > Not a good idea. Tokenizer carefully manages the data flow from the
>> > constructor arg to the 'input' field. The correct form is:
>> >
>> >     public MyTokenizer(Reader reader, ....) {
>> >       super(reader);
>> >       myWrappedInputDevice = new MyWrappedInputDevice(this.input);
>> >     }
>> >
>> >
>> >
>> > On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir <rcmuir@gmail.com> wrote:
>> >
>> >> See Tokenizer.java for the state machine logic. In general you should
>> >> not have to do anything if the tokenizer is well-behaved (e.g. close
>> >> calls super.close() and so on).
>> >>
>> >>
>> >>
>> >> On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies <bimargulies@gmail.com> wrote:
>> >> > In 4.6.0,
>> >> > org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
>> >> > fails if incrementToken fails to throw if there's a missing reset.
>> >> >
>> >> > How am I supposed to organize this in a Tokenizer? A quick look at
>> >> > CharTokenizer did not reveal any code for the purpose.
>> >> >
>> >> > ---------------------------------------------------------------------
>> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
>> >> >
