lucene-java-user mailing list archives

From Benson Margulies <ben...@basistech.com>
Subject Re: How is incrementToken supposed to detect the lack of reset()?
Date Wed, 08 Jan 2014 12:18:04 GMT
If you'd like to join in on the doc, see
https://github.com/apache/lucene-solr/pull/14/files. I'd be happy to grant
you access to push to my fork.


On Wed, Jan 8, 2014 at 5:37 AM, Mindaugas Žakšauskas <mindas@gmail.com> wrote:

> As a point of interest, I ran into a similar problem, as have other
> people [1]. In my project, I extend the Tokenizer class and use
> another tokenizer (e.g. ClassicTokenizer) as a delegate.
> Unfortunately, properly overriding all public/protected methods is
> *not* enough, e.g.:
>
> public void reset() throws IOException {
>   super.reset();
>   delegate.reset();
> }
>
> I was still getting the exception about the broken read()/close()
> contract. Half a day and *lots* of debugging later, I realized that
> the exception is thrown only when indexing the second document,
> because the delegate's reader is internally replaced with
> ILLEGAL_STATE_READER after .close() is called. My solution to this
> problem was to write the reset() method like this:
>
> public void reset() throws IOException {
>   super.reset();
>   delegate.setReader(input);
>   delegate.reset();
> }
>
> Another thing worth mentioning is that it's crucial to call
> super.method() before delegate.method() in all overridden methods.
> It would be nice if all of this were documented somewhere in the
> Tokenizer Javadoc, or even nicer if the base class were designed with
> delegation in mind (Effective Java (2nd edition), Item 16).
>
> Hope this helps somebody.
>
> [1]
> http://stackoverflow.com/questions/20624339/having-trouble-rereading-a-lucene-tokenstream/20630673#20630673
>
> Regards,
> Mindaugas
>
> On Tue, Jan 7, 2014 at 9:45 PM, Benson Margulies <benson@basistech.com>
> wrote:
> > Yes, I do.
> >
> >
> > On Tue, Jan 7, 2014 at 3:59 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >
> >> Benson, do you want to open an issue to fix this constructor to not
> >> take Reader? (There might be one already, but let's make a new one.)
> >>
> >> These things are supposed to be reused, and have setReader for that
> >> purpose. I think it's confusing, and contributes to bugs, that you
> >> have to have logic in e.g. the ctor THEN ALSO in reset().
> >>
> >> If someone does it correctly in the ctor, but they only test "one
> >> time", they might think everything is working.
> >>
> >> On Tue, Jan 7, 2014 at 3:23 PM, Benson Margulies <benson@basistech.com>
> >> wrote:
> >> > For the record, for other people who implement tokenizers:
> >> >
> >> > Say that your tokenizer has a constructor, like:
> >> >
> >> >     public MyTokenizer(Reader reader, ....) {
> >> >         super(reader);
> >> >         myWrappedInputDevice = new MyWrappedInputDevice(reader);
> >> >     }
> >> >
> >> > Not a good idea. Tokenizer carefully manages the data flow from the
> >> > constructor arg to the 'input' field. The correct form is:
> >> >
> >> >     public MyTokenizer(Reader reader, ....) {
> >> >         super(reader);
> >> >         myWrappedInputDevice = new MyWrappedInputDevice(this.input);
> >> >     }
> >> >
> >> >
> >> >
> >> > On Tue, Jan 7, 2014 at 2:59 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >> >
> >> >> See Tokenizer.java for the state machine logic. In general you should
> >> >> not have to do anything if the tokenizer is well-behaved (e.g. close
> >> >> calls super.close() and so on).
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Jan 7, 2014 at 2:50 PM, Benson Margulies <bimargulies@gmail.com> wrote:
> >> >> > In 4.6.0,
> >> >> > org.apache.lucene.analysis.BaseTokenStreamTestCase#checkResetException
> >> >> > fails if incrementToken does not throw when reset() has not been
> >> >> > called.
> >> >> >
> >> >> > How am I supposed to organize this in a Tokenizer? A quick look at
> >> >> > CharTokenizer did not reveal any code for the purpose.
> >> >> >
> >> >> >
> >> >> > ---------------------------------------------------------------------
> >> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >> >> >
>
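
The contract discussed in this thread can be sketched with plain java.io
classes. What follows is a minimal model, not Lucene's actual code:
ModelTokenizer, DelegatingTokenizer, ResetContractDemo, and next() are
hypothetical names standing in for Tokenizer, a delegating subclass, a
driver, and incrementToken(). Only ILLEGAL_STATE_READER, setReader(),
reset(), and close() echo names used above. The model assumes the base
class parks a new reader until reset() and swaps in a throwing sentinel on
close(), so a missing reset() is detected at the first read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Simplified stand-in for a reusable tokenizer; an assumption, not Lucene's class. */
class ModelTokenizer {
    /** Sentinel: any read before reset() or after close() fails loudly. */
    static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("reset() was not called before reading");
        }
        @Override public void close() {}
    };

    protected Reader input = ILLEGAL_STATE_READER;       // what token production reads
    private Reader inputPending = ILLEGAL_STATE_READER;  // parked until reset()

    ModelTokenizer(Reader reader) { this.inputPending = reader; }

    void setReader(Reader reader) { this.inputPending = reader; }

    void reset() throws IOException {
        input = inputPending;                 // promote the parked reader
        inputPending = ILLEGAL_STATE_READER;
    }

    /** Hypothetical stand-in for incrementToken(): one char per "token", null at end. */
    String next() throws IOException {
        int c = input.read();                 // throws via the sentinel if reset() was skipped
        return c < 0 ? null : String.valueOf((char) c);
    }

    void close() throws IOException {
        input.close();
        input = ILLEGAL_STATE_READER;         // forces setReader() + reset() before reuse
        inputPending = ILLEGAL_STATE_READER;
    }
}

/** Wrapper that forwards token production to a delegate. */
class DelegatingTokenizer extends ModelTokenizer {
    // Design choice for this sketch: the delegate only ever receives a reader in reset().
    private final ModelTokenizer delegate = new ModelTokenizer(ILLEGAL_STATE_READER);

    DelegatingTokenizer(Reader reader) { super(reader); }

    @Override void reset() throws IOException {
        super.reset();               // first: our own pending reader becomes 'input'
        delegate.setReader(input);   // then: hand the live reader to the delegate
        delegate.reset();
    }

    @Override String next() throws IOException { return delegate.next(); }

    @Override void close() throws IOException {
        super.close();
        delegate.close();
    }
}

class ResetContractDemo {
    public static void main(String[] args) throws IOException {
        DelegatingTokenizer t = new DelegatingTokenizer(new StringReader("ab"));
        try {
            t.next();                              // reset() deliberately skipped
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
        t.reset();
        System.out.println(t.next() + t.next());   // first document
        t.close();

        t.setReader(new StringReader("cd"));       // reuse for a second document
        t.reset();
        System.out.println(t.next() + t.next());
        t.close();
    }
}
```

In this model, dropping the delegate.setReader(input) line from reset()
leaves the delegate holding the sentinel, so reads fail with
IllegalStateException, which is the same kind of failure as the one
described for the second document above.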
