lucene-dev mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Problem with CharStream and Tokenizers with custom reset(Reader) method
Date Fri, 11 Sep 2009 09:12:48 GMT
I do not know how this could affect Solr, but it could be the case.
Currently most Tokenizers do not use CharStreams at all. After LUCENE-1906
is committed, I think some additional work is also needed in Solr's custom
Tokenizers (the correctOffset method was changed).

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> From: Jason Rutherglen [mailto:jason.rutherglen@gmail.com]
> Sent: Friday, September 11, 2009 12:28 AM
> To: java-dev@lucene.apache.org
> Subject: Re: Problem with CharStream and Tokenizers with custom
> reset(Reader) method
> 
> I've been seeing strange behavior that is perhaps related to this.
> Sometimes, fairly randomly, a query is parsed and analyzed using Solr
> analyzers only up to its first clause, and other times the exact same
> query is parsed and analyzed to the full, correct query with all
> clauses. It's so baffling I haven't really figured out an
> approach to debugging it. I wonder if it's related to this
> stream resetting issue.
> 
> On Thu, Sep 10, 2009 at 7:54 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
> > When reviewing the new CharStream code added to Tokenizers, I found a
> > serious problem with backwards compatibility and with other Tokenizers
> > that do not override reset(CharStream).
> >
> > The problem is that, e.g., CharTokenizer only overrides reset(Reader):
> >
> >  public void reset(Reader input) throws IOException {
> >    super.reset(input);
> >    bufferIndex = 0;
> >    offset = 0;
> >    dataLen = 0;
> >  }
> >
> > If you reset such a Tokenizer with another CharStream (not a Reader),
> > this method will never be called, which breaks the whole Tokenizer.
> >
> > As CharStream extends Reader, I propose to remove this
> > reset(CharStream) method and simply do an instanceof check to detect
> > whether the supplied Reader is not a CharStream and, if so, wrap it. We
> > could also remove the extra ctor (because most Tokenizers have no
> > support for passing CharStreams). If the ctor also checks with
> > instanceof and wraps as needed, the code is backwards compatible and we
> > do not need to add additional ctors in subclasses.
> >
> > As this instanceof check is always done in CharReader.get(), why not
> > remove ctor(CharStream) and reset(CharStream) completely?
> >
> > Any thoughts?
> >
> > I would like to fix this somehow before RC4, I'm sorry :(
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: uwe@thetaphi.de
> >
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> 
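
The proposal quoted above (a single reset(Reader) entry point plus an
instanceof check that wraps plain Readers) could be sketched roughly as
follows. This is only a minimal, self-contained illustration and not
Lucene's actual code: every Sketch* class is a hypothetical stand-in, and
the identity offset mapping in the wrapper is an assumption made to keep
the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for CharStream: a Reader that can also correct offsets.
abstract class SketchCharStream extends Reader {
    abstract int correctOffset(int currentOff);
}

// Hypothetical stand-in for CharReader: adapts a plain Reader to the stream type.
final class SketchCharReader extends SketchCharStream {
    private final Reader in;

    private SketchCharReader(Reader in) { this.in = in; }

    // The instanceof check: reuse an existing stream, wrap anything else.
    static SketchCharStream get(Reader input) {
        return input instanceof SketchCharStream
                ? (SketchCharStream) input
                : new SketchCharReader(input);
    }

    // Assumption: a plain Reader needs no offset correction.
    @Override int correctOffset(int currentOff) { return currentOff; }

    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        return in.read(cbuf, off, len);
    }

    @Override public void close() throws IOException { in.close(); }
}

// Base class with one ctor and one reset method; both wrap as needed.
abstract class SketchTokenizer {
    protected SketchCharStream input;

    protected SketchTokenizer(Reader input) {
        this.input = SketchCharReader.get(input);
    }

    public void reset(Reader input) throws IOException {
        this.input = SketchCharReader.get(input);
    }
}

// Like CharTokenizer in the quoted mail, this subclass overrides only reset(Reader).
class SketchCharTokenizer extends SketchTokenizer {
    int bufferIndex, offset, dataLen;
    int resetCalls;

    SketchCharTokenizer(Reader input) { super(input); }

    @Override
    public void reset(Reader input) throws IOException {
        super.reset(input);
        bufferIndex = 0;
        offset = 0;
        dataLen = 0;
        resetCalls++;
    }
}

class ResetSketch {
    public static void main(String[] args) throws IOException {
        SketchCharTokenizer tok = new SketchCharTokenizer(new StringReader("first"));
        tok.bufferIndex = 7; // simulate leftover state from a previous run

        // Resetting with a stream (not a plain Reader) still reaches the override,
        // because there is no separate reset overload that could bypass it.
        SketchCharStream stream = SketchCharReader.get(new StringReader("second"));
        tok.reset(stream);

        System.out.println("bufferIndex after reset: " + tok.bufferIndex);
        System.out.println("stream reused without re-wrapping: " + (tok.input == stream));
    }
}
```

With only one reset method in the base class, a subclass that overrides
reset(Reader) has its state cleared no matter which kind of input is
passed, and subclasses need no extra ctors.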




