lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Why release 3.0?
Date Tue, 17 Nov 2009 02:57:28 GMT
Completely ignoring the difficulty, I would propose fixing everything to
correspond with the Java 1.5 Unicode version, for consistency.
I would exempt StandardTokenizer, because it's completely inside our control;
we can fix it at our leisure.

For the rest of this stuff, it's already a 'change in runtime behavior' when
moving from 1.4 to 1.5, even though we didn't touch the code.
I would suggest making this a one-time pain for the users so they don't have
to do it again in 3.1.
For CharTokenizer, this means adding the deprecations, plus the reflection
(and caching for the reflection) that Uwe did to make TokenStream fast and
work like this,
and mucking with the complicated I/O buffering logic mentioned before.
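
To make the int-based callback idea concrete, here is a minimal sketch. It is not Lucene's CharTokenizer: the class and the `tokenize` helper are hypothetical names, and only the `java.lang.Character` and `String` code point methods (available since Java 1.5) are real APIs. It shows why a `char`-based `isTokenChar` cannot see a supplementary character, while an `int`-based one can.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not Lucene's CharTokenizer.
final class CodePointTokenizerSketch {

    // int-based callback: receives a whole code point, supplementary or not.
    static boolean isTokenChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    // int-based normalization callback.
    static int normalize(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // Old-style char-based callback, kept only for comparison: it is handed
    // one UTF-16 code unit at a time, so it only ever sees half of a pair.
    static boolean isTokenCharLegacy(char c) {
        return Character.isLetter(c);
    }

    // Splits text into runs of token characters, iterating by code point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advance 1 or 2 chars
            if (isTokenChar(cp)) {
                current.appendCodePoint(normalize(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+20000 is a CJK ideograph outside the BMP (surrogate pair D840 DC00).
        String text = "Ab \uD840\uDC00";
        System.out.println(tokenize(text).size());
        System.out.println(isTokenCharLegacy(text.charAt(3)));
    }
}
```

With the `char`-based callback, each half of the surrogate pair is rejected as a non-letter, so the ideograph is dropped; the code point loop keeps it as its own token.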


For the other side, I'll tell you what I have done in practice.
I usually say there is no way in hell I will refactor some existing
codebase to support supplementary characters.
And I find a way to isolate just Chinese, support it for only that language,
and leave the other stuff broken.

I'm not really sure that is the appropriate way to go for Apache Lucene, but
I felt it was fair to at least give that perspective.
Even if we did that, the non-Chinese users would still need to reindex anyway,
and for nothing (no real gain: they still wouldn't have Unicode 4 support,
just different behavior).

On Mon, Nov 16, 2009 at 9:47 PM, Mark Miller <markrmiller@gmail.com> wrote:

> So what's your best recommendation? Ignoring the difficulty and just
> considering what's best for users?
>
> Robert Muir wrote:
> > Well, in all honesty there is a bit of complexity.
> > I leave the StandardTokenizer out of this; it gives the same results
> > regardless of JVM version.
> > It may not be correct, but it's consistent; we could wait till 5.0 or
> > 10.0 to make it correct :)
> > Also, because it gives the same results regardless of JVM version, we
> > can actually use the Version logic to improve it, as Uwe showed.
> >
> > The rest of it is where it gets nasty.
> > Fixing the Simple/StopAnalyzer is actually the worst, because we have
> > to deprecate the isTokenChar(char) and normalize(char) callbacks in
> > favor of int-based versions.
> > We also have to fix the I/O buffering logic present in, for example,
> > CharTokenizer, which just does things like refill a buffer of size
> > 4096 without checking to ensure it doesn't break a surrogate pair.
> >
> > and then we have contrib...!
> >
> > So you see why I ask about 'index backwards compatibility': I
> > don't consider it actually working between 2.9->3.0 anyway, and adding
> > that on top of fixing this stuff and ensuring API backwards compat
> > is especially nasty.
> >
> >
> >
> >     Always depends though. This double index thing you mention is
> >     nasty (3.0 and 3.1 for the unfortunate). I'd swallow a few careful
> >     deprecations in 3.0 to avoid that with my vote.
> >
> >     --
> >     - Mark
> >
> >     http://www.lucidimagination.com
> >
> >
> >
> >
> >     ---------------------------------------------------------------------
> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
> >
> >
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
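
As a rough illustration of the buffering problem described in the quoted message (refilling a fixed-size buffer without checking whether a surrogate pair gets split), here is one possible approach, sketched under assumptions. The class, its fields, and the `chunks` helper are hypothetical and are not Lucene code; the idea is simply to hold back a trailing high surrogate and hand it out at the start of the next refill.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not Lucene's buffering code.
final class SurrogateSafeBufferSketch {
    private final Reader input;
    private final char[] buffer;
    private char pendingHigh;   // high surrogate held back from the last refill
    private boolean hasPending;

    SurrogateSafeBufferSketch(Reader input, int size) {
        if (size < 2) {
            throw new IllegalArgumentException("size must be at least 2");
        }
        this.input = input;
        this.buffer = new char[size];
    }

    // Refills the buffer; returns the number of usable chars, or -1 at end of input.
    int refill() throws IOException {
        int offset = 0;
        if (hasPending) {
            buffer[offset++] = pendingHigh; // start with the held-back half
            hasPending = false;
        }
        while (offset < buffer.length) {
            int n = input.read(buffer, offset, buffer.length - offset);
            if (n == -1) {
                break;
            }
            offset += n;
        }
        if (offset == 0) {
            return -1;
        }
        // A full buffer ending in a high surrogate may have cut a pair in two:
        // keep that char for the next refill instead of exposing it now.
        if (offset == buffer.length && Character.isHighSurrogate(buffer[offset - 1])) {
            pendingHigh = buffer[--offset];
            hasPending = true;
        }
        return offset;
    }

    // Convenience helper: returns the contents of each refill as a string.
    static List<String> chunks(String text, int size) {
        SurrogateSafeBufferSketch b =
            new SurrogateSafeBufferSketch(new StringReader(text), size);
        List<String> out = new ArrayList<String>();
        try {
            int n;
            while ((n = b.refill()) != -1) {
                out.add(new String(b.buffer, 0, n));
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }
}
```

For example, `SurrogateSafeBufferSketch.chunks("abc\uD840\uDC00d", 4)` would split a naive 4-char refill in the middle of the pair; here the high surrogate is deferred so both halves arrive together in the second chunk.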


-- 
Robert Muir
rcmuir@gmail.com
