From: Robert Muir
Date: Mon, 16 Nov 2009 22:30:47 -0500
Subject: Re: Why release 3.0?
To: java-dev@lucene.apache.org
Message-ID: <8f0ad1f30911161930x369a30fdp8fe888c13516263f@mail.gmail.com>
In-Reply-To: <8f0ad1f30911161857j3b62d1b7m3db52b84a3fe888c@mail.gmail.com>

actually i thought about this. i change my story.

deprecating anything is stupid, because it's still not back compatible,
e.g. Character.isLetter(char) even returns different results now, even if
we invoke it.

hard break is the only solution.

we should have done this deprecation in 2.9, but it's chicken-and-egg,
could not do it because you need java 5 to support unicode 4.

On Mon, Nov 16, 2009 at 9:57 PM, Robert Muir <rcmuir@gmail.com> wrote:

> completely ignoring the difficulty, I would propose to fix everything to
> correspond with the java 1.5 unicode version, for consistency.
> I would exempt StandardTokenizer, because it's completely inside our
> control. we can fix it at our leisure.
>
> for the rest of this stuff, it's already a 'change in runtime behavior'
> when moving from 1.4 to 1.5, even though we didn't touch code.
> i would suggest making this a one-time pain for the users so they don't
> have to do it again in 3.1
> this means for CharTokenizer adding the deprecations and reflection and
> caching for the reflection that Uwe did to make TokenStream fast and work
> like this.
> and mucking with complicated i/o buffering logic as mentioned before.
>
> For the other side, I'll tell you what I have done in practice.
> I usually say, there is no way in hell I will refactor some existing
> codebase to support suppl. characters.
> And i find a way to isolate just chinese, support it for only that
> language, and leave the other stuff broken.
>
> I'm not really sure that is the appropriate way to go for apache lucene,
> but I felt it was fair to at least give that perspective.
> Even if we did that, the non-chinese users still need to reindex anyway,
> except for nothing (no real gain, they still don't have unicode 4 support,
> just different behavior).
>
> On Mon, Nov 16, 2009 at 9:47 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>> So what's your best recommendation? Ignoring the difficulty and just
>> considering what's best for users?
>>
>> Robert Muir wrote:
>> > well, in all honesty there is a bit of complexity.
>> > i leave the StandardTokenizer out of this, it gives the same results
>> > regardless of JVM version.
>> > it may not be correct, but it's consistent, we could wait till 5.0 or
>> > 10.0 to make it correct :)
>> > Also, because it gives the same results regardless of JVM version, we
>> > can actually use the Version logic to improve it, as Uwe showed.
>> >
>> > The rest of it is where it gets nasty,
>> > Fixing the Simple/StopAnalyzer is actually the worst, because we have
>> > to deprecate the isTokenChar(char) and normalize(char) callbacks in
>> > favor of int-based versions.
>> > We also have to fix this i/o buffering logic present in for example,
>> > CharTokenizer, which just does things like refill a buffer of size
>> > 4096 without checking to ensure it doesn't break a surrogate pair.
>> >
>> > and then we have contrib...!
>> >
>> > so you see why i ask about 'index backwards compatibility', because I
>> > don't consider it actually working between 2.9->3.0 anyway, and adding
>> > that on top of fixing this stuff, and ensuring API backwards compat,
>> > that's especially nasty.
>> >
>> > >     Always depends though. This double index thing you mention is
>> > >     nasty (3.0 and 3.1 for the unfortunate). I'd swallow a few
>> > >     careful deprecations in 3.0 to avoid that with my vote.
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org

--
Robert Muir
rcmuir@gmail.com
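
The two problems named in the thread (a fixed-size buffer refill that can split a surrogate pair, and char-based callbacks that cannot see supplementary characters) can be illustrated with a minimal sketch. This is not Lucene's CharTokenizer code. The class name SurrogateSafeRefill and the methods fill, countLetters, and countLettersCharBased are hypothetical names made up for illustration; only java.io.Reader and the java.lang.Character methods (available since Java 5) are real APIs.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration; not taken from Lucene.
final class SurrogateSafeRefill {

    /**
     * Reads up to 'capacity' chars into buf. If the last char read is a high
     * surrogate, reads one extra char so the pair is not split across refills.
     * buf must hold at least capacity + 1 chars. Returns the number of chars
     * filled, or -1 at end of input.
     */
    static int fill(Reader reader, char[] buf, int capacity) throws IOException {
        int len = 0;
        while (len < capacity) {
            int n = reader.read(buf, len, capacity - len);
            if (n == -1) {
                break;
            }
            len += n;
        }
        if (len == 0) {
            return -1;
        }
        if (Character.isHighSurrogate(buf[len - 1])) {
            int next = reader.read();
            if (next != -1) {
                buf[len++] = (char) next;
            }
        }
        return len;
    }

    /** Counts letters per code point, using the int-based Character.isLetter(int). */
    static int countLetters(Reader reader, int capacity) throws IOException {
        char[] buf = new char[capacity + 1];
        int letters = 0;
        int len;
        while ((len = fill(reader, buf, capacity)) != -1) {
            for (int i = 0; i < len; ) {
                int cp = Character.codePointAt(buf, i, len);
                if (Character.isLetter(cp)) {
                    letters++;
                }
                i += Character.charCount(cp);
            }
        }
        return letters;
    }

    /** Counts letters per char; each half of a surrogate pair is tested alone. */
    static int countLettersCharBased(String s) {
        int letters = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                letters++;
            }
        }
        return letters;
    }

    public static void main(String[] args) throws IOException {
        // 'a', U+20000 (a supplementary CJK ideograph), 'b'
        String s = "a" + new String(Character.toChars(0x20000)) + "b";
        // A tiny capacity of 2 forces the refill boundary to land mid-pair.
        System.out.println(countLetters(new StringReader(s), 2));
        System.out.println(countLettersCharBased(s));
    }
}
```

The int-based count treats the supplementary ideograph as one letter, while the char-based count tests each surrogate half separately and misses it. That difference is why an int-based isTokenChar would produce different tokens, and so a different index, for the same text.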