Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Received-SPF: pass (athena.apache.org: domain of rcmuir@gmail.com designates
 209.85.160.46 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:from:date:message-id:subject:to
         :content-type;
        b=v9bQnV2chgeBajaAs5O3LKOp1YUXRCVo+g2fmHNzQwetqk1WNZUklhJGIXSS2KK+4O
         9qBlNALcuKPQfEizCP7b3H5EnSH0tV9lDTlR+ii7GTZ8rdVDwmvSLM1MpodXWafO7MM5
         7RURQ7fL0CzmGPmeerMptw9cB4miG4yHjEBRY=
MIME-Version: 1.0
In-Reply-To: <8f0ad1f30911161857j3b62d1b7m3db52b84a3fe888c@mail.gmail.com>
References: <359a92830911161010s2b04fe80s3c8b69b522518ca8@mail.gmail.com>
	<8f0ad1f30911161543n344957eobc5dcb88d14eb85b@mail.gmail.com>
	<D452C3FA-F2DA-4A49-9F95-177DA26B0A8D@gmail.com>
 <8f0ad1f30911161653vbebd9c3ma896c2572e38590f@mail.gmail.com>
	<0F7CC1FA-3913-4FCC-B78E-28D2F887C693@gmail.com>
 <8f0ad1f30911161825u719a960fm4a371755dfdd9f38@mail.gmail.com>
	<4B020AA3.30304@gmail.com>
 <8f0ad1f30911161844q3bc54362nc464e75090e42995@mail.gmail.com>
	<4B020ECE.2090802@gmail.com>
 <8f0ad1f30911161857j3b62d1b7m3db52b84a3fe888c@mail.gmail.com>
From: Robert Muir <rcmuir@gmail.com>
Date: Mon, 16 Nov 2009 22:30:47 -0500
Message-ID: <8f0ad1f30911161930x369a30fdp8fe888c13516263f@mail.gmail.com>
Subject: Re: Why release 3.0?
To: java-dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=0016e64b95e0b0899b047888c12c

--0016e64b95e0b0899b047888c12c
Content-Type: text/plain; charset=UTF-8

actually i thought about this, and i'm changing my story.

deprecating anything is stupid, because it's still not backwards compatible:
e.g. Character.isLetter(char) itself returns different results on java 5 than
it did on 1.4, even if we keep invoking it.
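
to make the char vs. int point concrete, here is a minimal sketch (the class
and method names are made up for illustration, this is not Lucene code). it
counts letters once per char, the way a char-based callback sees the text,
and once per code point. for a supplementary character such as U+20000 the
two disagree, because each surrogate half on its own is not a letter.

```java
// Hypothetical demo class; the names are illustrative and not part of Lucene.
final class CodePointLetterCheck {

    // Counts letters one UTF-16 code unit at a time, as a char-based callback would.
    static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Counts letters by code point, so a surrogate pair is classified as one character.
    static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp); // advance 1 or 2 chars
        }
        return n;
    }

    public static void main(String[] args) {
        // U+20000 is a CJK ideograph outside the BMP, stored as two chars.
        String s = "a" + new String(Character.toChars(0x20000));
        System.out.println(countLettersByChar(s));      // 1: the surrogates are not letters
        System.out.println(countLettersByCodePoint(s)); // 2
    }
}
```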

hard break is the only solution.

we should have done this deprecation in 2.9, but it's chicken-and-egg: we could
not do it then, because you need java 5 to support unicode 4.
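
on the buffering side (the CharTokenizer refill quoted below), a rough sketch
of one way to avoid cutting a surrogate pair at the buffer edge: keep one
slot of the buffer in reserve and, if a refill ends on a high surrogate, read
one more char. this is only an illustration under that assumption.
SurrogateSafeRefill, fill and chunks are hypothetical names, and this is not
what CharTokenizer actually does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not Lucene code.
final class SurrogateSafeRefill {

    // Fills buf from the reader and returns the number of chars stored, or -1 at
    // end of input. The last slot of buf is held in reserve so that a trailing
    // high surrogate can be completed with its low surrogate.
    static int fill(Reader in, char[] buf) throws IOException {
        if (buf.length < 2) {
            throw new IllegalArgumentException("buffer needs at least 2 chars");
        }
        int n = in.read(buf, 0, buf.length - 1);
        if (n > 0 && Character.isHighSurrogate(buf[n - 1])) {
            int next = in.read(); // try to fetch the other half of the pair
            if (next != -1) {
                buf[n++] = (char) next;
            }
        }
        return n;
    }

    // Convenience for the example: returns the text as the sequence of refills.
    static List<String> chunks(String text, int bufSize) {
        List<String> out = new ArrayList<>();
        char[] buf = new char[bufSize];
        try (Reader in = new StringReader(text)) {
            for (int n = fill(in, buf); n > 0; n = fill(in, buf)) {
                out.add(new String(buf, 0, n));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

with a 4-char buffer, chunks("ab" + U+20000 + "c", 4) keeps the pair together
in the first chunk instead of splitting it after the high surrogate.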

On Mon, Nov 16, 2009 at 9:57 PM, Robert Muir <rcmuir@gmail.com> wrote:

> completely ignoring the difficulty, I would propose to fix everything to
> correspond with the java 1.5 unicode version, for consistency.
> I would exempt StandardTokenizer, because it's completely under our
> control. we can fix it at our leisure.
>
> for the rest of this stuff, it's already a 'change in runtime behavior' when
> moving from 1.4 to 1.5, even though we didn't touch code.
> i would suggest making this a one-time pain for the users so they don't have
> to do it again in 3.1.
> for CharTokenizer, this means adding the deprecations, plus reflection and
> caching for the reflection, like Uwe did to make TokenStream fast and work
> like this.
> and mucking with complicated i/o buffering logic as mentioned before.
>
>
> For the other side, I'll tell you what I have done in practice.
> I usually say there is no way in hell I will refactor some existing
> codebase to support supplementary characters.
> And I find a way to isolate just Chinese, support it for only that
> language, and leave the other stuff broken.
>
> I'm not really sure that is the appropriate way to go for Apache Lucene,
> but I felt it was fair to at least give that perspective.
> Even if we did that, the non-Chinese users would still need to reindex
> anyway, and for nothing (no real gain: they still don't have unicode 4
> support, just different behavior).
>
>
> On Mon, Nov 16, 2009 at 9:47 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>> So what's your best recommendation? Ignoring the difficulty and just
>> considering what's best for users?
>>
>> Robert Muir wrote:
>> > well, in all honesty there is a bit of complexity.
>> > i leave the StandardTokenizer out of this, because it gives the same
>> > results regardless of JVM version.
>> > it may not be correct, but it's consistent; we could wait till 5.0 or
>> > 10.0 to make it correct :)
>> > Also, because it gives the same results regardless of JVM version, we
>> > can actually use the Version logic to improve it, as Uwe showed.
>> >
>> > The rest of it is where it gets nasty.
>> > Fixing the Simple/StopAnalyzer is actually the worst, because we have
>> > to deprecate the isTokenChar(char) and normalize(char) callbacks in
>> > favor of int-based versions.
>> > We also have to fix the i/o buffering logic present in, for example,
>> > CharTokenizer, which just does things like refill a buffer of size
>> > 4096 without checking to ensure it doesn't break a surrogate pair.
>> >
>> > and then we have contrib...!
>> >
>> > so you see why i ask about 'index backwards compatibility', because I
>> > don't consider it actually working between 2.9->3.0 anyway, and adding
>> > that on top of fixing this stuff, and ensuring API backwards compat,
>> > that's especially nasty.
>> >
>> >
>> >
>> >     Always depends though. This double index thing you mention is
>> >     nasty (3.0 and 3.1 for the unfortunate). I'd swallow a few
>> >     careful deprecations in 3.0 to avoid that with my vote.
>> >
>> >     --
>> >     - Mark
>> >
>> >     http://www.lucidimagination.com
>> >
>> >
>> >
>> >
>> >
>> >     ---------------------------------------------------------------------
>> >     To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> >     For additional commands, e-mail: java-dev-help@lucene.apache.org
>> >
>> >
>> >
>> >
>> > --
>> > Robert Muir
>> > rcmuir@gmail.com
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>


-- 
Robert Muir
rcmuir@gmail.com

--0016e64b95e0b0899b047888c12c--