Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Received-SPF: pass (nike.apache.org: domain of mbennett.ideaeng@gmail.com
 designates 209.85.212.182 as permitted sender)
MIME-Version: 1.0
Sender: mbennett.ideaeng@gmail.com
In-Reply-To: <0FEFB41EAFE84639A2521DE2A8AED6B5@JackKrupansky>
References: <1574581308.74146.1347527827817.JavaMail.jiratomcat@arcas>
 <alpine.DEB.2.02.1209131522190.415@frisbee>
 <CAOdYfZVw3557kXo3FsHJcXt-xXqkW-iKW2T8F929PoAepsN5KA@mail.gmail.com>
 <CA+NxCsMSqvg-WY3o6WcjbbfYVfiG3QVPK_d0k0QEpepkHHzUyg@mail.gmail.com>
 <7C09B40B9AF7446190E18140406F076D@JackKrupansky>
 <CAOdYfZXz4wMkZawhyZ6X5QOih4odG3te-nRitXr6X5oRUKUWtQ@mail.gmail.com>
 <CA+NxCsN75KcAtER2Hktto2UkZdNGmLac-scRdyhVfnTVc7fSVw@mail.gmail.com>
 <CAOdYfZVbFTAz7uYjvu-e+okYFWJsAhuQ6dAwBKnBGrLvSEm7ZA@mail.gmail.com>
 <CA+NxCsNNzMfQe3ixuV+LcjLfzDG-0UzYWOyskJSQ8C978uVUdw@mail.gmail.com>
 <007001cdbf8a$ab0891b0$0119b510$@thetaphi.de>
 <99C11B58493E4B59B8A8C61774C725D5@JackKrupansky>
 <009a01cdbffa$90bd0b90$b23722b0$@thetaphi.de>
 <0FEFB41EAFE84639A2521DE2A8AED6B5@JackKrupansky>
From: Mark Bennett <mbennett@ideaeng.com>
Date: Sun, 11 Nov 2012 10:34:25 -0800
Message-ID: 
 <CA+NxCsNzD9oN1CmPGje2zBKtAAi43OmQBPrYHR2DgsmhCNKkRg@mail.gmail.com>
Subject: Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira] [Commented]
 (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
To: dev@lucene.apache.org
Content-Type: multipart/alternative; boundary=90e6ba614138985f0804ce3c7101

--90e6ba614138985f0804ce3c7101
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi Jack,

The project uses both a bit of Lucene and then Solr.  So as you suggest,
I'd need a way to configure Solr to use the older fuzzy matching by
default, and I'm not aware of such an option.

The formula to go between percentage and integer edit distance is here:
    http://stackoverflow.com/questions/6087281
Some code:
    3.6.1
contrib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.java
    4.0.0
suggest/src/java/org/apache/lucene/search/spell/LevensteinDistance.java
    4.0.0
suggest/src/java/org/apache/lucene/search/spell/LuceneLevenshteinDistance.java
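
To make the percentage-to-edits conversion concrete, here is a minimal
sketch. It assumes the old similarity was computed as
1 - edits / min(queryLength, termLength), which is my reading of the 3.x
scoring and may not match every version. The class and method names are
made up for illustration and are not Lucene APIs.

```java
final class FuzzySimilaritySketch {

    /** Plain Levenshtein distance (insert, delete, substitute), two-row DP. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed old-style similarity: 1 - edits / length of the shorter string. */
    public static float similarity(String query, String term) {
        int shorter = Math.min(query.length(), term.length());
        if (shorter == 0) {
            return 0f; // assumption: treat an empty string as "no similarity"
        }
        return 1f - (float) editDistance(query, term) / shorter;
    }

    /** Approximate edit budget implied by a minimum similarity (uncapped). */
    public static int toMaxEdits(float minSimilarity, int termLength) {
        return (int) ((1.0 - minSimilarity) * termLength);
    }

    public static void main(String[] args) {
        System.out.println(editDistance("lucene", "lucine"));
        System.out.println(similarity("lucene", "lucine"));
        System.out.println(toMaxEdits(0.5f, 6));
    }
}
```

Note that the sketch does not cap the result. With a minimum similarity
of 0.5, a 6-character term already implies a budget of 3 edits, which is
beyond the limit of 2 discussed in this thread.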

--
Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Sun, Nov 11, 2012 at 7:18 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

> Okay, so maybe this is simply a case where "an adjustment" was made to
> Lucene and Solr did not make a corresponding "adjustment" to compensate
> and "preserve" functionality. Solr users cannot easily override factory
> methods, but of course the Solr query parser can and probably should.
>
> I'm not able to determine for sure whether Mark's comments were directed
> strictly at Lucene or Solr. I mean, for a lot of people there isn't really
> supposed to be a distinction between the two.
>
> So, maybe we need a Solr Jira to the effect of adding a parameter such as
> "fuzzy.maxEdit" which defaults to "2", but could be set higher to access
> "slow" fuzzy query. And, maybe that default should also be keyed off of the
> schema version or Lucene version so that an "old" Solr app would see no
> loss of function, although there is the issue that old queries used a float
> correlation factor rather than an int edit distance.
>
> Is there a reliable formula for converting between the old correlation
> factor and the new edit distance, or at least a reasonable approximation?
> That should be in the Javadoc. I suppose it could be approximated, but
> there are edge cases where the new approach provides more discrimination
> than the old approach, such as for very short terms.
>
> -- Jack Krupansky
>
>  *From:* Uwe Schindler <uwe@thetaphi.de>
> *Sent:* Sunday, November 11, 2012 2:51 AM
> *To:* dev@lucene.apache.org
> *Subject:* RE: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]
> [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
>
>
> Yes, if you override the factory method in the classic QueryParser. The
> functionality is still available; you just have to use the classes
> directly.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> *From:* Jack Krupansky [mailto:jack@basetechnology.com]
> *Sent:* Sunday, November 11, 2012 6:14 AM
> *To:* dev@lucene.apache.org
> *Subject:* Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]
> [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
>
> "we did not remove functionality"
>
> Are you saying that full-featured "classic" fuzzy query is still available
> in the Lucene query parser? By default? Or via what option?
>
> -- Jack Krupansky
>
> *From:* Uwe Schindler <uwe@thetaphi.de>
> *Sent:* Saturday, November 10, 2012 1:30 PM
> *To:* dev@lucene.apache.org
> *Subject:* RE: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]
> [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
>
> Just use SlowFuzzyQuery from contrib; I really don't understand where the
> issue is. The code is in the sandbox/query module and is available in
> Lucene 4.0. There is no reason to complain here; we did not remove
> functionality.
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> *From:* mbennett.ideaeng@gmail.com [mailto:mbennett.ideaeng@gmail.com]
> *On Behalf Of *Mark Bennett
> *Sent:* Saturday, November 10, 2012 10:18 PM
> *To:* dev@lucene.apache.org
> *Subject:* Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]
> [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
>
> Hi guys,
>
> Not expecting to change minds, but found Robert's last email helpful, so
> wanted to try one more round.
>
> On Fri, Nov 9, 2012 at 5:32 PM, Robert Muir <rcmuir@gmail.com> wrote:
>
> > ...
> > This is some analysis chain configuration issue.
>
> Interesting, so you would expect that the seed term *would* go through
> analysis before it finds the variants in the index?  If it's supposed to
> work that way then I can recheck my config.  (It wasn't just lowercase;
> that was just an example.)
>
> > If it doesn't work with 100M documents, i don't want it in lucene.
>
> Ah, this is very illuminating.  For scalability, big data, etc, that
> certainly makes sense.
>
> But there are many important Intranet search applications that have far
> fewer than 100M docs, but still need the fine-grained control of
> solr/lucene.  Intranet projects in the 35k to 2M doc range often have even
> more precise indexing, filtering and faceting requirements, and solr/lucene
> provides that fine blade.
>
> Wouldn't it be more constructive to pick some number, say 100M, and give
> that the "big data" moniker?  Then, perhaps for things that are not that
> scalable, have some separate area/label but still retain them.  Discarding
> all use cases < 100M seems draconian.
>
> > I would have the same opinion if someone wanted unscalable solutions
> > for scoring w/ language models (e.g. not happy with smoothing for
> > unknown probabilities), or if someone claimed that spatial queries
> > should do slow things because they don't currently support
> > interplanetary distances, and so on.
>
>
> On Fri, Nov 9, 2012 at 7:52 PM, Mark Bennett <mbennett@ideaeng.com> wrote:
> > Hi Robert,
> >
> > I acknowledge your "-1" vote, and I'm guessing that your objection is
> > maybe 70% "scalability", and only 30% use-case?
> >
> > The older Levenstein stuff has been around for a long time, scalable or
> > not, and already in real systems.
> >
> > You seem to have a very "binary" view of code being "in" or "out".  Is
> > there any room in your world-view of code for "gray code", unsupported,
> > incubator, what-have-you?  Maybe analogous to people who jailbreak their
> > iPhones or something?
> >
> > You're an important part of the community, and working at Lucid, etc.,
> > and clearly concerned about software quality.  When smart folks like you
> > have such sharp opinions I do try to ponder them against my own
> > circumstances.
> >
> > And on the quality of the old code, was it just the scalability, or were
> > there other concerns such as stability, coding style, or possibly
> > inconsistent results?
> >
> > Isn't the sandbox and admonished reference in Java docs sufficient?
> >
> > I'm harping on this because I'm really between a rock and hard place, and
> > also posted another question.
> >
> > Just trying to understand your very strong opinions, and I thank you for
> > your patience in this matter.  This issue is either going to fix or
> > break my weekend / next-deliverable.
> >
> > Sincere thanks,
> > Mark
> >
> >
> > --
> > Mark Bennett / New Idea Engineering, Inc. / mbennett@ideaeng.com
> > Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
> >
> >
> > On Fri, Nov 9, 2012 at 4:37 PM, Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> I'm -1 for having unscalable shit in lucene's core. This query should
> >> have never been added.
> >>
> >> I don't care if a few people complain because they aren't using
> >> lowercasefilter or some other insanity. Fix your analysis chain. I
> >> don't have any sympathy.
> >>
> >> On Fri, Nov 9, 2012 at 7:35 PM, Jack Krupansky
> >> <jack@basetechnology.com> wrote:
> >> > +1 for permitting a choice of fuzzy query implementation.
> >> >
> >> > I agree that we want a super-fast fuzzy query for simple variations,
> >> > but I also agree that we should have the option to trade off speed
> >> > for function.
> >> >
> >> > But I am also sympathetic to assuring that any core Lucene features
> >> > be as performant as possible.
> >> >
> >> > Ultimately, if there was a single fuzzy query implementation that did
> >> > everything for everybody all of the time, that would be the way to go,
> >> > but if choices need to be made to satisfy competing goals, we should
> >> > support going that route.
> >> >
> >> > -- Jack Krupansky
> >> >
> >> > From: Mark Bennett
> >> > Sent: Friday, November 09, 2012 3:48 PM
> >> > To: dev@lucene.apache.org
> >> > Subject: Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]
> >> > [Commented] (LUCENE-2667) Fix FuzzyQuery's defaults, so its fast.
> >> >
> >> > Hi Robert,
> >> >
> >> > On Thu, Sep 13, 2012 at 7:39 PM, Robert Muir <rcmuir@gmail.com>
> >> > wrote:
> >> >>
> >> >> ...
> >> >> ... I'm strongly against having this
> >> >> unscalable garbage in lucene's core.
> >> >>
> >> >> There is no use case for ed > 2, thats just crazy.
> >> >
> >> >
> >> > I promise you there ARE use cases for edit distances > 2, especially
> >> > with longer words.  Due to NDA I can't go into details.
> >> >
> >> > Also ed>2 can be useful when COMBINING that low-quality part of the
> >> > search with other sub-queries, or additional business rules.  Maybe
> >> > instead of boiling an ocean this lets you just boil the sea.  ;-)
> >> >
> >> > I won't comment on the quality of the older Levenstein code, or the
> >> > likely very slow performance, nor where the code should live, etc.
> >> >
> >> > But your statement about "no use case for ed > 2" is simply not true.
> >> > (whether you'd agree with any of them or not is certainly another
> >> > matter)
> >> >
> >> > I understand your concerns about not having it be the default.  (or
> >> > maybe having a giant warning message or something, whatever)
> >> >
> >> >> --
> >> >> lucidworks.com
> >> >>
> >> >> ---------------------------------------------------------------------
> >> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >> >>
> >> >
> >>
> >>
> >
>

--90e6ba614138985f0804ce3c7101
Content-Type: text/html; charset=windows-1252
Content-Transfer-Encoding: quoted-printable

Hi Jack,<br><br>The project uses both a bit of Lucene and then Solr.=A0 So =
as you suggest, I&#39;d need a way to configure Solr to use the older fuzzy=
 matching by default, and I&#39;m not aware of such an option.<br><br>The f=
ormula to go between percentage and integer edit distance is here:<br>

=A0=A0=A0 <a href=3D"http://stackoverflow.com/questions/6087281">http://sta=
ckoverflow.com/questions/6087281</a><br>Some code:<br>=A0=A0=A0 3.6.1 contr=
ib/spellchecker/src/java/org/apache/lucene/search/spell/LevensteinDistance.=
java<br>=A0=A0=A0 4.0.0 suggest/src/java/org/apache/lucene/search/spell/Lev=
ensteinDistance.java<br>

=A0=A0=A0 4.0.0 suggest/src/java/org/apache/lucene/search/spell/LuceneLeven=
shteinDistance.java<br clear=3D"all"><br>--<br>Mark Bennett / New Idea Engi=
neering, Inc. / <a href=3D"mailto:mbennett@ideaeng.com">mbennett@ideaeng.co=
m</a><br>

Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513<br>
<br><br><div class=3D"gmail_quote">On Sun, Nov 11, 2012 at 7:18 AM, Jack Kr=
upansky <span dir=3D"ltr">&lt;<a href=3D"mailto:jack@basetechnology.com" ta=
rget=3D"_blank">jack@basetechnology.com</a>&gt;</span> wrote:<br><blockquot=
e class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc sol=
id;padding-left:1ex">


<div dir=3D"ltr" vlink=3D"purple" link=3D"blue" lang=3D"DE">
<div dir=3D"ltr">
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;">
<div>Okay, so maybe this is simply a case where =93an adjustment=94 was mad=
e to=20
Lucene and Solr did not make a corresponding =93adustment=94 to compensate =
to=20
=93preserve=94 functionality. Solr users cannot easily override factory met=
hods, but=20
of course the Solr query parser can and probably should.</div>
<div>=A0</div>
<div>I=92m not able to determine for sure whether Mark=92s comments were di=
rected=20
strictly at Lucene or Solr. I mean, for a lot of people there isn=92t reall=
y=20
supposed to be a distinction between the two.</div>
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;">=A0</div>
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;">So, maybe=20
we need a Solr Jira to the effect of adding a parameter such as =93fuzzy.ma=
xEdit=94=20
which defaults to =932=94, but could be set higher to access =93slow=94 fuz=
zy query.=20
And, maybe that default should also be keyed off of the schema version or L=
ucene=20
version so that an =93old=94 Solr app would see no loss of function, althou=
gh there=20
is the issue that old queries used a float correlation factor rather than a=
n int=20
edit distance.</div>
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;">=A0</div>
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;">Is there a=20
reliable formula for converting from the old correlation factor and the new=
 edit=20
distance or at least a reasonable approximation? That should be in the Java=
doc.=20
I suppose it could be approximated, but there are edge cases where the new=
=20
approach provides more discrimination that the old approach, such as for ve=
ry=20
short terms.</div>
<div style=3D"font-size:12pt;font-family:&#39;Calibri&#39;"><br>-- Jack=20
Krupansky</div>
<div style=3D"font-size:small;font-style:normal;text-decoration:none;font-f=
amily:&#39;Calibri&#39;;display:inline;font-weight:normal">
<div style=3D"FONT:10pt tahoma">
<div>=A0</div>
<div style=3D"BACKGROUND:#f5f5f5">
<div><b>From:</b> <a title=3D"uwe@thetaphi.de" href=3D"mailto:uwe@thetaphi.=
de" target=3D"_blank">Uwe Schindler</a> </div>
<div><b>Sent:</b> Sunday, November 11, 2012 2:51 AM</div><div><div class=3D=
"h5">
<div><b>To:</b> <a title=3D"dev@lucene.apache.org" href=3D"mailto:dev@lucen=
e.apache.org" target=3D"_blank">dev@lucene.apache.org</a> </div>
<div><b>Subject:</b> RE: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [ji=
ra]=20
[Commented] (LUCENE-2667) Fix FuzzyQuery&#39;s defaults, so its=20
fast.</div></div></div></div></div>
<div>=A0</div></div><div><div class=3D"h5">
<div style=3D"font-size:small;font-style:normal;text-decoration:none;font-f=
amily:&#39;Calibri&#39;;display:inline;font-weight:normal">
<div>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt" lang=3D"EN-US">Yes, if you over=
ride the factory method in the classic QueryParser.=20
The functionality is still available, you have to just use the classes=20
directly.<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt" lang=3D"EN-US"><u></u><u></u></=
span>=A0</p>
<div>
<div>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">-----<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">Uwe=20
Schindler<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">H.-H.-Meier-Allee=20
63, D-28213 Bremen<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt"><a href=3D"http://www.thetaphi.=
de/" target=3D"_blank"><span style=3D"COLOR:blue">http://www.thetaphi.de</s=
pan></a><u></u><u></u></span></p>


<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">eMail:=20
<a href=3D"mailto:uwe@thetaphi.de" target=3D"_blank">uwe@thetaphi.de</a><u>=
</u><u></u></span></p></div></div>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt"><u></u><u></u></span>=A0</p>
<div style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:blue 1.5pt solid;PADDIN=
G-BOTTOM:0cm;PADDING-LEFT:4pt;PADDING-RIGHT:0cm;BORDER-TOP:medium none;BORD=
ER-RIGHT:medium none;PADDING-TOP:0cm">
<div>
<div style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:medium none;PADDING-BOT=
TOM:0cm;PADDING-LEFT:0cm;PADDING-RIGHT:0cm;BORDER-TOP:#b5c4df 1pt solid;BOR=
DER-RIGHT:medium none;PADDING-TOP:3pt">
<p class=3D"MsoNormal"><b><span style=3D"FONT-FAMILY:&#39;Tahoma&#39;,&#39;=
sans-serif&#39;;FONT-SIZE:10pt">From:</span></b><span style=3D"FONT-FAMILY:=
&#39;Tahoma&#39;,&#39;sans-serif&#39;;FONT-SIZE:10pt"> Jack Krupansky=20
[mailto:<a href=3D"mailto:jack@basetechnology.com" target=3D"_blank">jack@b=
asetechnology.com</a>] <br><b>Sent:</b> Sunday, November 11, 2012 6:14=20
AM<br><b>To:</b> <a href=3D"mailto:dev@lucene.apache.org" target=3D"_blank"=
>dev@lucene.apache.org</a><br><b>Subject:</b> Re: FuzzyQuery vs=20
SlowFuzsyQuery docs? -- was: Re: [jira] [Commented] (LUCENE-2667) Fix=20
FuzzyQuery&#39;s defaults, so its fast.<u></u><u></u></span></p></div></div=
>
<p class=3D"MsoNormal"><u></u><u></u>=A0</p>
<div>
<div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-family:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;">=93</span><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;=
sans-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">we=20
did not remove functionality</span><span style=3D"font-family:&#39;Calibri&=
#39;,&#39;sans-serif&#39;">=94<u></u><u></u></span></p></div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-family:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;">=A0<u></u><u></u></span></p></div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-family:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;">Are you saying that=20
full-featured =93classic=94 fuzzy query is still available in the Lucene qu=
ery=20
parser? By default? Or via what option?<u></u><u></u></span></p></div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-family:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;"><br>-- Jack=20
Krupansky<u></u><u></u></span></p></div>
<div>
<div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-size:10pt;font-family:&#39;Tahom=
a&#39;,&#39;sans-serif&#39;">=A0<u></u><u></u></span></p></div>
<div>
<div>
<p style=3D"BACKGROUND:whitesmoke" class=3D"MsoNormal"><b><span style=3D"fo=
nt-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-serif&#39;">From:</span=
></b><span style=3D"font-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-s=
erif&#39;"> <a title=3D"uwe@thetaphi.de" href=3D"mailto:uwe@thetaphi.de" ta=
rget=3D"_blank">Uwe Schindler</a>=20
<u></u><u></u></span></p></div>
<div>
<p style=3D"BACKGROUND:whitesmoke" class=3D"MsoNormal"><b><span style=3D"fo=
nt-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-serif&#39;">Sent:</span=
></b><span style=3D"font-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-s=
erif&#39;">=20
Saturday, November 10, 2012 1:30 PM<u></u><u></u></span></p></div>
<div>
<p style=3D"BACKGROUND:whitesmoke" class=3D"MsoNormal"><b><span style=3D"fo=
nt-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-serif&#39;">To:</span><=
/b><span style=3D"font-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-ser=
if&#39;"> <a title=3D"dev@lucene.apache.org" href=3D"mailto:dev@lucene.apac=
he.org" target=3D"_blank">dev@lucene.apache.org</a>=20
<u></u><u></u></span></p></div>
<div>
<p style=3D"BACKGROUND:whitesmoke" class=3D"MsoNormal"><b><span style=3D"fo=
nt-size:10pt;font-family:&#39;Tahoma&#39;,&#39;sans-serif&#39;">Subject:</s=
pan></b><span style=3D"font-size:10pt;font-family:&#39;Tahoma&#39;,&#39;san=
s-serif&#39;"> RE:=20
FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira] [Commented] (LUCENE-2=
667)=20
Fix FuzzyQuery&#39;s defaults, so its fast.<u></u><u></u></span></p></div><=
/div></div>
<div>
<p class=3D"MsoNormal"><span style=3D"font-family:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;">=A0<u></u><u></u></span></p></div></div>
<div>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt" lang=3D"EN-US">Just use SlowFuz=
zyQuery from contrib, I really don=92t understand where=20
the issue is? The code is available in the sandbox/query module and is avai=
lable=20
in Lucene 4.0. There is no reason to complain here, we did not remove=20
functionality.<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style>=A0</span><span style=3D"FONT-FAMILY:=
9;Calibri&#39;,&#39;sans-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt" lang=3D"E=
N-US"><u></u><u></u></span></p>
<div>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">-----<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">Uwe=20
Schindler<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">H.-H.-Meier-Allee=20
63, D-28213 Bremen<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt"><a href=3D"http://www.thetaphi.=
de/" target=3D"_blank">http://www.thetaphi.de</a><u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style=3D"FONT-FAMILY:&#39;Calibri&#39;,&#39;sa=
ns-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt">eMail:=20
<a href=3D"mailto:uwe@thetaphi.de" target=3D"_blank">uwe@thetaphi.de</a><u>=
</u><u></u></span></p></div>
<p class=3D"MsoNormal"><span style>=A0</span><span style=3D"FONT-FAMILY:=
9;Calibri&#39;,&#39;sans-serif&#39;;COLOR:#1f497d;FONT-SIZE:11pt"><u></u><u=
></u></span></p>
<div style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:blue 1.5pt solid;PADDIN=
G-BOTTOM:0cm;PADDING-LEFT:4pt;PADDING-RIGHT:0cm;BORDER-TOP:medium none;BORD=
ER-RIGHT:medium none;PADDING-TOP:0cm">
<div>
<div style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:medium none;PADDING-BOT=
TOM:0cm;PADDING-LEFT:0cm;PADDING-RIGHT:0cm;BORDER-TOP:#b5c4df 1pt solid;BOR=
DER-RIGHT:medium none;PADDING-TOP:3pt">
<p class=3D"MsoNormal"><b><span style=3D"font-size:10pt;font-family:&#39;Ta=
homa&#39;,&#39;sans-serif&#39;">From:</span></b><span style=3D"font-size:10=
pt;font-family:&#39;Tahoma&#39;,&#39;sans-serif&#39;"> <a href=3D"mailto:mb=
ennett.ideaeng@gmail.com" target=3D"_blank">mbennett.ideaeng@gmail.com</a> =
[<a href=3D"mailto:mbennett.ideaeng@gmail.com" target=3D"_blank">mailto:mbe=
nnett.ideaeng@gmail.com</a>]=20
<b>On Behalf Of </b>Mark Bennett<br><b>Sent:</b> Saturday, November 10, 201=
2=20
10:18 PM<br><b>To:</b> <a href=3D"mailto:dev@lucene.apache.org" target=3D"_=
blank">dev@lucene.apache.org</a><br><b>Subject:</b>=20
Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira] [Commented]=20
(LUCENE-2667) Fix FuzzyQuery&#39;s defaults, so its=20
fast.<u></u><u></u></span></p></div></div>
<p class=3D"MsoNormal"><span style>=A0<u></u><u></u></span></p>
<p style=3D"MARGIN-BOTTOM:12pt" class=3D"MsoNormal"><span style>Hi=20
guys,<br><br>Not expecting to change minds, but found Robert&#39;s last ema=
il=20
helpful, so wanted to try one more round.<u></u><u></u></span></p>
<div>
<p class=3D"MsoNormal"><span style>On Fri, Nov 9, 2012 at 5:32 PM,=20
Robert Muir &lt;<a href=3D"mailto:rcmuir@gmail.com" target=3D"_blank">rcmui=
r@gmail.com</a>&gt; wrote:<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span style>...<br>This is some analysis chain=20
configuration issue.<u></u><u></u></span></p>
<div>
<p class=3D"MsoNormal"><span style>=A0<u></u><u></u></span></p></div>
<div>
<p class=3D"MsoNormal"><span style>Interesting, so you would expect=20
that the seed term *would* go through analysis before it finds the variants=
 in=20
the index?=A0 If it&#39;s supposed to work that way then I can recheck my=
=20
config.=A0 (it wasn&#39;t just lowercase, that was just an=20
example)<br>=A0<u></u><u></u></span></p></div>
<blockquote style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:#cccccc 1pt soli=
d;PADDING-BOTTOM:0cm;MARGIN:5pt 0cm 5pt 4.8pt;PADDING-LEFT:6pt;PADDING-RIGH=
T:0cm;BORDER-TOP:medium none;BORDER-RIGHT:medium none;PADDING-TOP:0cm">
  <p class=3D"MsoNormal"><span style>If it doesn&#39;t work with 100M=20
  documents, i don&#39;t want it in lucene.<u></u><u></u></span></p></block=
quote>
<div>
<p class=3D"MsoNormal"><span style>=A0<u></u><u></u></span></p></div>
<div>
<p class=3D"MsoNormal"><span style>Ah, this is very=20
illuminating.=A0 For scalability, big data, etc, that certainly makes=20
sense.<br><br>But there are many important Intranet search applications tha=
t=20
have far less than 100M docs, but still need the fine-grained control of=20
solr/lucene.=A0 Intranet projects in the 35k to 2M doc range often have eve=
n=20
more precise indexing, filtering and faceting requirements, and solr/lucene=
=20
provides that fine blade.<br><br>Wouldn&#39;t it be more constructive to pi=
ck some=20
number, say 100M, and give that the &quot;big data&quot; moniker.=A0 Then, =
perhaps for=20
things are not that scalable, have some separate area/label but still retai=
n=20
them.=A0 Discarding all use cases &lt; 100M seems=20
draconian.<br><br>=A0<u></u><u></u></span></p></div>
<blockquote style=3D"BORDER-BOTTOM:medium none;BORDER-LEFT:#cccccc 1pt soli=
d;PADDING-BOTTOM:0cm;MARGIN:5pt 0cm 5pt 4.8pt;PADDING-LEFT:6pt;PADDING-RIGH=
T:0cm;BORDER-TOP:medium none;BORDER-RIGHT:medium none;PADDING-TOP:0cm">
  <p class=3D"MsoNormal"><span style><br>I would have the same=20
  opinion if someone wanted unscalable solutions<br>for scoring w/ language=
=20
  models (e.g. not happy with smoothing for<br>unknown probabilities), or i=
f=20
  someone claimed that spatial queries<br>should do slow things because the=
y=20
  don&#39;t currently support<br>interplanetary distances, and so=20
  on.<u></u><u></u></span></p>
  <div>
  <div>
  <p style=3D"MARGIN-BOTTOM:12pt" class=3D"MsoNormal"><span style><font fac=
e=3D"Calibri"></font><br>On Fri, Nov 9, 2012 at 7:52=20
  PM, Mark Bennett &lt;<a href=3D"mailto:mbennett@ideaeng.com" target=3D"_b=
lank">mbennett@ideaeng.com</a>&gt; wrote:<br>&gt;=20
  Hi Robert,<br>&gt;<br>&gt; I acknowledge your &quot;-1&quot; vote, and I&=
#39;m guessing that=20
  your objection is maybe<br>&gt; 70% &quot;scalability&quot;, and only 30%=
=20
  use-case?<br>&gt;<br>&gt; The older Levenstein stuff has been around for =
a=20
  long time, scalable or not,<br>&gt; and already in real=20
  systems.<br>&gt;<br>&gt; You seem to have a very &quot;binary&quot; on co=
de being &quot;in&quot;=20
  or &quot;out&quot;.=A0 Is there any<br>&gt; room in your world-view of co=
de for &quot;gray=20
  code&quot;, unsupported, incubator,<br>&gt; what-have-you?=A0 Maybe anala=
gous to=20
  people who jailbreak their iPhones or<br>&gt; something?<br>&gt;<br>&gt;=
=20
  You&#39;re an important part of the community, and working at Lucid, etc.=
,=20
  and<br>&gt; clearly concerned about software quality.=A0 When smart folks=
=20
  like you have<br>&gt; such sharp opinions I do try to ponder them against=
 my=20
  own circumstances.<br>&gt;<br>&gt; And on the quality of the old code, wa=
s it=20
  just the scalability, or were<br>&gt; there other concerns such as stabil=
ity,=20
  coding style, or possibly<br>&gt; inconsistent results?<br>&gt;<br>&gt; I=
sn&#39;t=20
  the sandbox and admonished reference in Java docs sufficient?<br>&gt;<br>=
&gt;=20
  I&#39;m harping on this because I&#39;m really between a rock and hard pl=
ace,=20
  and<br>&gt; also posted another question.<br>&gt;<br>&gt; Just trying to=
=20
  understand your very strong opinions, and I thank you for<br>&gt; your=20
  patience in this matter.=A0 This issue is either going to fix or break=20
  my<br>&gt; weekend / next-deliverble.<br>&gt;<br>&gt; Sincere thanks,<br>=
&gt;=20
  Mark<br>&gt;<br>&gt;<br>&gt; --<br>&gt; Mark Bennett / New Idea Engineeri=
ng,=20
  Inc. / <a href=3D"mailto:mbennett@ideaeng.com" target=3D"_blank">mbennett=
@ideaeng.com</a><br>&gt;=20
  Direct: <a href=3D"tel:408-733-0387" target=3D"_blank">408-733-0387</a> /=
 Main: 866-IDEA-ENG /=20
  Cell: <a href=3D"tel:408-829-6513" target=3D"_blank">408-829-6513</a><br>=
&gt;<br>&gt;<br>&gt; On=20
  Fri, Nov 9, 2012 at 4:37 PM, Robert Muir &lt;<a href=3D"mailto:rcmuir@gma=
il.com" target=3D"_blank">rcmuir@gmail.com</a>&gt;=20
  wrote:<br>&gt;&gt;<br>&gt;&gt; I&#39;m -1 for having unscalable shit in l=
ucene&#39;s=20
  core. This query should<br>&gt;&gt; have never been=20
  added.<br>&gt;&gt;<br>&gt;&gt; I don&#39;t care if a few people complain =
because=20
  they aren&#39;t using<br>&gt;&gt; lowercasefilter or some other insanity.=
 Fix your=20
  analysis chain. I<br>&gt;&gt; don&#39;t have any sympathy.<br>&gt;&gt;<br=
>&gt;&gt;=20
  On Fri, Nov 9, 2012 at 7:35 PM, Jack Krupansky &lt;<a href=3D"mailto:jack=
@basetechnology.com" target=3D"_blank">jack@basetechnology.com</a>&gt;<br>&=
gt;&gt;=20
  wrote:<br>&gt;&gt; &gt; +1 for permitting a choice of fuzzy query=20
  implementation.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; I agree that we want a=
=20
  super-fast fuzzy query for simple variations, but<br>&gt;&gt; &gt;=20
  I<br>&gt;&gt; &gt; also agree that we should have the option to trade off=
=20
  speed for<br>&gt;&gt; &gt; function.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Bu=
t I=20
  am also sympathetic to assuring that any core Lucene features be<br>&gt;&=
gt;=20
  &gt; as<br>&gt;&gt; &gt; performant as possible.<br>&gt;&gt; &gt;<br>&gt;=
&gt;=20
  &gt; Ultimately, if there was a single fuzzy query implementation that=20
  did<br>&gt;&gt; &gt; everything for everybody all of the time, that would=
 be=20
  the way to go,<br>&gt;&gt; &gt; but<br>&gt;&gt; &gt; if choices need to b=
e=20
  made to satisfy competing goals, we should support<br>&gt;&gt; &gt; going=
 that=20
  route.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; -- Jack Krupansky<br>&gt;&gt;=20
  &gt;<br>&gt;&gt; &gt; From: Mark Bennett<br>&gt;&gt; &gt; Sent: Friday,=
=20
  November 09, 2012 3:48 PM<br>&gt;&gt; &gt; To: <a href=3D"mailto:dev@luce=
ne.apache.org" target=3D"_blank">dev@lucene.apache.org</a><br>&gt;&gt; &gt;=
=20
  Subject: Re: FuzzyQuery vs SlowFuzsyQuery docs? -- was: Re: [jira]<br>&gt=
;&gt;=20
  &gt; [Commented] (LUCENE-2667) Fix FuzzyQuery&#39;s defaults, so its=20
  fast.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Hi Robert,<br>&gt;&gt;=20
  &gt;<br>&gt;&gt; &gt; On Thu, Sep 13, 2012 at 7:39 PM, Robert Muir &lt;<a=
 href=3D"mailto:rcmuir@gmail.com" target=3D"_blank">rcmuir@gmail.com</a>&gt=
; wrote:<br>&gt;&gt;=20
  &gt;&gt;<br>&gt;&gt; &gt;&gt; ...<br>&gt;&gt; &gt;&gt; ... I&#39;m strong=
ly=20
  against having this<br>&gt;&gt; &gt;&gt; unscalable garbage in lucene&#39=
;s=20
  core.<br>&gt;&gt; &gt;&gt;<br>&gt;&gt; &gt;&gt; There is no use case for =
ed=20
  &gt; 2, thats just crazy.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &=
gt; I=20
  promise you there ARE use cases for edit distances &gt; 2,=20
  especially<br>&gt;&gt; &gt; with<br>&gt;&gt; &gt; longer words.=A0 Due to=
=20
  NDA I can&#39;t go into details.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; Also e=
d&gt;2=20
  can be useful when COMBINING that low-quality part of the<br>&gt;&gt; &gt=
;=20
  search<br>&gt;&gt; &gt; with other sub-queries, or additional business=20
  rules.=A0 Maybe instead of<br>&gt;&gt; &gt; boiling an ocean this lets yo=
u=20
  just boil the sea.=A0 ;-)<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; I won&#39;t c=
omment=20
  on the quality of the older Levenstein code, or the<br>&gt;&gt; &gt;=20
  likely<br>&gt;&gt; &gt; very slow performance, nor where the code should =
live,=20
  etc.<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; But your statement about &quot;no =
use case=20
  for ed &gt; 2&quot; is simply not true.<br>&gt;&gt; &gt; (whether you&#39=
;d agree with=20
  any of them or not is certainly another<br>&gt;&gt; &gt; matter)<br>&gt;&=
gt;=20
  &gt;<br>&gt;&gt; &gt; I understand your concerns about not having it be t=
he=20
  default.=A0 (or<br>&gt;&gt; &gt; maybe<br>&gt;&gt; &gt; having a giant=20
  warning message or something, whatever)<br>&gt;&gt; &gt;<br>&gt;&gt; &gt;=
&gt;=20
  --<br>&gt;&gt; &gt;&gt; <a href=3D"http://lucidworks.com" target=3D"_blan=
k">lucidworks.com</a><br>&gt;&gt; &gt;&gt;<br>&gt;&gt; &gt;&gt;=20
  ---------------------------------------------------------------------<br>=
&gt;&gt;=20
  &gt;&gt; To unsubscribe, e-mail: <a href=3D"mailto:dev-unsubscribe@lucene=
.apache.org" target=3D"_blank">dev-unsubscribe@lucene.apache.org</a><br>&gt=
;&gt;=20
  &gt;&gt; For additional commands, e-mail: <a href=3D"mailto:dev-help@luce=
ne.apache.org" target=3D"_blank">dev-help@lucene.apache.org</a><br>&gt;&gt;=
=20
  &gt;&gt;<br>&gt;&gt; &gt;<br>&gt;&gt;<br>&gt;&gt;=20
  <br>&gt;<=
br><br><u></u><u></u></span></p>=
</div></div></blockquote></div>
<p class=3D"MsoNormal"><span style>=A0<u></u><u></u></span></p></div></div>=
</div></div></div></div></div></div></div></div></div></div>
</blockquote></div><br>

--90e6ba614138985f0804ce3c7101--