Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Date: Mon, 11 Mar 2002 14:15:23 -0800
From: Brian Goetz <brian@quiotix.com>
To: Lucene Developers List <lucene-dev@jakarta.apache.org>
Subject: Re: Normalization
Message-ID: <20020311141523.A5351@lx.quiotix.com>
References: <007b01c1c93f$930bc7f0$0600a8c0@spaghetti>
 <000701c1c948$a3d7ed30$1701000a@toronto>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To: <000701c1c948$a3d7ed30$1701000a@toronto>;
 from murzaku@earthlink.net on Mon, Mar 11, 2002 at 05:03:51PM -0500

> As I have said before in this list, this gets way off of Lucene. The
> normalizer, or the morphologic analyzer or the phonetic transducer, or
> the stemmer, or the thesaurus -- they all could be stand-alone products.

I've got to disagree with you here on two points.  

1.  Lucene's architecture is all about flexibility and plug-ins.
Rodrigo's proposal is entirely consistent with that -- offering better
tools for building Analyzers.  (Contrast this with some of the
proposals that have been flying for building crawlers and such --
those truly are off the mark as tools to put INTO Lucene.)

2.  The vast majority of users will use one of the provided analyzers
(SimpleAnalyzer, StandardAnalyzer).  Fair or not, Lucene will be
judged on how well it does on typical documents using the "default"
tools.  Right now, the default tools are unnecessarily weak.

> I used to make such products many years ago and there are companies that
> still sell such tools (e.g. inXight). I like the way Lucene is now: the
> included analyzer/filter could be used as-is but also allows everyone to
> use whatever else they need. One could use the German or Porter stemmer,
> but anyone could easily use other analyzers as well (for example all the
> languages Snowball offers). This is fine as long as Lucene remains a
> library.

My understanding of Rodrigo's idea (filtered through my own view of
the project philosophy) is that he's proposing an "Analyzer
Construction Kit".  That seems like a great idea to me, and while we
could say "put it in /contrib", it really does seem like the sort of
thing we want to have.  
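
To make "Analyzer Construction Kit" a little more concrete, here is a
rough sketch of the kind of thing I have in mind: small reusable stages
chained into one analyzer. This is plain Java with made-up names
(AnalyzerKit, lowerCase, stopWords). None of these are Lucene classes,
and Rodrigo's actual design may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: AnalyzerKit and its helpers are NOT Lucene classes.
public class AnalyzerKit {
    // Each stage rewrites the whole token list; stages run in the order added.
    private final List<Function<List<String>, List<String>>> stages = new ArrayList<>();

    public AnalyzerKit add(Function<List<String>, List<String>> stage) {
        stages.add(stage);
        return this;
    }

    // Split on runs of non-letter, non-digit characters, then apply every stage.
    public List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        for (Function<List<String>, List<String>> stage : stages) {
            tokens = stage.apply(tokens);
        }
        return tokens;
    }

    // Stage that lower-cases every token.
    public static Function<List<String>, List<String>> lowerCase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Stage that drops any token found in the given stop list.
    public static Function<List<String>, List<String>> stopWords(String... words) {
        final Set<String> stop = new HashSet<>(Arrays.asList(words));
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stop.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }
}
```

With this sketch, new AnalyzerKit().add(AnalyzerKit.lowerCase())
.add(AnalyzerKit.stopWords("the", "a")) turns "The Quick fox" into
[quick, fox]. Order matters: the stop list here is lower-case, so the
lower-casing stage has to run first.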

> As Brian says, what matters is to keep the analyzers synchronized
> between indexing and searching. Is there a way to force this?

Having it generate Analyzer source code seems like a pretty good way
to me.  
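
Generated source is one way to get there. A different way to
illustrate the same concern (not Rodrigo's generator, and not Lucene's
API) is to bind a single analysis function to both the indexing and the
search path, so the two cannot drift apart. A minimal sketch, with
hypothetical names throughout:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch: a toy inverted index, NOT a Lucene class.
public class SharedAnalyzerIndex {
    private final Function<String, List<String>> analyzer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // The analyzer is fixed at construction and used for documents and queries alike.
    public SharedAnalyzerIndex(Function<String, List<String>> analyzer) {
        this.analyzer = analyzer;
    }

    public void add(int docId, String text) {
        for (String term : analyzer.apply(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Returns the ids of documents containing every analyzed query term.
    public Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : analyzer.apply(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // Example analyzer: split on non-letters/digits and lower-case each piece.
    public static List<String> lowerCaseWords(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                out.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

Because the query goes through the same lowerCaseWords function as the
documents, a search for "LUCENE" finds a document indexed as "Lucene
rocks". Had the query skipped the lower-casing step, the term "LUCENE"
would not match the indexed term "lucene".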

> I rather prefer changes to the core engine 

Is this a change to the core engine, or an additional tool that can
be plugged into the engine?  I think the latter.

> that accommodate all/many
> possible "normalizations" like what Joanne Sproston contributed some
> months ago i.e. the possibility to return more than one word for a
> filtered word and store them in the same document position (useful for
> synonyms and for agglutinative languages like Finnish, Turkish etc.)

That's a change to the core.  Might be useful, but is a more intrusive
change.
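
For anyone following along, the idea being discussed amounts to
letting a filter emit extra tokens that share a position with the
original word. A rough sketch of just that idea, assuming a made-up
Token type with an explicit position field (this is not the contributed
patch, and not Lucene's Token class):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these types are NOT Lucene classes.
public class SynonymPositions {
    public static final class Token {
        public final String text;
        public final int position;

        public Token(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    // Emit each word at its own position, followed by its synonyms at that same position.
    public static List<Token> expand(List<String> words, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        for (int pos = 0; pos < words.size(); pos++) {
            String word = words.get(pos);
            out.add(new Token(word, pos));
            for (String syn : synonyms.getOrDefault(word, Collections.<String>emptyList())) {
                out.add(new Token(syn, pos));
            }
        }
        return out;
    }
}
```

Given the words [fast, car] and a synonym map fast -> [quick], this
yields fast@0, quick@0, car@1, so a phrase query for "quick car" would
see the two terms as adjacent. Supporting this in a real index means the
storage and query code must tolerate several terms at one position,
which is why it touches the core.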
