Date: Mon, 11 Mar 2002 14:15:23 -0800
From: Brian Goetz
To: Lucene Developers List
Subject: Re: Normalization
Message-ID: <20020311141523.A5351@lx.quiotix.com>
References: <007b01c1c93f$930bc7f0$0600a8c0@spaghetti> <000701c1c948$a3d7ed30$1701000a@toronto>
In-Reply-To: <000701c1c948$a3d7ed30$1701000a@toronto>; from murzaku@earthlink.net on Mon, Mar 11, 2002 at 05:03:51PM -0500

> As I have said before in this list, this gets way off of Lucene. The
> normalizer, or the morphologic analyzer or the phonetic transducer, or
> the stemmer, or the thesaurus -- they all could be stand-alone products.

I've got to disagree with you here on two points.

1. Lucene's architecture is all about flexibility and plug-ins.
   Rodrigo's proposal is entirely consistent with that -- offering
   better tools for building Analyzers.
   (Contrast this with some of the proposals that have been flying
   for building crawlers and such -- those truly are off the mark as
   tools to put INTO Lucene.)

2. The vast majority of users will use one of the provided analyzers
   (SimpleAnalyzer, StandardAnalyzer.) Fair or not, Lucene will be
   judged on how well it does on typical documents using the "default"
   tools. Right now, the default tools are unnecessarily weak.

> I used to make such products many years ago and there are companies that
> still sell such tools (e.g. inXight). I like the way Lucene is now: the
> included analyzer/filter could be used as-is but also allows everyone to
> use whatever else they need. One could use the German or Porter stemmer,
> but anyone could easily use other analyzers as well (for example all the
> languages snowball offers.) This is fine as long as Lucene remains a
> library.

My understanding of Rodrigo's idea (filtered through my own view of
the project philosophy) is that he's proposing an "Analyzer
Construction Kit". That seems like a great idea to me, and while we
could say "put it in /contrib", it really does seem like the sort of
thing we want to have.

> As Brian says, what matters is to keep the analyzers synchronized
> between indexing and searching. Is there a way to force this?

Having it generate Analyzer source code seems like a pretty good way
to me.

> I rather prefer changes of the core engine

Is this a change to the core engine, or an additional tool that can be
plugged into the engine? I think the latter.

> that accommodate all/many
> possible "normalizations" like what Joanne Sproston contributed some
> months ago i.e. the possibility to return more than one word for a
> filtered word and store them in the same document position (useful for
> synonyms and for agglutinative languages like Finnish, Turkish etc.)

That's a change to the core. Might be useful, but is a more intrusive
change.
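
For readers following the thread, the idea quoted above (returning more than one word for a filtered word and storing them in the same document position) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions: `PositionedToken` and `SynonymExpander` are hypothetical names invented here, not Lucene classes, and the synonym map is a stand-in for whatever normalizer or thesaurus would supply the alternatives.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical token type (not a Lucene class): a term and its document position. */
final class PositionedToken {
    final String term;
    final int position;

    PositionedToken(String term, int position) {
        this.term = term;
        this.position = position;
    }

    @Override
    public String toString() {
        return term + "@" + position;
    }
}

/** Hypothetical expander: emits each word plus its alternatives at the same position. */
final class SynonymExpander {
    // Assumed input: a map from a word to the extra words that should share its position.
    private final Map<String, List<String>> synonyms;

    SynonymExpander(Map<String, List<String>> synonyms) {
        this.synonyms = synonyms;
    }

    List<PositionedToken> expand(List<String> words) {
        List<PositionedToken> out = new ArrayList<>();
        for (int position = 0; position < words.size(); position++) {
            String word = words.get(position);
            // The original word always keeps its own position.
            out.add(new PositionedToken(word, position));
            // Alternatives are stacked on the same position instead of advancing it.
            List<String> alternatives =
                synonyms.getOrDefault(word, Collections.<String>emptyList());
            for (String alternative : alternatives) {
                out.add(new PositionedToken(alternative, position));
            }
        }
        return out;
    }
}
```

With a map that lists "fast" as an alternative for "quick", expanding the words "quick", "fox" yields quick@0, fast@0, fox@1: the synonym occupies the same position as the word it stands for, so the positions of later words are not shifted. Supporting this inside the index is what makes it a change to the core rather than a plug-in.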