Message-ID: <007b01c1c93f$930bc7f0$0600a8c0@spaghetti>
From: "Rodrigo Reyes"
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Subject: Normalization
Date: Mon, 11 Mar 2002 21:59:01 +0100

Hi,

I'd like to talk about the normalization (a.k.a. filtering) of strings being indexed or searched, and how it is done in Lucene. I'll end with a proposal for another way of handling it.

The Lucene engine includes filters whose purpose is to remove meaningless morphological marks, in order to extend retrieval to pertinent documents that do not match the exact forms users typed in their queries. Some filters are provided off-the-shelf along with Lucene: a Porter stemmer and a stemmer specific to German (chaining one into an analysis pipeline is sketched below). However, my point is not only that there can't be a single stemmer for all languages (this is obvious to everybody, I guess), but that ideally there would be several filters for the same language. For example, the Porter filter is fine for standard English, but rather inappropriate for proper nouns. On the contrary, soundex is probably fine for names, but it generates inaccurate results when used as a filter on a whole document.

Generally speaking, there may be very different normalization strategies, whether highly aggressive (like soundex) or rather soft (like a simple removal of diacritics). It is up to the designer of the search engine to choose a strategy carefully, according to his/her audience and targeted documents. It is even possible to mix several strategies by including an information-extraction system that would additionally store proper nouns, dates, places, etc. in separate indexes.

In my opinion, stemming is not the perfect, unique solution for normalization. For example, I personally prefer a normalization that includes stemming, but also some light phonetic simplification that discards the differences between close phonemes (like the French é/è/ê/ei/ai/ait/ais/aient etc., or ain/ein/in/un etc.), as it gives good results on texts coming from Usenet (while it may be a bit too aggressive for newspaper texts written by journalists).

Well, in fact, my main point is the following: having one filter per language is wrong. My second point is: having the filter algorithm hard-coded in a programming language is wrong as well. There should be a simple way of specifying a filter in a simple, dedicated language.
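To make the off-the-shelf case above concrete, here is roughly what chaining the stock Porter filter into an analysis pipeline looks like. This is a minimal sketch against the TokenStream API as it stood around Lucene 1.2; the class name and sample text are illustrative only, and exact signatures may differ between versions:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    // Hypothetical demo: lowercase-tokenize some text, then run each
    // token through the stock Porter stemming filter.
    public class StemDemo {
        public static void main(String[] args) throws IOException {
            TokenStream stream = new PorterStemFilter(
                new LowerCaseTokenizer(new StringReader("Normalization of indexed documents")));
            // Old TokenStream contract: next() returns null at end of input.
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.println(t.termText());
            }
        }
    }

The German stemmer plugs in the same way, as a TokenFilter wrapped around a tokenizer; the point is that the stemming algorithm itself is compiled Java code, which is exactly what is questioned below.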
On the question of a dedicated language, the Snowball project is really interesting, as it addresses exactly this issue. In my mind, there should mainly be a normalizer engine with many configuration files that are easy to modify in order to implement or adapt a filter. This is an important issue, as the accuracy of the search engine is directly linked to the normalization strategy. However, an important point is also the ease of use of such a language. In my attempt to build such a simple description language, I came up with something that I hope is quite simple, yet powerful enough: it just specifies the letters to transform, the right and left contexts, and the replacement string. In my opinion, this covers 80% of the need for (at least) European languages. I implemented it (in Java) and wrote a normalizer for French, which stems and phonetically simplifies its input.

Just as an example, here is a small excerpt of my French normalizer (written in the toy language I implemented):

    :: sh :: > ch
    :: sch :: > ch
    // transform "in"/"yn" into the same string, when not pronounced "inn"
    :: in :: [~aeiouymn] > 1
    [~aeiouy] :: yn :: [~aeiouynm] > 1    // "syndicat", "synchro", but not "payer"
    :: ives :: $ > if                     // "consécutives"

Before the first "::" is the left context, after the second "::" is the right context. "$" indicates a word boundary. Some features are still missing in my implementation, such as constraints on word length (i.e. applying a transformation only to words that have more than x letters) and the like, but I am globally satisfied with it.

As an example of the results (the two input forms are pronounced identically in French, although the second is not written correctly):

    read:
    result:
    read:
    result:

Before going through the process of submitting it to the Lucene project, I'd like to hear your comments on the approach. Of particular concern is the language used to describe the normalization process, as I am not entirely satisfied with it, but hey, it's hard to find something really simple yet just expressive enough. Any ideas?

Rodrigo
http://www.charabia.net
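For illustration, here is a minimal sketch (in Java, to match the implementation language mentioned above) of how such context rules might be represented and applied. The class and the rule encoding are hypothetical, not the actual implementation: each rule is compiled into a regular expression, with contexts as lookbehind/lookahead, the toy syntax's negated class [~...] written as [^...], and "$" mapped to a \b word boundary:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Hypothetical sketch of a rewrite engine for rules of the form
    //   left :: focus :: right > replacement
    public final class ContextRewriter {

        private static final class Rule {
            final Pattern pattern;
            final String replacement;
            Rule(Pattern pattern, String replacement) {
                this.pattern = pattern;
                this.replacement = replacement;
            }
        }

        private final List<Rule> rules = new ArrayList<Rule>();

        // left/right are regex fragments ("" = no constraint);
        // "$" stands for a word boundary.
        public void addRule(String left, String focus, String right, String replacement) {
            StringBuilder re = new StringBuilder();
            if (left.equals("$")) {
                re.append("\\b");
            } else if (left.length() > 0) {
                re.append("(?<=").append(left).append(')');  // context checked, not consumed
            }
            re.append(focus);
            if (right.equals("$")) {
                re.append("\\b");
            } else if (right.length() > 0) {
                re.append("(?=").append(right).append(')');
            }
            rules.add(new Rule(Pattern.compile(re.toString()), replacement));
        }

        // Apply every rule once, in declaration order, over the whole word.
        public String normalize(String word) {
            String s = word;
            for (Rule r : rules) {
                s = r.pattern.matcher(s).replaceAll(r.replacement);
            }
            return s;
        }

        public static void main(String[] args) {
            ContextRewriter n = new ContextRewriter();
            n.addRule("", "sh", "", "ch");                     // :: sh :: > ch
            n.addRule("", "sch", "", "ch");                    // :: sch :: > ch
            n.addRule("", "in", "[^aeiouymn]", "1");           // :: in :: [~aeiouymn] > 1
            n.addRule("[^aeiouy]", "yn", "[^aeiouynm]", "1");  // [~aeiouy] :: yn :: [~aeiouynm] > 1
            n.addRule("", "ives", "$", "if");                  // :: ives :: $ > if
            System.out.println(n.normalize("syndicat"));       // prints "s1dicat"
            System.out.println(n.normalize("consecutives"));   // prints "consecutif"
        }
    }

With the excerpt's rules loaded as in main(), "syndicat" comes out as "s1dicat" and "consecutives" as "consecutif". A real engine would additionally need well-defined rule ordering and the word-length constraints mentioned in the message.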