From "Alex Murzaku" <>
Subject RE: Normalization
Date Mon, 11 Mar 2002 22:03:51 GMT
As I have said before on this list, this goes well beyond Lucene. The
normalizer, or the morphological analyzer, or the phonetic transducer,
or the stemmer, or the thesaurus -- they all could be stand-alone
products. I used to make such products many years ago, and there are
companies that still sell such tools (e.g. inXight). I like the way
Lucene is now: the included analyzers and filters can be used as-is,
but everyone remains free to use whatever else they need. One could use
the German or Porter stemmer, but anyone could just as easily plug in
other analyzers (for example, all the languages Snowball offers). This
is fine as long as Lucene remains a

As Brian says, what matters is to keep the analyzers synchronized
between indexing and searching. Is there a way to force this?
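
One way to picture "forcing" this, as a minimal sketch and not anything
Lucene provides: let the index own its normalizer, so that adding and
searching both go through the same function. The class NormalizedIndex
and its methods are hypothetical names made up for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Hypothetical toy index (not a Lucene class): the normalizer is fixed at
// construction time, so indexing and searching cannot drift apart.
final class NormalizedIndex {
    private final UnaryOperator<String> normalizer;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    NormalizedIndex(UnaryOperator<String> normalizer) {
        this.normalizer = normalizer;
    }

    // Index every whitespace-separated word of the text under its normalized form.
    void add(int docId, String text) {
        for (String word : text.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(normalizer.apply(word), k -> new TreeSet<>()).add(docId);
        }
    }

    // Normalize the query word with the same function before the lookup.
    Set<Integer> search(String word) {
        return postings.getOrDefault(normalizer.apply(word), Collections.emptySet());
    }
}
```

With a lower-casing normalizer, a document indexed as "Lucene Rocks"
would be found by a search for "LUCENE", because both sides are folded
the same way.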

I would rather see changes to the core engine that accommodate all/many
possible "normalizations", like what Joanne Sproston contributed some
months ago, i.e. the possibility to return more than one word for a
filtered word and store them in the same document position (useful for
synonyms and for agglutinative languages like Finnish, Turkish etc.)
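
To make the "same document position" idea concrete, here is a small
sketch. It is not the contributed patch; PositionedTerm, expand and the
variants map are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: every variant of a word shares that word's position.
final class SamePositionExpansion {
    static final class PositionedTerm {
        final String text;
        final int position;

        PositionedTerm(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    // Emit each word, then its variants (synonyms, alternative stems) at the same position.
    static List<PositionedTerm> expand(List<String> words, Map<String, List<String>> variants) {
        List<PositionedTerm> out = new ArrayList<>();
        for (int pos = 0; pos < words.size(); pos++) {
            String word = words.get(pos);
            out.add(new PositionedTerm(word, pos));
            for (String variant : variants.getOrDefault(word, Collections.<String>emptyList())) {
                out.add(new PositionedTerm(variant, pos));
            }
        }
        return out;
    }
}
```

Because the variants do not advance the position, a phrase query such
as "quick car" could still match a document that says "fast car".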

I'd like to talk about the normalization (aka filter) processing of a
string being indexed/searched, and how it is done in Lucene. I'll end
with a proposal for another method of handling it.

The Lucene engine includes some filters whose purpose is to remove
meaningless morphological marks, in order to extend document retrieval
to pertinent documents that do not match the exact forms used by the
users in their queries.

There are some filters provided off-the-shelf along with Lucene: a
Porter stemmer and a stemmer specific to German. However, my point is
that not only can there not be a single stemmer for all languages (this
is obvious to everybody, I guess), but ideally there would be several
filters for the same language. For example, the Porter filter is fine
for standard English, but rather inappropriate for proper nouns.
Conversely, soundex is probably fine for names, but it generates
inaccurate results when used as a filter on a whole document. Generally
speaking, there may be very different strategies when normalizing text,
whether highly aggressive (like soundex) or rather soft (like a simple
diacritics removal). But it is up to the designer of the search engine
to choose the strategy carefully, according to his/her audience and
targeted documents. It is even possible to mix several strategies by
including an information extraction system that would additionally
store in separate indexes the proper nouns, the dates, the places, etc.
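
As an illustration of the "soft" end of that range, diacritics removal
can be sketched with the standard java.text.Normalizer class. The class
name DiacriticsFolder and the method fold are made up for this example,
and this is only one possible way to do it.

```java
import java.text.Normalizer;

// Sketch of a "soft" normalization: strip accents, leave everything else alone.
final class DiacriticsFolder {
    static String fold(String word) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

For instance, fold("d\u00e9mesur\u00e9") (the word "démesuré") yields
"demesure". Letters without a canonical decomposition are left as they
are.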

In my opinion, stemming is not the perfect, unique solution for
normalization. For example, I personally prefer a normalization that
includes stemming, but also some light phonetic simplification that
discards the differences between close phonemes (like the French
é/è/ê/ei/ai/ait/ais/aient/etc or ain/ein/in/un/etc), as it gives good
results on texts taken from usenet (while it may be a bit too
aggressive for newspaper texts written by journalists).

Well, in fact my main point is the following: having one filter per
language is wrong. Second point: having the filter algorithm hard-coded
in a programming language is wrong as well. There should be a simple
way of specifying a filter in a simple, dedicated language. In this
respect, the Snowball project is really interesting, as it addresses
this issue. In my mind, there should be mainly a normalizer engine,
with many configuration files that are easy to modify in order to
implement or adapt a filter. This is an important issue, as the
accuracy of the search engine is directly linked to the normalization
strategy.

However, an important point is also the ease of use of such a language.
In my attempt to build such a simple description language, I came up
with something that I hope is quite simple, yet powerful enough:
something that just specifies the letters to transform, the right and
left context, and the replacement string. In my opinion, this covers
80% of the need for (at least) European languages. I implemented it (in
Java) and wrote a normalizer for French, which stems and phonetically
simplifies its
Just as an example, here is a small excerpt of my French normalizer
(written in the toy language I implemented):
:: sh ::        > ch
:: sch ::       > ch
// transform the "in"/"yn" into the same string, when not pronounced
 :: in :: [~aeiouymn] > 1
[~aeiouy] :: yn :: [~aeiouynm]  > 1   // "syndicat", "synchro", but not
:: ives :: $ > if    // "consécutives"

Before the first "::" is the left context, after the second "::" is the
right context. "$" indicates a word boundary.
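
The rule format described above can be approximated in a few lines of
Java. This is a sketch and not the implementation mentioned in this
message. The class names and the matching policy are assumptions: rules
are tried in order at each position, the first match wins, contexts are
tested against the original letters, a context is a single bracketed
class, and a negated class such as [~aeiouy] is also taken to match at
a word boundary.

```java
import java.util.List;

// Hypothetical engine for rules of the form: left :: target :: right > replacement
final class RewriteRules {
    static final class Rule {
        final String left, target, right, replacement;

        Rule(String left, String target, String right, String replacement) {
            this.left = left;
            this.target = target;
            this.right = right;
            this.replacement = replacement;
        }
    }

    // "" matches anything, "$" a word boundary, "[abc]" one of the letters,
    // "[~abc]" anything else (including the boundary marker; an assumption).
    static boolean contextMatches(String ctx, char neighbour) {
        if (ctx.isEmpty()) return true;
        if (ctx.equals("$")) return neighbour == '$';
        boolean negated = ctx.startsWith("[~");
        String letters = ctx.substring(negated ? 2 : 1, ctx.length() - 1);
        boolean inSet = letters.indexOf(neighbour) >= 0;
        return negated ? !inSet : inSet;
    }

    static String apply(List<Rule> rules, String word) {
        String padded = "$" + word + "$"; // '$' marks both word boundaries
        StringBuilder out = new StringBuilder();
        int i = 1;
        while (i < padded.length() - 1) {
            Rule hit = null;
            for (Rule r : rules) {
                int end = i + r.target.length();
                if (end <= padded.length() - 1
                        && padded.startsWith(r.target, i)
                        && contextMatches(r.left, padded.charAt(i - 1))
                        && contextMatches(r.right, padded.charAt(end))) {
                    hit = r;
                    break; // first matching rule wins
                }
            }
            if (hit != null) {
                out.append(hit.replacement);
                i += hit.target.length();
            } else {
                out.append(padded.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}
```

With the rule new Rule("[~aeiouy]", "yn", "[~aeiouynm]", "1"), the word
"syndicat" becomes "s1dicat" under these assumptions, and the "ives"
rule with right context "$" turns "actives" into "actif".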

Some features are still missing in my implementation, such as putting
constraints on the word length (e.g. to apply a transformation only to
words that have more than x letters) or the like, but overall I am
satisfied with it.

As an example of the result (the two input forms are pronounced
identically in French, although the second is not written correctly):
read: <démesuré> result: <demezur>
read: <daimesurré> result: <demezur>

Before going through the process of submitting it to the Lucene
project, I'd like to hear your comments on the approach. Of high
concern is the language used to describe the normalization process, as
I am not fully satisfied with it, but hey, it's hard to find something
really simple yet just expressive enough. Any ideas?


