lucene-dev mailing list archives

From "Alex Murzaku" <>
Subject RE: Normalization
Date Wed, 13 Mar 2002 21:33:55 GMT
Hi Rodrigo and Brian,

The power of regex is desirable especially in the left and right context
matching. As it is, you need to write a lot of little rules for every
possible combination. A regex instead would allow for just one rule
covering most of the combinations. For example, you have a rule that
would remove the "ation(s)" at the end of a word. That creates a stem
like "n" for "nation(s)". This kind of problem could be resolved by
having a way to define units bigger than just one letter, for example a

The other feature that I have found useful is the possibility to create
classes of sounds (letters). You get around it with enumeration, but
sometimes it makes sense to be able to define groups of consonants or
vowels, etc.
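
To illustrate the two points above (context matching and letter classes),
here is a minimal sketch in Python rather than in the normalizer's own rule
language. The names VOWELS, strip_naive and strip_guarded are hypothetical,
and the guard (the remaining stem must contain at least one vowel) is only an
assumption about what a sensible left context might be, not a rule taken
from the tool.

```python
import re

# Hypothetical letter class, defined once and reused inside the rule.
VOWELS = "aeiouy"

# Naive rule: strip "ation" or "ations" at the end of any word.
NAIVE = re.compile(r"ations?$")

# Guarded rule: the left context (captured as group 1) must contain
# at least one vowel, so a bare "n" is never left behind as a stem.
GUARDED = re.compile(rf"^([a-z]*[{VOWELS}][a-z]*)ations?$")


def strip_naive(word: str) -> str:
    """Remove the suffix with no regard for what precedes it."""
    return NAIVE.sub("", word)


def strip_guarded(word: str) -> str:
    """Remove the suffix only when the remaining stem contains a vowel."""
    return GUARDED.sub(r"\1", word)


if __name__ == "__main__":
    for w in ("nation", "nations", "normalization", "informations"):
        print(w, strip_naive(w), strip_guarded(w))
```

With the guard in place, one rule covers both the singular and the plural
form, and "nation(s)" is left untouched instead of being reduced to "n".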

But in the end, you are right, regex is too powerful. My point of view
is that once the people using this tool have spent the time to learn and
understand it, they will always aim at covering as many linguistic
exceptions as possible. The present limitations could become

Just my two lipas.


-----Original Message-----
From: Rodrigo Reyes [] 
Sent: Wednesday, March 13, 2002 2:02 PM
To: Lucene Developers List
Subject: Re: Normalization

Hi Alex,

> Would it make sense to allow a full regex in the matching part? Could 
> use regex or oromatcher packages. Don't know how that would affect 
> your hashing though...

I'd give an answer not really different from Brian's: you don't really
need all that power. Although I don't have significant experience with
non-European languages, this is not the first tool of this kind I have
written, and to my knowledge you don't really need more power than that.
At least, not the kind of additional expressiveness that regexps can
provide (although, as I mentioned in another mail, you may need a
restriction on the size of the input or output string; for example,
soundex specifies a 4-letter limit that is not currently addressed
by the language).

However, I'd be very interested in hearing about a counterexample that
would need it. The only one I could find was the annoyance of having to
remove sequences of the same letter, which was awkward, so I added an
option called "uniquify" to do the job more easily (as you can see in
the soundex or french normalizer).
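
As a rough illustration of the two post-processing steps mentioned in this
message, here is a minimal sketch in Python. It is not the normalizer's
actual code: the body of uniquify below is only an assumption about what an
option with that name does (collapse runs of the same letter), and fit_length
is a hypothetical helper for the fixed 4-character size that soundex
requires (soundex pads short codes with zeros).

```python
from itertools import groupby


def uniquify(text: str) -> str:
    """Collapse each run of identical consecutive characters to one."""
    return "".join(key for key, _ in groupby(text))


def fit_length(code: str, size: int = 4, pad: str = "0") -> str:
    """Truncate or right-pad a code so it is exactly `size` characters."""
    return code[:size].ljust(size, pad)


if __name__ == "__main__":
    print(uniquify("bookkeeper"))
    print(fit_length("L2"))
    print(fit_length("T52246"))
```

Neither step needs a regex: both are simple passes over the string, which
fits the argument that the extra expressiveness is rarely required.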


To unsubscribe, e-mail:
For additional commands, e-mail:
