lucene-dev mailing list archives

From Brian Goetz
Subject Re: Normalization
Date Mon, 11 Mar 2002 21:14:46 GMT
Great stuff, Rodrigo!  Welcome.  

Your comments are right on the mark.  While Lucene has a great
architecture for building flexible text processing systems, the
supplied tokenizers and analyzers aren't perfect.  Fortunately,
it's easy to add new ones.

> Well, in fact my main point is the following : having one filter per
> language is wrong. Second point is: having the filter algorithm hard-coded
> in a programming language is wrong as well. There should be a simple way of
> specifying a filter in a simple, dedicated language. In this way, the
> snowball project is really interesting as it solves the issue. In my mind,
> there should be mainly a normalizer engine, with many configuration files,
> easy to modify to implement or adapt a filter. This is an important issue,
> as the accuracy of the search engine is directly linked to the normalization
> strategy.

I'm all for domain-specific languages, but you have to be careful of
making the filter language too easy to change.  If the filter is
changed after the archive is created and documents indexed, query
terms will be normalized differently from the terms already in the
index, and searches will stop working.  So any such filtering language
should produce code (or data) that becomes part of the program, rather
than simply a configuration file along with the program.  In other
words, it should be considered source code, not configuration data.
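
To make the failure mode concrete, here is a minimal sketch.  It is
not Lucene code: the class NormalizerMismatch, its two filter methods,
and the toy in-memory index are all hypothetical, and the accent
stripping is just one example of a rule someone might add to a filter
later.  The index is built with the first filter, then queried with
both the first and the second.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from Lucene.
public class NormalizerMismatch {

    // Filter as it was when the documents were indexed: lowercase only.
    static String lowercaseOnly(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    // Filter after a later edit: lowercase, then strip combining accents.
    static String lowercaseAndStripAccents(String term) {
        String decomposed = Normalizer.normalize(
                term.toLowerCase(Locale.ROOT), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    // Toy inverted index: normalized term -> ids of documents containing it.
    static Map<String, List<Integer>> buildIndex(
            List<String> docs, Function<String, String> filter) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id).split("\\s+")) {
                index.computeIfAbsent(filter.apply(token), k -> new ArrayList<>())
                     .add(id);
            }
        }
        return index;
    }

    // Normalize the query term with the given filter and look it up.
    static List<Integer> search(Map<String, List<Integer>> index,
                                String query,
                                Function<String, String> filter) {
        return index.getOrDefault(filter.apply(query), List.of());
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Caf\u00e9 Rodrigo");
        Map<String, List<Integer>> index =
                buildIndex(docs, NormalizerMismatch::lowercaseOnly);

        // Same filter at index and query time: the document is found.
        System.out.println(
                search(index, "Caf\u00e9", NormalizerMismatch::lowercaseOnly));

        // Filter changed after indexing: the query becomes "cafe",
        // the index still holds "caf\u00e9", and nothing matches.
        System.out.println(
                search(index, "Caf\u00e9", NormalizerMismatch::lowercaseAndStripAccents));
    }
}
```

The first lookup prints [0] and the second prints [], even though the
documents and the query text are identical.  The only fix at that
point is to re-index everything with the new filter, which is why the
filter definition deserves the same change control as source code.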

> Before going on the process of submitting it to the lucene project,
> I'd like to hear your comments on the approach. Of high concern is
> the language used to describe the normalization process, as I am not
> plenty satisfied of it, but hey it's hard to find something really
> simple yet just expressive enough. 

Great idea!  We'd love to have something like this.  This is the sort
of contribution we're really looking for.  I'm willing to help write 
a parser for it if the language gets complicated.
