lucene-dev mailing list archives

From "Rodrigo Reyes" <re...@charabia.net>
Subject Re: Normalization
Date Tue, 12 Mar 2002 22:15:31 GMT
Hi Alex,

 Thanks for your feedback.

> The rules seem to be applied sequentially, and each rule modifies the
> output of the previous one. This is kind of risky, especially if the
> rule set becomes too big: the author of the rules needs to keep this in
> mind at all times. For example, there is a rule for "ons$" and a
> following one for "ions$". The second one will never be matched, because
> the string will already have been changed by the first rule that matches.
> Even though "aimons" and "aimions" should both be reduced to "em", they
> end up as "em" and "emi". Maybe this could be solved by doing longest
> match first.

You're right, rule-masking is a real problem, but not exactly in the example
you give.

The rules are not applied sequentially in the order they appear in the
file; they are stored in a (kind-of) transducer whose first state is keyed
on the first letter of the focused string (i.e. the central pattern, not
the right or left context). In other words, the rules are hashed according
to the first letter of the central string. The normalizer iterates through
the letters of the input, applying the smallest relevant subset of rules
at each letter and reducing the string as it goes. In your example, the
rules for "ions$" would be applied when the normalizer reaches the letter
"i"; the string is reduced at that point, so the rule for "ons$" cannot be
applied afterwards.
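
For what it's worth, here is a minimal Java sketch of that lookup scheme.
This is hypothetical code, not the actual normalizer source: the Rule and
Normalizer names are mine, and contextual conditions and the "$"
end-of-string anchor are left out for brevity.

import java.util.*;

class Rule {
    final String pattern;      // the central string, e.g. "ions"
    final String replacement;  // e.g. "em"
    Rule(String pattern, String replacement) {
        this.pattern = pattern;
        this.replacement = replacement;
    }
}

class Normalizer {
    // One bucket of rules per first letter of the central pattern.
    private final Map<Character, List<Rule>> rulesByFirstLetter = new HashMap<>();

    void addRule(Rule r) {
        rulesByFirstLetter
            .computeIfAbsent(r.pattern.charAt(0), c -> new ArrayList<>())
            .add(r);
    }

    String normalize(String input) {
        StringBuilder output = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            // Only rules whose pattern starts with the current letter
            // are candidates at this position.
            List<Rule> bucket =
                rulesByFirstLetter.getOrDefault(input.charAt(i), Collections.emptyList());
            Rule matched = null;
            for (Rule r : bucket) {
                if (input.startsWith(r.pattern, i)) { matched = r; break; }
            }
            if (matched != null) {
                output.append(matched.replacement);
                i += matched.pattern.length();  // consume the matched span
            } else {
                output.append(input.charAt(i++));
            }
        }
        return output.toString();
    }
}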

However, when the patterns start with the same letter, you're right that
there is a risk of a rule never being applied: with "on" and then "ont",
for example, the latter is unlikely ever to fire. As you suggest, this can
be solved by sorting the rules of a given subset longest first (which is a
very good point, I'll fix it in the source, thanks!). See the sketch below.
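
Concretely, the fix amounts to ordering each bucket by descending pattern
length before matching, so that "ont" is tried before "on". Again just a
sketch, building on the hypothetical Normalizer above:

    // In Normalizer: call once, after all rules have been added.
    void sortBucketsLongestFirst() {
        for (List<Rule> bucket : rulesByFirstLetter.values()) {
            bucket.sort(Comparator.comparingInt((Rule r) -> r.pattern.length()).reversed());
        }
    }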

> The other consequence of the sequential application is that the context
> may change, so some rules could never be reached. I don't remember how
> we got around this.

Within a single pass, the context is never changed, because there is an
input buffer and an output buffer: rules read from the input buffer and
write to the output buffer, so the context a rule sees is never modified
by other rules. The description language in fact allows both kinds of
rules at the same time: rules that rely on the context not being changed
by other rules (this guarantee is provided by the input/output double
buffering), and rules that rely on the changes made by other rules (via
multiple passes over the data, using the #start keyword).
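
The multi-pass behavior could be sketched like this (hypothetical again:
normalize() is the single-pass routine above, and I'm assuming each #start
section of the rule file compiles into one Normalizer):

// Each pass reads only from its input string and writes a fresh output,
// so rules within one pass never see each other's changes; the changes
// only become visible to the rules of the next pass.
static String normalizeMultiPass(String input, List<Normalizer> passes) {
    String current = input;
    for (Normalizer pass : passes) {
        current = pass.normalize(current);  // output of one pass becomes
                                            // the input of the next
    }
    return current;
}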

Rodrigo




