lucene-dev mailing list archives

From "Alex Murzaku" <murz...@earthlink.net>
Subject RE: Normalization
Date Wed, 13 Mar 2002 12:35:42 GMT
Would it make sense to allow a full regex in the matching part? You could
use the regex or OROMatcher packages. I don't know how that would affect
your hashing, though...
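
For instance, a regex-backed rule might look roughly like the sketch
below. This is only an illustration: java.util.regex stands in for
whichever regex package would actually be used, and the RegexRule class
and its fields are made-up names, not part of the normalizer's real API.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a rule whose matching part is a full regular expression.
class RegexRule {
    private final Pattern pattern;     // e.g. "ions$"
    private final String replacement;  // e.g. "i"

    RegexRule(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // Apply the rule once; return the input unchanged if the regex does not match.
    String apply(String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.replaceFirst(replacement) : input;
    }
}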

Alex

-----Original Message-----
From: Rodrigo Reyes [mailto:reyes@charabia.net] 
Sent: Tuesday, March 12, 2002 5:16 PM
To: Lucene Developers List
Subject: Re: Normalization


Hi Alex,

 Thanks for your feedback,

> The rules seem to be applied sequentially, and each rule modifies the 
> output of the previous one. This is kind of risky, especially if the 
> rule set becomes too big. The author of the rules needs to keep this 
> in mind at all times. For example, there is a rule for "ons$" and a 
> following one for "ions$". The second one will never be matched 
> because the string will be changed by the first rule it matches. Even 
> though aimons and aimions should both be reduced to "em", they end up 
> as "em" and "emi". Maybe this could be solved if you do longest match 
> first.

You're right, rule-masking is a real problem, but not exactly in the
example you give.

The rules are not applied sequentially in the order they appear in the
file; they are stored in a (kind-of) transducer whose first state is keyed
on the first letter of the focus string (i.e. neither the right nor the
left context). In other words, the rules are hashed by the first letter of
the central string. The normalizer iterates through the letters of the
word, applying only the small subset of rules hashed under each letter,
and reducing the string as it goes. In your example, the rule for "ions$"
is applied when the normalizer reaches the letter i; it reduces the
string, and the rule for "ons$" therefore cannot be applied afterwards.
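
A rough sketch of that dispatch, just to make the idea concrete (this is
not the actual normalizer code: contextual conditions and the "$" end
anchor are omitted, and all names are illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rules are bucketed by the first letter of their central (focus) string;
// at each position in the word only that bucket is consulted.
class RuleTable {
    static class Rule {
        final String focus;        // central string, e.g. "ions"
        final String replacement;  // e.g. "i"
        Rule(String focus, String replacement) {
            this.focus = focus;
            this.replacement = replacement;
        }
    }

    private final Map<Character, List<Rule>> byFirstLetter =
        new HashMap<Character, List<Rule>>();

    void add(Rule r) {
        Character key = Character.valueOf(r.focus.charAt(0));
        List<Rule> bucket = byFirstLetter.get(key);
        if (bucket == null) {
            bucket = new ArrayList<Rule>();
            byFirstLetter.put(key, bucket);
        }
        bucket.add(r);
    }

    // Walk the word; at each letter, try only the rules hashed under that letter.
    String normalize(String word) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < word.length()) {
            List<Rule> candidates = byFirstLetter.get(Character.valueOf(word.charAt(i)));
            Rule matched = null;
            if (candidates != null) {
                for (Rule r : candidates) {
                    if (word.startsWith(r.focus, i)) { matched = r; break; }
                }
            }
            if (matched != null) {
                out.append(matched.replacement);
                i += matched.focus.length();
            } else {
                out.append(word.charAt(i));
                i++;
            }
        }
        return out.toString();
    }
}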

However, when two focus strings begin with the same character, you're
right, there is a risk of a rule never being applied. For example, with
"on" followed by "ont", the latter is unlikely ever to be used. As you
suggest, this can be solved by sorting the rules of a given subset longest
first (which is a very good point; I'll fix it in the source, thanks!).
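
In code, the fix could be as simple as sorting each bucket by descending
focus length (again just a sketch, reusing the illustrative Rule class
from the snippet above):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Within a bucket sharing the same first letter (e.g. "on" and "ont"),
// try the longer, more specific focus strings first.
class LongestFirst {
    static void sort(List<RuleTable.Rule> bucket) {
        Collections.sort(bucket, new Comparator<RuleTable.Rule>() {
            public int compare(RuleTable.Rule a, RuleTable.Rule b) {
                return b.focus.length() - a.focus.length();
            }
        });
    }
}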

> The other consequence of the sequential application is the possible 
> change of context: some rules could therefore never be reached. I don't 
> remember how we got around this.

The context is never changed within a single pass, because there is an
input buffer and an output buffer: rules read from the input buffer and
write to the output buffer, so the context they match against is never
modified. The description language in fact allows, at the same time, rules
that rely on the context not being changed by other rules (guaranteed by
the input/output double buffering) and rules that rely on the changes made
by other rules (via multiple passes over the data, using the #start
keyword).
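
As a sketch of the buffering idea (the names are mine, not the description
language's, and mapping each #start section to one rule table is an
assumption): within one pass every rule reads from the same frozen input
and writes to a separate output, so no rule sees another rule's changes;
each new pass swaps the buffers.

import java.util.List;

// Illustrative multi-pass driver: one RuleTable per #start section.
class MultiPassNormalizer {
    private final List<RuleTable> passes;

    MultiPassNormalizer(List<RuleTable> passes) {
        this.passes = passes;
    }

    String normalize(String word) {
        String input = word;                        // frozen input buffer for this pass
        for (RuleTable pass : passes) {
            String output = pass.normalize(input);  // rules write only to the output
            input = output;                         // swap buffers before the next pass
        }
        return input;
    }
}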

Rodrigo


