directory-dev mailing list archives

From Alex Karasulu <aok...@bellsouth.net>
Subject Re: Normalizer vs. Comparator
Date Tue, 06 Sep 2005 19:03:20 GMT
Stefan Zoerner wrote:

> Hi all!

Hey sorry for taking so long to respond.

> Here is the whole story:
> I faced the problem that the compare operation does not adhere the 
> matching rules. Therefore I successfully modified the CompareHandler 
> class in org.apache.ldap.server.protocol to do this (whether this is 
> the best place to fix this problem is not the question here).

Ok some theory behind these constructs might shed some light on what 
role they serve in the server. 

Most LDAP servers have a means to extend the schema; however, this means 
is extremely limited when it comes to defining new Syntaxes or new 
MatchingRules.  These constructs are often built into the server and 
cannot be changed without code changes.

When I started designing the schema subsystem of ApacheDS (still not 
finished) I wanted it to be extensible with new Syntaxes and new 
MatchingRules.  To do this I had to understand the fundamental 
components needed to represent new matchingRules and syntaxes.  For 
syntaxes I created an interface called SyntaxChecker.  Every syntax must 
have a SyntaxChecker in order for the schema subsystem to check for 
proper attribute value syntax.  This SyntaxChecker can be a simple regex 
or an entire parser.  As long as the interface is adhered to, the schema 
subsystem can use it to determine whether correct values are being used 
for attributeTypes based on a schema.
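
To illustrate the idea, here is a minimal sketch of a regex-backed 
checker.  The interface SimpleSyntaxChecker and its single method are 
hypothetical stand-ins invented for this example; they are not the 
actual ApacheDS SyntaxChecker interface.

```java
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for a syntax checker.
// This is NOT the real ApacheDS interface, only an illustration.
interface SimpleSyntaxChecker {
    boolean isValidSyntax(Object value);
}

// Regex-backed checker: accepts only strings that fully match the pattern.
final class RegexSyntaxChecker implements SimpleSyntaxChecker {
    private final Pattern pattern;

    RegexSyntaxChecker(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    @Override
    public boolean isValidSyntax(Object value) {
        // Non-string values (including null) are rejected outright.
        return value instanceof String
                && pattern.matcher((String) value).matches();
    }
}
```

A checker backed by a full parser would implement the same method, so 
the schema subsystem would not need to know which kind it is calling.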

The other half, dealing with Comparators and Normalizers, is much more 
complex, and for this you must really understand what a matchingRule 
does.  The server uses matching rules to determine equality and 
ordering.  Before it can do this, string prep must be run on some values 
(normalization) to remove the chance for variance to enter the picture.  
Hence matchingRules can be broken down into Comparators and 
Normalizers.  Some may think a Normalizer is syntax specific; however, 
how you want to match affects normalization, not the syntax.  For 
example, if I have an attribute that is a simple string and I want to 
perform a case insensitive match, then the normalization differs from 
the one used for a case sensitive match.  This shows how normalization 
is specific to matching and not just to a syntax.

Anyway, Normalizers and Comparators are the basis of matchingRules.  A 
new matchingRule must have these defined for its OID, as you probably saw.
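
As a rough sketch of that split, the class below pairs a normalizer 
with a comparator.  SketchMatchingRule and its factory methods are 
names made up for this example (not ApacheDS classes); the point is 
only that the same string syntax gets different normalization 
depending on how you want to match.

```java
import java.util.Comparator;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical pairing of a normalizer and a comparator.
// Illustrative only; not the ApacheDS schema classes.
final class SketchMatchingRule {
    private final UnaryOperator<String> normalizer;
    private final Comparator<String> comparator;

    SketchMatchingRule(UnaryOperator<String> normalizer,
                       Comparator<String> comparator) {
        this.normalizer = normalizer;
        this.comparator = comparator;
    }

    // Both values are normalized first, then compared.
    boolean matches(String a, String b) {
        return comparator.compare(normalizer.apply(a),
                                  normalizer.apply(b)) == 0;
    }

    // Case sensitive matching: no normalization at all.
    static SketchMatchingRule caseExact() {
        return new SketchMatchingRule(UnaryOperator.identity(),
                                      Comparator.naturalOrder());
    }

    // Case insensitive matching: same syntax, different normalizer.
    static SketchMatchingRule caseIgnore() {
        return new SketchMatchingRule(s -> s.toLowerCase(Locale.ROOT),
                                      Comparator.naturalOrder());
    }
}
```

With this sketch, caseExact().matches("Stefan", "STEFAN") is false 
while caseIgnore().matches("Stefan", "STEFAN") is true, even though 
both rules operate on plain strings.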

> It worked better, but not all matching rules satisfied my needs (some 
> are missing). 

Yep we have not filled in any of these really.  Just some very critical 
ones so the directory can operate.  We need help in filling these in.

> One of these is telephoneNumberMatch, and I changed 
> SystemComparatorProducer to replace ComparableComparator with 
> something, that implements the missing matching rule.
>
Cool.  This is exactly what we need to do.

> Two options here to implement this Comparator:
> 1. just implement this interface Comparator, call it 
> TelephoneNumberComparator
> 2. Create a Normalizer for telephone numbers (removing white space and 
> hyphens, transform to e.g. lower case), and instantiate a 
> NormalizingComparator in SystemComparatorProducer which uses it
>
Right these would be the two steps to follow.  One for the Comparator 
and another for the normalizer.
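
A minimal sketch of the normalizer-plus-comparator approach for 
telephone numbers might look like the following.  TelephoneSketch is a 
hypothetical class written for this example; it does not use the real 
Normalizer or NormalizingComparator types, and the stripping rule 
(whitespace and hyphens only) is an assumption taken from the 
description above rather than from the full telephoneNumberMatch 
definition.

```java
import java.util.Comparator;

// Hypothetical helper; not an ApacheDS class.
final class TelephoneSketch {
    private TelephoneSketch() {
    }

    // Strips whitespace and hyphens so only the significant characters
    // take part in matching. Returns a new string; the input is untouched.
    static String normalize(String value) {
        return value.replaceAll("[\\s-]", "");
    }

    // Compares two numbers by their normalized forms only.
    static final Comparator<String> COMPARATOR =
            Comparator.comparing(TelephoneSketch::normalize);
}
```

Note that the comparator normalizes copies at comparison time, so a 
stored value such as "+1 904-555-0100" keeps its formatting and still 
compares equal to "+1904 555 0100".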

> This leads me (finally) to the question, where normalizers are 
> intended to use. I do not want my telephone number get "normalized" 
> before storing it, because that would delete the formatting, which 
> people might like to preserve.

Good question.  Let me try to answer this ...

Normalization is critical when attempting to match two values.  
Sometimes there is extra white space that can be removed to enable 
correct comparisons.  Sometimes normalization is not even needed, if 
the syntax is very rigid without any room for case or space variance.  
Consider matching for cn=Stefan Zoerner, which is in the directory 
(this is what the user who added the entry put as the cn attribute 
value).  Now another user searching for these entries may ask for 
cn=STEFAN   ZOERNER with 3 spaces between STEFAN and ZOERNER.  The two 
users may be the same or different users.  The second user should be 
able to pull the same entries regardless of which of the filters below 
he uses:

(cn=STEFAN   ZOERNER)
(cn= Stefan ZOerner)
(cn=stefan                    zoerner)

So a normalizer would come into play here by generating a canonical 
representation of these inputs.  By default ApacheDS case normalizes by 
reducing to lowercase and then comparing the filter string with the 
normalized attribute value stored within the directory; this is only 
done for matching rules that ignore case.  For whitespace normalization 
ApacheDS tries to follow the string prep operation defined in various 
IETF documents.  However, I'm sure we fall short.  The general rule of 
thumb for ApacheDS is to whitespace normalize while retaining string 
tokenization order.  Meaning we do a deep trim of values, replacing each 
run of whitespace with a single space character.  Whitespace on the ends 
is discarded.  This, btw, is only done when space and whitespace in 
general is not escaped.
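
The deep trim plus lowercasing described above can be sketched as 
follows.  DeepTrimSketch is a hypothetical helper, not the ApacheDS 
implementation, and it makes a simplifying assumption: it does not 
handle escaped whitespace at all.

```java
import java.util.Locale;

// Hypothetical helper illustrating deep trim + case folding.
// Simplification: escaped whitespace is NOT treated specially here.
final class DeepTrimSketch {
    private DeepTrimSketch() {
    }

    static String deepTrimToLower(String value) {
        return value
                .trim()                      // drop whitespace on both ends
                .replaceAll("\\s+", " ")     // collapse inner runs to one space
                .toLowerCase(Locale.ROOT);   // fold case for case-ignore rules
    }
}
```

Under this sketch the three filter values above ("STEFAN   ZOERNER", 
" Stefan ZOerner" and "stefan                    zoerner") all reduce 
to "stefan zoerner", as does the stored value "Stefan Zoerner", so the 
token order is kept while case and spacing differences disappear.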

Hope this helps,
Alex

