directory-api mailing list archives

From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: Prepare String
Date Wed, 30 Mar 2016 11:23:56 GMT
On 28/03/16 12:23, Emmanuel Lécharny wrote:
> Hi guys,
>
> I'm now working on the PrepareString part. It needs a bit of work, as we
> don't correctly handle spaces. We also have to remove the escaping we do
> there.
>
> That is what I'm working on atm.
A bit more of what's going on...

The String Preparation is specified in RFC 4518. It's a process that
involves 6 steps:


      1) Transcode
      2) Map
      3) Normalize
      4) Prohibit
      5) Check bidi
      6) Insignificant Character Handling

The first phase is just a transformation of a byte[] to a String, which
is done through a call to Strings.utf8ToString( bytes ). The good thing
is that Java stores the String using Unicode.
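
As a minimal sketch of the Transcode step described above: the snippet
below uses the JDK's own UTF-8 decoding as a stand-in for
Strings.utf8ToString( bytes ). The class and method names are
hypothetical, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    // Stand-in for Strings.utf8ToString( bytes ): decode UTF-8 bytes
    // into a Java String.
    static String transcode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Lé" encoded in UTF-8: 3 bytes, 2 chars once decoded
        byte[] bytes = { 'L', (byte) 0xC3, (byte) 0xA9 };
        String value = transcode(bytes);
        System.out.println(value.length()); // prints 2
    }
}
```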

The Map phase is a bit more complex, as we have to go through all the
chars and, depending on whether the Syntax is case sensitive or not,
transform each char into one or more others so that they can be
compared safely. There is a long list of special chars to handle (around
1000).

The Normalize phase consists of transforming the String into a
String respecting the NFKC form, described here:
http://www.unicode.org/reports/tr15/tr15-22.html#Specification. This is
also implemented in Java, so we use the Normalizer.normalize( mapped,
Normalizer.Form.NFKC ) method, if necessary.
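
To illustrate the "if necessary" part, here is a small sketch that only
calls Normalizer.normalize() when the String is not already in NFKC
form. The class and method names are hypothetical; only the
java.text.Normalizer calls come from the JDK.

```java
import java.text.Normalizer;

class NormalizeSketch {
    // Return the NFKC form of the mapped String, skipping the costly
    // normalization when the input is already normalized.
    static String normalize(String mapped) {
        if (Normalizer.isNormalized(mapped, Normalizer.Form.NFKC)) {
            return mapped;
        }
        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 (the "fi" ligature) is replaced by the two chars 'f', 'i'
        System.out.println(normalize("\uFB01le"));            // prints file
        // 'e' followed by a combining acute accent composes to U+00E9
        System.out.println(normalize("e\u0301").length());    // prints 1
    }
}
```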

The Prohibit phase is about checking that every char is valid. There
are a few hundred prohibited chars.
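
A rough and deliberately partial sketch of such a check follows. It
only rejects private use, surrogate and unassigned code points plus
U+FFFD, and it relies on the Unicode tables of the running JDK rather
than the tables RFC 4518 refers to, so it is an approximation and not
the full prohibited list. All names are hypothetical.

```java
class ProhibitSketch {
    // Partial check: true when the code point falls in one of the few
    // prohibited categories covered by this sketch.
    static boolean isProhibited(int codePoint) {
        if (codePoint == 0xFFFD) {
            return true; // REPLACEMENT CHARACTER
        }
        int type = Character.getType(codePoint);
        return type == Character.PRIVATE_USE
            || type == Character.SURROGATE
            || type == Character.UNASSIGNED;
    }

    // True when at least one code point of the value is prohibited.
    static boolean containsProhibited(String value) {
        return value.codePoints().anyMatch(ProhibitSketch::isProhibited);
    }

    public static void main(String[] args) {
        System.out.println(containsProhibited("abc"));       // prints false
        System.out.println(containsProhibited("a\uFFFDb"));  // prints true
    }
}
```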

The Check Bidi phase is about dealing with bi-directional characters
(Arabic, for instance). "Bidirectional characters are ignored." says the
RFC, so be it :-)

The insignificant character handling phase is the last one, where we
remove useless spaces or some other specific chars, in various types of
values.


In order to speed up the process, which is quite expensive, the idea is
to assume the value to be ASCII first. In this case, the Normalize,
Prohibit and most of the Map phases can be skipped. We can safely design
a simpler method that will work fast for all those phases, throwing an
exception when we meet a non-ASCII char. If so, we fall back to the more
complex process that involves all the phases and the various String
creations. Somehow, this is the same approach as the one we have for
DNs: FastDnParser and ComplexDnParser.
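
The fast path / fallback idea can be sketched as below. This is only an
illustration under stated assumptions: every name is hypothetical, the
ASCII path handles nothing but case folding, and the complex path is a
placeholder (lower-casing plus NFKC) standing in for the real Map,
Normalize and Prohibit phases.

```java
import java.text.Normalizer;
import java.util.Locale;

class PrepareSketch {
    // Hypothetical signal raised by the fast path on a non-ASCII char.
    static final class NonAsciiException extends Exception {
    }

    // Fast path: a single pass, valid only for pure ASCII values.
    static String prepareAscii(String value, boolean caseSensitive)
            throws NonAsciiException {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c > 0x7F) {
                throw new NonAsciiException();
            }
            if (!caseSensitive && c >= 'A' && c <= 'Z') {
                c = (char) (c + ('a' - 'A')); // fold ASCII upper case
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Placeholder for the complex path; the real one would run every
    // phase with the full mapping and prohibition tables.
    static String prepareComplex(String value, boolean caseSensitive) {
        String mapped = caseSensitive ? value : value.toLowerCase(Locale.ROOT);
        return Normalizer.normalize(mapped, Normalizer.Form.NFKC);
    }

    // Try the fast path first, fall back to the complex one.
    static String prepare(String value, boolean caseSensitive) {
        try {
            return prepareAscii(value, caseSensitive);
        } catch (NonAsciiException e) {
            return prepareComplex(value, caseSensitive);
        }
    }

    public static void main(String[] args) {
        System.out.println(prepare("Hello World", false)); // prints hello world
    }
}
```

The exception is only used as a cheap signal here; a real
implementation may prefer another way to report that the fast path
cannot handle the value.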


One thing that will be completely removed from the prepareString
implementation is the escaping we currently (wrongly) do. It is not
the place to do that.


Bottom line, this String preparation will completely replace the
Normalizers we are using. They are useless parts of our schema.


Last but not least, as this is a COSTLY operation, this function will
only be called when needed (i.e. for ATs we know are used in an Index,
or in the DN's RDN, or when a Filter uses them). That will save a hell
of a lot of CPU. The consequence is that most of the values we receive
or send will *not* be converted to String, we will just keep the byte[]
value. That is the main source of CPU savings.
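
One way to picture this "only when needed" behaviour is a value holder
that keeps the raw byte[] and prepares the String lazily. The sketch
below is an assumption about how this could look, not the actual API
classes: the names are hypothetical and NFKC normalization stands in
for the whole preparation.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class LazyValueSketch {
    private final byte[] bytes;   // value as received, always kept
    private String prepared;      // computed on first request only
    int prepareCalls = 0;         // counter kept for demonstration only

    LazyValueSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    // Cheap: no String is created when only the bytes are needed.
    byte[] getBytes() {
        return bytes;
    }

    // Costly: decode and prepare once, then reuse the cached result.
    String getPrepared() {
        if (prepared == null) {
            prepareCalls++;
            String decoded = new String(bytes, StandardCharsets.UTF_8);
            prepared = Normalizer.normalize(decoded, Normalizer.Form.NFKC);
        }
        return prepared;
    }

    public static void main(String[] args) {
        LazyValueSketch value =
            new LazyValueSketch("test".getBytes(StandardCharsets.UTF_8));
        value.getPrepared();
        value.getPrepared();
        System.out.println(value.prepareCalls); // prints 1
    }
}
```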

Expect the server and the API to be kind of impacted :-)



