directory-dev mailing list archives

From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: Directory Studio: Backslash in DN breaks studio
Date Fri, 11 Mar 2016 00:18:07 GMT
On 10/03/16 22:58, Stefan Seelmann wrote:
> On 03/09/2016 07:59 PM, Emmanuel Lécharny wrote:
>> On 09/03/16 18:54, Philip Peake wrote:
>> Can you be a bit more explicit ?
>>
> Probably same cause as in
> https://issues.apache.org/jira/browse/DIRSTUDIO-1087 and
> https://issues.apache.org/jira/browse/DIRSERVER-2109
>
I took some time last weekend to rethink the whole problem. There are
a lot of things we are doing wrong, IMO. Don't get me wrong though :
most of the time, it simply works.

FTR, I am sending this mail to the dev list, with a copy to the users list.

<this is going to be a long mail...>

First of all, we need to distinguish the clients from the server. They
are two different beasts, and we should assume the server *always*
receives data that is potentially harmful and incorrect.

Then we also need to distinguish String values and Binary values. The
reason we make this distinction is that String values are going to be
encoded in UTF-8 thus using multi-bytes, and also because we need to
convert them from UTF-8 to Unicode (and back).

Let's put aside the binary data at the moment.

The server
==========

Value
-----

We receive UTF-8 Strings, we convert them to Unicode and now we can
process them in Java. We do need this conversion because we need to
check the values before injecting them in the backend. Doing such checks
in UTF-8 would be very impracticable.
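
To make the UTF-8 <-> Unicode step concrete, here is a minimal sketch
of a strict conversion. The Utf8Values class and its methods are
hypothetical names used for illustration only, not the server's actual
code. The point is that malformed bytes are rejected instead of being
silently replaced, which fits the "never trust incoming data" rule.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Apache Directory code base.
final class Utf8Values {
    /** Decodes wire bytes strictly: malformed UTF-8 is rejected, not replaced. */
    static String decode(byte[] wire) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(wire))
                .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Value is not valid UTF-8", e);
        }
    }

    /** Encodes a Java String back to its UTF-8 wire form. */
    static byte[] encode(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] wire = encode("L\u00e9charny");
        System.out.println(wire.length + " bytes -> " + decode(wire));
    }
}
```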

There is one critical operation that is done on values when we process
them : most of the time, we need to compare them to another value,
typically when we have an index associated with this value, or when we
have a search filter. Sadly, comparing two values is not as simple as
doing a lexicographic comparison. We need to 'prepare' the values
according to some very specific rules, and we should also 'normalize'
those values according to some syntax.

A comparison is done following this process :

Val 1 -> normalization -> preparation-+
                                       \
                                        .--> Comparison
                                       /
Val 2 -> normalization -> preparation-+

We can save some processing if one of the two values has already been
normalized or prepared. Actually, we should do that only once for each
value : when it is injected into the server for the first time. But
doing so would also induce some constraints : saving many forms of a
value costs disk space, and time when it comes to reading them from
disk. This is all about balance...
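
The comparison process above can be sketched as follows. This is only
an illustration under loud assumptions : CASE_IGNORE stands for a
per-AttributeType normalizer, and prepare() is a much simplified
stand-in for the real string preparation (here just NFKC plus space
collapsing). None of these names come from the actual code.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch of: value -> normalization -> preparation -> comparison.
final class ValueComparison {
    // Assumed normalizer for a case-insensitive attribute such as cn.
    static final UnaryOperator<String> CASE_IGNORE = v -> v.toLowerCase(Locale.ROOT);

    // Simplified preparation, identical for every value: NFKC, trim, collapse space runs.
    static String prepare(String value) {
        String nfkc = Normalizer.normalize(value, Normalizer.Form.NFKC);
        return nfkc.trim().replaceAll(" +", " ");
    }

    // Both sides go through the same two steps before being compared.
    static boolean matches(String v1, String v2, UnaryOperator<String> normalizer) {
        String left = prepare(normalizer.apply(v1));
        String right = prepare(normalizer.apply(v2));
        return left.equals(right);
    }

    public static void main(String[] args) {
        System.out.println(matches("a   b", "A b", CASE_IGNORE));
        System.out.println(matches("a b", "ab", CASE_IGNORE));
    }
}
```

Storing the output of prepare() next to the original value would skip
these two steps at comparison time, at the price of extra disk space.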

Anyway, most of the time, we get a value and we just need to store it
into the backend after having checked its syntax. And that's the key :
checking the syntax requires some preparation. Here is how we proceed
when we just need to check the syntax :

Value --> normalization --> syntax check

There is no string preparation.

The normalization is specific to each AttributeType. The String
Preparation is the same for all the values.
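
As an illustration of the "normalization then syntax check" path, here
is a small sketch for an imaginary digits-only attribute. Both the
normalizer and the checker are assumptions made up for the example;
the real ones depend on the AttributeType's schema definition.

```java
// Hypothetical sketch of: value -> normalization -> syntax check.
final class SyntaxCheck {
    // Assumed normalizer for a digits-only attribute: spaces are not significant.
    static String normalize(String value) {
        return value.replace(" ", "");
    }

    // Assumed syntax checker: at least one character, ASCII digits only.
    static boolean isValidSyntax(String normalized) {
        return normalized.matches("[0-9]+");
    }

    // No string preparation here: we only decide whether the value may be stored.
    static boolean accept(String value) {
        return isValidSyntax(normalize(value));
    }

    public static void main(String[] args) {
        System.out.println(accept("01 23 45"));
        System.out.println(accept("01-23"));
    }
}
```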


Now, there are two specific use cases : filters, and DN.


Filter
------

A filter always contains a String that needs to be processed to give a
tuple : <attributeType, value>. There are rules that must be applied to
transform the incoming filter to this tuple. Once we have created this
tuple, we can normalize and prepare the tuple's value : something that
might be complex, especially when dealing with substring matches.

So for filter, the process is :

filter -> preProcessing -> Tuple<AttributeType, Value> -> normalization
-> preparation


The String preparation is required because the filter's value will be
compared with what we fetch from the backend.
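
Here is a rough sketch of the filter pre-processing step, limited to a
simple equality filter like (cn=value). The SimpleFilter class is a
made-up name, and the sketch assumes the usual backslash + two hex
digits escaping for filter values. Substring, AND/OR/NOT and the other
filter forms are deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: turns "(attr=value)" into a <attributeType, value> tuple.
final class SimpleFilter {
    final String attributeType;
    final byte[] value; // unescaped, binary form of the value

    private SimpleFilter(String attributeType, byte[] value) {
        this.attributeType = attributeType;
        this.value = value;
    }

    static SimpleFilter parse(String filter) {
        int eq = filter.indexOf('=');
        if (!filter.startsWith("(") || !filter.endsWith(")") || eq < 2) {
            throw new IllegalArgumentException("Not a simple equality filter: " + filter);
        }
        String raw = filter.substring(eq + 1, filter.length() - 1);
        return new SimpleFilter(filter.substring(1, eq), unescape(raw));
    }

    // Replaces each backslash + two hex digits by the byte it stands for.
    static byte[] unescape(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < raw.length()) {
            if (raw.charAt(i) == '\\') {
                if (i + 2 >= raw.length()) {
                    throw new IllegalArgumentException("Truncated escape in: " + raw);
                }
                out.write((hex(raw.charAt(i + 1)) << 4) | hex(raw.charAt(i + 2)));
                i += 3;
            } else {
                int cp = raw.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return out.toByteArray();
    }

    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("Not a hex digit: " + c);
    }

    public static void main(String[] args) {
        SimpleFilter f = parse("(cn=a\\2ab)");
        System.out.println(f.attributeType + " = " + new String(f.value, StandardCharsets.UTF_8));
    }
}
```

The resulting tuple's value would then go through normalization and
preparation, like any other value.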

DN
--

The DN is not a String. It's a list of RDNs, where each RDN is a list of
AVAs, and each AVA is a tuple <attributeType, value>. However, like a
filter, it is received and stored as a String, and there are some
specific rules to follow to transform that String into RDNs. Bottom
line, the DN preprocessing is the following :

DN String --> preProcessing -> Rdns, AVA, Tuple<AttributeType, value>
[-> normalization -> preparation] (for each AVA)

Again, the String preparation is needed because we will store the RDN
into an index, and that requires some comparison (note that it's not
always the case, typically for attributeType with a DN syntax).
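
To illustrate the slicing part of this pre-processing, here is a
minimal sketch. DnSlicer and Ava are hypothetical names, not the API's
classes. The sketch only honours backslash escapes : quoted values,
the ';' separator and leading/trailing space rules are ignored, and
values are kept in their escaped form.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: DN String -> list of RDNs -> list of AVAs.
final class DnSlicer {
    /** One attribute/value pair; the value is kept in its escaped form. */
    static final class Ava {
        final String attributeType;
        final String escapedValue;

        Ava(String attributeType, String escapedValue) {
            this.attributeType = attributeType;
            this.escapedValue = escapedValue;
        }
    }

    /** Splits on an unescaped separator; a backslash protects the next char. */
    static List<String> split(String s, char separator) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                current.append(c).append(s.charAt(++i));
            } else if (c == separator) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    static List<List<Ava>> parse(String dn) {
        List<List<Ava>> rdns = new ArrayList<>();
        if (dn.isEmpty()) {
            return rdns; // the empty DN has no RDN
        }
        for (String rdn : split(dn, ',')) {
            List<Ava> avas = new ArrayList<>();
            for (String ava : split(rdn, '+')) {
                int eq = ava.indexOf('=');
                if (eq <= 0) {
                    throw new IllegalArgumentException("Invalid AVA: " + ava);
                }
                avas.add(new Ava(ava.substring(0, eq).trim(), ava.substring(eq + 1)));
            }
            rdns.add(avas);
        }
        return rdns;
    }

    public static void main(String[] args) {
        for (List<Ava> rdn : parse("cn=Doe\\, John+sn=Doe,dc=example,dc=com")) {
            for (Ava ava : rdn) {
                System.out.println(ava.attributeType + " -> " + ava.escapedValue);
            }
        }
    }
}
```

Each AVA value would then be unescaped, normalized and prepared.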


Comparing values
----------------

We saw that we need to normalize and prepare values before being able to
compare them. A good question would be : do we need to prepare the
String beforehand or when we need to compare values ? That's quite
irrelevant : it's a choice that needs to be made at some point, but it
only impacts performance and storage size. We can consider that
when we start comparing two values, they are already prepared (either
because we have stored a prepared version of the String, or because we
have just prepared the String on the fly before calling the compare method).



The Client
==========

I will just talk about the Ldap API here, I'm not interested in any
other client.

We have two flavors : schema aware and schema agnostic. We also have to
consider two aspects : when we send data to the server, and when we
process the result.


Schema agnostic client
----------------------

There is not much we can do here : we have no idea what the value's
syntax is, so we can't normalize the value. Bottom line, here
is the basic processing of a value sent to the server :

- we don't touch the values. At all. We just convert them from Unicode
to UTF-8
- we pre-process filters to feed the SearchRequest. Values are unescaped
(ie the escaped chars are replaced by their binary counterpart)
- we don't touch the DN

When values are received from the server, we need to process the data
this way :

- we don't touch the values, we just convert them from UTF-8 to Unicode
- we don't touch the DN : it's already in String format, we just convert
it from UTF-8 to Unicode

Schema aware client
-------------------

This is more complex, because now, we can process the values before
sending them to the server. This puts some load on the client side
instead of pounding the server with incorrect data that will get
rejected anyway.

- Values : we normalize them, prepare them and check their syntax. At
the end, we convert the original value from Unicode to UTF-8. As we can
see, we lose the normalized and prepared value.
- Filter : we unescape it, then we convert it to UTF-8
- Dn : we parse it, unescape it, normalize each value, and at the end,
if the DN is valid, we send the original value as is, after having
converted it to UTF-8


As we can see, all we do is check the values before sending them
to the remote server, except for the filter.

For the received values, we first convert them to Unicode and that's
pretty much it.


Escaping
--------

DN and Filters need some pre-processing called unescaping when we have
to transform them from a String to an internal instance. For the Filter,
this is always done on the client side; for the DN, it is done on the
server side. The idea is to transform those values from a String (human
readable) form to a binary form.


What we do wrong
----------------

We will only focus on the schema aware API here. This is what we use on
the server side anyway...

* First, we are depending on the same API on both sides (client and
server). This makes things more complex, because the context is
different. For instance, there is no need to parse the DN on the client,
but we still do it. I'm not sure that we could easily avoid doing so.
To some extent, we are penalizing the client.

* The most complex situation is when we have to process the DN. This is
always done in two phases :
- slice the DN into RDNs, the RDNs into AVAs containing Values
- apply the schema on each value

We could easily imagine doing the processing in a single pass.
Actually, it is an error not to do so : this costs time, and the
classes are therefore not immutable.

* One specific problematic point is when we process escaped chars. For
instance, something like : 'cn=a\ \ \ b' is just a cn with a value
containing 3 spaces. This is what should be returned to the user, and
not a value with only one space. *But* we will be able to retrieve this
value using one of those filters : (cn=a b) or (cn=a  b) or
(cn=a         b). Actually the number of spaces is irrelevant when
comparing the value, but not when it comes to sending the value back to
the user. Again, it has everything to do with the distinction between
storing values and comparing values.
For filters, we must unescape the String before sending it to the
server. The server does not handle the Filter as a String.

* The PrepareString class needs to be reviewed. We don't handle spaces
the way it's supposed to be done.
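
The escaped spaces case can be sketched like this, to show the
difference between the stored value and the compared value. The
EscapedSpaces class is hypothetical, and forComparison() is only a
crude stand-in for the space handling of the real string preparation.
Hex pair escapes (like \20) are rejected here to keep the sketch short.

```java
// Hypothetical sketch: 'a\ \ \ b' is stored with 3 spaces but compared with 1.
final class EscapedSpaces {
    // Characters that may follow a backslash in this sketch (no hex pairs).
    private static final String SPECIALS = " \"#+,;<=>\\";

    /** Returns the value as the user must get it back: every space is kept. */
    static String unescape(String escaped) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 == escaped.length() || SPECIALS.indexOf(escaped.charAt(i + 1)) < 0) {
                throw new IllegalArgumentException("Unsupported escape at index " + i);
            }
            out.append(escaped.charAt(++i));
        }
        return out.toString();
    }

    /** Returns the form used for comparison only: space runs are collapsed. */
    static String forComparison(String value) {
        return value.trim().replaceAll(" +", " ");
    }

    public static void main(String[] args) {
        String stored = unescape("a\\ \\ \\ b");
        System.out.println("stored   : [" + stored + "]");
        System.out.println("compared : [" + forComparison(stored) + "]");
    }
}
```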

Value Class
-----------

I'm not exactly proud of it. It was a way to avoid having code like :

    if ( value instanceof String )
    {
        // This is a String
    }
    else
    {
        // This is a byte[]
    }

so now, we have StringValue and BinaryValue, both of which can be used
with an AttributeType when they are SchemaAware. In retrospect, I think
the distinction between String and Binary values was an error. We should
have a Value, holding both, with a flag in it. Changing that means we
review the entire code, again...
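
Just to give an idea of what "a Value holding both, with a flag" could
look like, here is a rough sketch. UnifiedValue and all its methods
are invented for this mail; this is not a design proposal for the
actual API, and it leaves out the AttributeType and the normalized
form entirely.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a single value class replacing StringValue/BinaryValue.
final class UnifiedValue {
    private final byte[] bytes;          // always present: the wire form
    private final String string;         // null when the value is binary
    private final boolean humanReadable; // the flag telling which form is valid

    private UnifiedValue(byte[] bytes, String string, boolean humanReadable) {
        this.bytes = bytes;
        this.string = string;
        this.humanReadable = humanReadable;
    }

    static UnifiedValue ofString(String value) {
        return new UnifiedValue(value.getBytes(StandardCharsets.UTF_8), value, true);
    }

    static UnifiedValue ofBytes(byte[] value) {
        return new UnifiedValue(value.clone(), null, false);
    }

    boolean isHumanReadable() {
        return humanReadable;
    }

    String getString() {
        if (!humanReadable) {
            throw new IllegalStateException("Binary value has no String form");
        }
        return string;
    }

    byte[] getBytes() {
        return bytes.clone(); // defensive copy keeps the instance immutable
    }

    public static void main(String[] args) {
        UnifiedValue cn = ofString("test");
        UnifiedValue photo = ofBytes(new byte[] { 0x01, 0x02 });
        System.out.println(cn.isHumanReadable() + " / " + photo.isHumanReadable());
    }
}
```

The caller checks the flag once instead of testing the type with
instanceof everywhere.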



Conclusion
==========

This is not a pleasant situation. We have some cases where we don't
handle things correctly, and this is largely due to some choices made a
decade ago. Now, I don't think that this should be kept as is. Sometimes
a big refactoring is better than patching this and that...


Now, feel free to express yourself, I would be very happy to have your
opinion.

Many thanks !

