directory-dev mailing list archives

From Emmanuel Lécharny <>
Subject Value processing on the server (was Re: Directory Studio: Backslash in DN breaks studio)
Date Fri, 11 Mar 2016 22:54:47 GMT
I renamed the thread for clarity.

I'd like to add a few more comments and some information.

We could question the rationale behind normalizing preemptively when we
receive the data. All in all, we will return entries as they were
injected, without normalization, and most of an Entry's values will
never need to be compared, and thus won't need to be normalized.

As I said, normalization is mainly needed for comparisons. So why do we
normalize *all* the values ?

This is a choice made early in the Server's design. For the values that
are going to be compared, we have three options :
-1- normalize when we initially receive the data
-2- normalize the value only when we need to compare it
-3- normalize the value when we first need it, and keep the normalized
value in memory, to avoid subsequent normalizations.

Let's see what would be the consequences of the third option.

The big advantage of this choice is that we save a lot of CPU when we
are just going to send back the values as they were received. If we push
this logic to its limit, we don't even need to transform them from UTF-8
to a String. The problem is that *if* we need to compare such a value,
then we have to normalize it on the fly. Now, we can have a cache that
keeps track of the normalized values. A cache means contention, or it
means memory consumption. If we don't want contention, then we need to
use a thread-local (TLS) cache, but then we will eat way more memory. At
some point, unless we have a hell of a lot of memory, we will spend more
time checking whether the normalized value is in the cache and, if not,
normalizing it and putting it in the cache (most certainly discarding
another value from this limited cache). If we don't use a TLS cache,
then we will have some contention, and that would be an absolute killer.
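
To make the shared-cache variant concrete, here is a minimal sketch of a
bounded cache of normalized values. The class name NormalizedValueCache
and the trim + lower-case normalizer are assumptions made for
illustration, not the server's code. The single lock around the map is
where the contention would come from, and the eviction hook is where
another value gets discarded.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical bounded cache of normalized values (illustration only). */
final class NormalizedValueCache {
    private final Map<String, String> cache;

    NormalizedValueCache(final int maxEntries) {
        // Access-ordered LinkedHashMap: the least recently used entry is evicted first.
        Map<String, String> lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
        // One lock guards every lookup: this is the contention point.
        this.cache = Collections.synchronizedMap(lru);
    }

    /** Stand-in normalizer; a real one depends on the AttributeType. */
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the cached normalized form, normalizing on a cache miss. */
    String normalized(String raw) {
        return cache.computeIfAbsent(raw, NormalizedValueCache::normalize);
    }

    int size() {
        return cache.size();
    }

    boolean isCached(String raw) {
        return cache.containsKey(raw);
    }
}
```

A thread-local variant would hold one such map per thread, trading the
lock for one copy of the cache per thread.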

Another option would be to store the two forms in the Value instance :
byte[] and normalized String. If the normalized String is null, then if
we need it, we normalize the byte[]. As the Value is supposed to be
immutable, it's ok to do so, except if the Value is shared across
multiple threads (which is possible because we are caching entries).
Now, we have to introduce some level of contention, and that is not
good news.
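
As a rough sketch of that idea, the hypothetical LazyValue class below
keeps the received bytes and fills in the normalized String on first
use. The class name and the trim + lower-case normalizer are assumptions
for illustration only. The synchronized accessor is one simple way to
make the lazy field safe when a cached entry is shared, and it is
exactly where the contention shows up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical value holding the wire bytes plus a lazily computed normalized form. */
final class LazyValue {
    private final byte[] utf8;   // bytes as received from the client
    private String normalized;   // null until first needed; guarded by "this"

    LazyValue(byte[] utf8) {
        this.utf8 = utf8.clone();
    }

    /** Returns the value untouched, as it would be sent back to a client. */
    byte[] bytes() {
        return utf8.clone();
    }

    /** Normalizes on first use; the lock is the cost of sharing the instance. */
    synchronized String normalized() {
        if (normalized == null) {
            String decoded = new String(utf8, StandardCharsets.UTF_8);
            normalized = decoded.trim().toLowerCase(Locale.ROOT); // stand-in normalizer
        }
        return normalized;
    }

    synchronized boolean isNormalized() {
        return normalized != null;
    }
}
```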

Last but not least, we will have to normalize this Value every time we
read it from the disk when it's not in the cache. Not exactly free...

The second option is clearly the worst : we will have to normalize the
value again and again. That may be tens of times in some cases. We
certainly don't want that.

That leaves us with the first option, where the value is normalized only
once, when we receive it from the client. The consequence is that we
will write more data to disk, and when we read it back, we will read
more data too. This comes with a cost (more deserialization), but as we
use some cache, this is acceptable. The problem is that as we store more
data, we put more pressure on the cache and the GC.

All in all, this is a balance, and we decided a long time ago that
having the normalized value available at all times was way simpler, and
potentially more efficient.

Also keep in mind that many values won't be normalized at all (as they
won't have any matching rule).

Le 11/03/16 01:18, Emmanuel Lécharny a écrit :
> Le 10/03/16 22:58, Stefan Seelmann a écrit :
>> On 03/09/2016 07:59 PM, Emmanuel Lécharny wrote:
>>> Le 09/03/16 18:54, Philip Peake a écrit :
>>> Can you be a bit more explicit ?
>> Probably same cause as in
>> and
> I took some time last week-end to re-think the whole problem. There are
> a lot of things we are doing wrong, IMO. Don't get me wrong though :
> most of the time, it simply works.
> FTR, I send this mail to the dev list, copying it to the users list.
> <this is going to be a long mail...>
> First of all, we need to distinguish the clients from the server. They
> are two different beasts, and we should assume the server *always*
> receives data that is potentially harmful and incorrect.
> Then we also need to distinguish String values and Binary values. The
> reason we make this distinction is that String values are going to be
> encoded in UTF-8 thus using multi-bytes, and also because we need to
> convert them from UTF-8 to Unicode (and back).
> Let's put aside the binary data at the moment.
> The server
> ==========
> Value
> -----
> We receive UTF-8 Strings, we convert them to Unicode and now we can
> process them in Java. We do need this conversion because we need to
> check the values before injecting them in the backend. Doing such checks
> in UTF-8 would be very impractical.
> There is one critical operation that is done on values when we process
> them : most of the time, we need to compare them to another value,
> typically when we have an index associated with this value, or when we
> have a search filter. Comparing two values is sadly not as simple as
> doing a lexicographic comparison. We need to 'prepare' the values
> according to some very specific rules, and we should also 'normalize'
> those values according to some syntax.
> A comparison is done following this process :
> Val 1 -> normalization -> preparation-+
>                                        \
>                                         .--> Comparison
>                                        /
> Val 2 -> normalization -> preparation-+
> We can save some processing if one of the two values has already been
> normalized or prepared. Actually, we should do that only once for each
> value : when it is injected into the server for the first time. But
> doing so also has a cost : disk usage (saving many forms of a value
> costs space, and time when it comes to reading them from disk. This is
> all about balance...).
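
A minimal sketch of the comparison pipeline drawn above, assuming
stand-in steps : normalize() only folds case, and prepare() only trims
and collapses runs of spaces. Both are simplifications chosen for
illustration (prepare() is not an RFC 4518 implementation), and the
class name ValueComparison is hypothetical.

```java
import java.util.Locale;

/** Hypothetical comparison: normalize, then prepare, then compare. */
final class ValueComparison {
    /** Stand-in for an AttributeType-specific normalizer: case folding only. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Stand-in for string preparation: trim and collapse runs of spaces. */
    static String prepare(String value) {
        return value.trim().replaceAll(" +", " ");
    }

    /** Compares the two values after both have gone through the same pipeline. */
    static int compare(String left, String right) {
        return prepare(normalize(left)).compareTo(prepare(normalize(right)));
    }
}
```

Storing the result of prepare(normalize(value)) next to the original
value is what lets one side of the comparison skip the pipeline.
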
> Anyway, most of the time, we get a value and we just need to store it
> into the backend after having checked its syntax. And that's the key :
> checking the syntax requires some preparation. Here is how we proceed
> when we just need to check the syntax :
> Value --> normalization --> syntax check
> There is no string preparation.
> The normalization is specific to each AttributeType. The String
> Preparation is the same for all the values.
> Now, there are two specific use cases : filters, and DN.
> Filter
> ------
> A filter always contains a String that needs to be processed to give a
> tuple : <attributeType, value>. There are rules that must be applied to
> transform the incoming filter to this tuple. Once we have created this
> tuple, we can normalize and prepare the tuple's value : something that
> might be complex, especially when dealing with substring matches.
> So for a filter, the process is :
> filter -> preProcessing -> Tuple<AttributeType, Value> -> normalization
> -> preparation
> The String preparation is required because the filter's value will be
> compared with what we fetch from the backend.
> DN
> --
> The DN is not a String. It's a list of RDNs, where each RDN is a list of
> AVAs, where each AVA is a tuple <attributeType, Value>. However, like a
> filter, it is received and stored as a String, and there are some
> specific rules to follow to transform that String into RDNs. Bottom
> line, the DN preprocessing is the following :
> DN String --> preProcessing -> Rdns, AVA, Tuple<AttributeType, value>
> [-> normalization -> preparation] (for each AVA)
> Again, the String preparation is needed because we will store the RDN
> into an index, and that requires some comparison (note that it's not
> always the case, typically for attributeType with a DN syntax).
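
The preProcessing step for a DN can be sketched as below. This is a
deliberately simplified illustration, not the API's parser : it only
honours backslash escapes when splitting, ignores quoted and hex-encoded
values, and leaves the values escaped. The class name DnSketch and the
List<List<String[]>> shape (RDNs, then AVAs, then {attributeType, value})
are assumptions made for the example. Normalization and preparation of
each AVA would come after this step.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified DN splitter: DN -> RDNs -> AVAs -> {type, value}. */
final class DnSketch {
    /** Splits on an unescaped separator; a backslash protects the next character. */
    static List<String> splitUnescaped(String input, char separator) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Keep the escape pair intact; unescaping happens later.
                current.append(c).append(input.charAt(++i));
            } else if (c == separator) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    /** Returns one list per RDN, each holding {attributeType, value} pairs. */
    static List<List<String[]>> parse(String dn) {
        List<List<String[]>> rdns = new ArrayList<>();
        for (String rdn : splitUnescaped(dn, ',')) {
            List<String[]> avas = new ArrayList<>();
            for (String ava : splitUnescaped(rdn, '+')) {
                int eq = ava.indexOf('=');
                if (eq < 1) {
                    throw new IllegalArgumentException("Invalid AVA: " + ava);
                }
                avas.add(new String[] { ava.substring(0, eq).trim(), ava.substring(eq + 1) });
            }
            rdns.add(avas);
        }
        return rdns;
    }
}
```
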
> Comparing values
> ----------------
> We saw that we need to normalize and prepare values before being able to
> compare them. A good question would be : do we need to prepare the
> String beforehand or when we need to compare values ? That's quite
> irrelevant : it's a choice that needs to be made at some point, but it
> just impacts the performance and the storage size. We can consider that
> when we start comparing two values, they are already prepared (either
> because we have stored a prepared version of the String, or because we
> have just prepared the String on the fly before calling the compare method).
> The Client
> ==========
> I will just talk about the Ldap API here, I'm not interested in any
> other client.
> We have two flavors : schema aware and schema agnostic. We also have to
> consider two aspects : when we send data to the server, and when we
> process the result.
> Schema agnostic client
> ----------------------
> There is not much we can do here. We have no idea what the value's
> syntax can be, so we can't normalize the value. Bottom line, here
> is the basic processing of a value sent to the server :
> - we don't touch the values. At all. We just convert them from Unicode
> to UTF-8
> - we pre-process filters to feed the SearchRequest. Values are unescaped
> (i.e. the escaped chars are replaced by their binary counterpart)
> - we don't touch the DN
> When values are received from the server, we need to process the data
> this way :
> - we don't touch the values, we just convert them from UTF-8 to Unicode
> - we don't touch the DN : it's already in String format, we just convert
> it from UTF-8 to Unicode
> Schema aware client
> -------------------
> This is more complex, because now, we can process the values before
> sending them to the server. This puts some load on the client side
> instead of pounding the server with incorrect data that will get
> rejected anyway.
> - Values : we normalize them, prepare them and check their syntax. At
> the end, we convert the original value from Unicode to UTF-8. As we can
> see, we lose the normalized and prepared value.
> - Filter : we unescape them, then we convert them to UTF-8
> - Dn : we parse it, unescape it, normalize each value, and at the end,
> if the DN is valid, we send the original value as is, after having
> converted it to UTF-8
> As we can see, all we do is check the values before sending them
> to the remote server, except for the filter.
> For the received values, we first convert them to Unicode and that's
> pretty much it.
> Escaping
> --------
> DN and Filters need some pre-processing called unescaping when we have
> to transform them from a String to an internal instance. For the Filter,
> this is always done on the client side; for the DN, it is done on the
> server side. The idea is to transform those values from a String (human
> readable) form to a binary form.
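
For filter values, the escape syntax defined by RFC 4515 is a backslash
followed by two hex digits. A minimal sketch of the unescaping step
follows; the class name FilterUnescape and its error handling are
assumptions for illustration, not the API's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical unescaper for RFC 4515 filter assertion values. */
final class FilterUnescape {
    /** Replaces each backslash + two hex digits by the byte it encodes. */
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '\\') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Invalid escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Unescaped characters are simply encoded as UTF-8.
                int codePoint = escaped.codePointAt(i);
                byte[] utf8 = new String(Character.toChars(codePoint))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(codePoint);
            }
        }
        return out.toByteArray();
    }
}
```

The result is a byte[] rather than a String because an escaped pair can
encode any byte, including ones that are not valid UTF-8 on their own.
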
> What we do wrong
> ----------------
> We will only focus on the schema aware API here. This is what we use on
> the server side anyway...
> * First, we are depending on the same API on both sides (client and
> server). This makes things more complex, because the context is
> different. For instance, there is no need to parse the DN on the client,
> but we still do it.  I'm not sure that we could easily avoid doing so.
> To some extent, we are penalizing the client.
> * The most complex situation is when we have to process the DN. This is
> always done in two phases :
> - slice the DN into RDNs, the RDNs into AVAs containing Values
> - apply the schema on each value
> We could easily imagine doing the processing in one single pass.
> Actually, it is an error not to do so : this costs time, and the
> classes are therefore not immutable.
> * One specific problematic point is when we process escaped chars. For
> instance, something like : 'cn=a\ \ \ b' is just a cn with a value
> containing 3 spaces. This is what should be returned to the user, and
> not a value with only one space. *But* we will be able to retrieve this
> value using one of those filters : (cn=a b) or (cn=a  b) or (cn=
> a         b). Actually the number of spaces is irrelevant when comparing
> the value, but it matters when it comes to sending the value back to
> the user. Again, it all comes down to the distinction between storing
> values and comparing values.
> For filters, we must unescape the String before sending it to the
> server. The server does not handle the Filter as a String.
> * The PrepareString class needs to be reviewed. We don't handle spaces
> the way it's supposed to be done.
> Value Class
> -----------
> I'm not exactly proud of it. It was a way to avoid having code like :
>     if ( value instanceof String )
>     {
>         // This is a String
>     }
>     else
>     {
>         // This is a byte[]
>     }
> So now, we have StringValue and BinaryValue, both of which can be used
> with an AttributeType when they are SchemaAware. In retrospect, I think
> the distinction between String and Binary values was an error. We should
> have a Value, holding both, with a flag in it. Changing that means
> reviewing the entire code, again...
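
One possible shape for such a single Value type is sketched below. The
class name UnifiedValue, the humanReadable flag and the factory methods
are all hypothetical, chosen only to illustrate "holding both, with a
flag".

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical single Value type: bytes always present, String only when human readable. */
final class UnifiedValue {
    private final byte[] bytes;
    private final String string;          // null for binary values
    private final boolean humanReadable;  // the flag replacing the two subclasses

    private UnifiedValue(byte[] bytes, String string, boolean humanReadable) {
        this.bytes = bytes;
        this.string = string;
        this.humanReadable = humanReadable;
    }

    static UnifiedValue ofString(String value) {
        return new UnifiedValue(value.getBytes(StandardCharsets.UTF_8), value, true);
    }

    static UnifiedValue ofBytes(byte[] value) {
        return new UnifiedValue(value.clone(), null, false);
    }

    boolean isHumanReadable() {
        return humanReadable;
    }

    byte[] getBytes() {
        return bytes.clone();
    }

    String getString() {
        if (!humanReadable) {
            throw new IllegalStateException("Binary value has no String form");
        }
        return string;
    }
}
```
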
> Conclusion
> ==========
> This is not a pleasant situation. We have some cases where we don't
> handle things correctly, and this is largely due to some choices made a
> decade ago. Now, I don't think that this should be kept as is. Sometimes
> a big refactoring is better than patching this and that...
> Now, feel free to express yourself, I would be very happy to have your
> opinion.
> Many thanks !
