directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alex Karasulu" <akaras...@apache.org>
Subject [ApacheDS][Partition] Using surrogate keys for attributeType aliases and objectClass aliases (was Re: [SCHEMA] Can two different LDAP AttributeType's have the same name?)
Date Fri, 06 Apr 2007 06:56:59 GMT
Ole,

Changing the topic ...

On 4/6/07, Ole Ersoy <ole.ersoy@gmail.com> wrote:
>
>
> Each Entry has a set of ObjectClasses associated with it.
> Those object classes determine the set of AttributeTypes that
> the entry can have.
>
> What I'm wondering about is when I look up a value of an entry,
> does ApacheDS pass say:
>
> org.apache.tuscany.DASConfig.baseDN as the key to look up the value
> of this attribute?
>
> Or does ApacheDS create a proxy key?
>
> Like this org.apache.tuscany.DASConfig.baseDN = 1 for example.
>
> And then when it stores the attribute it knows that the
> AttributeType org.apache.tuscany.DASConfig.baseDN corresponds to 1,
> so rather than storing
>
> org.apache.tuscany.DASConfig.baseDN
>
> 200M times, it stores the 1 instead.  So the 1 is the proxy
> for
>
> org.apache.tuscany.DASConfig.baseDN
>
> and ApacheDS keeps a list of proxies like this for all the
> AttributeTypes that it has.
>
> Does that make sense?


Ok I understand now.  The server's default partition implementation based on
B+Trees (using jdbm)
does something similar to this but not exactly.  Let me explain:

First somewhat related the partition assigns a surrogate key to all entries
instead of using it's DN
as the PK.  This prevents certain issues that can result when using a BTree
to index on the DN
like dealing with long key prefixes.  I think Emmanuel at some point had the
idea of storing the
DN in reverse within the index to avoid these problems.

Second for attributeTypes the JdbmPartition uses the OID as the PK for the
attribute rather than
using the alias of the attribute.  When you supply the server a filter like
(cn=   Ole     Ersoy)
(*NOTE* the case and the extra spaces) the server parses the filter into a
AST (which is one node)
and normalizes the attributeType in this node's attribute value assertion:
the cn=   Ole     Ersoy
becomes 2.5.4.3=ole ersoy.

cn                    => 2.5.4.3
'   Ole    Ersoy'  => 'ole ersoy'

This is done by the normalizationServer (an interceptor) before reaching the
partition.  Then when
the partition receives this normalized filter it's search engine will check
and see if an index exists
for 2.5.4.2.  If one does not exist then all the entries are pulled from a
master table and a full
scan occurs where each entry's 2.5.4.3 attribute is looked up, the value is
normalized and compared
against the value in the filter using the comparator associated with the
EQUALITY matchingRule of
the attributeType 2.5.4.3.

If an index does exist you're a very lucky user :).  The search engine then
looks into the 2.5.4.3
index using the key 'ole ersoy'.  Indices by the way store normalized values
as keys (using the normalizer
of the EQUALITY matchingRule associated with the attributeaType the index is
built on) and the
entry ID as values.  With a BTree this is almost a constant operation (log
n).   Oh and the keys into the
index as well as values are sorted using the comparator of the ORDERING
matchingRule for the
attributeType the index is built on.  So once the value is found the
partition recovers the entry id and
pulls out the entry from the master table.  Then it advances to the next
value that equals 'ole ersoy'
and does the same until all 'ole ersoy' index records have been returned.

So long answer this time shows that the server sort of uses a surrogate key
for attributeTypes and that
is the OID for the attributeType.  Rather the server uses the OID as the PK
would be a correct statement
since the OID is not a surrogate (derived/artificial) key.

Now for the objectClass attribute's values in entries:  By default the
JdbmPartition comes with an index
on this very special attributeType.  If it did not we'd be hosed.  When
adding new entries the ORDERING
matchingRule for the objectClass attributeType is used and this matchingRule
will transform the values for
the objectClass attribute into the OID of the objectClasses for that entry.


Emmanuel correct me if I am wrong here.

The entry's objectClass values when stored in the master table are not
touched.  It is stored as is
so it can be returned as the user added it with case variance.  If we did
not do this then the server would
return normalized values for the objectClass.  In fact no attributes of an
entry in the master table are
normalized: not objectClass, not cn, not anything.  So what is stored in the
master table is the entry as
it was supplied or modified.  The objectClass index however will contain for
the index record key
normalized values of the objectClass values which are the OIDs of the
objectClasses.

Here too the server uses the OID as the PK (again this is really not a
surrogate key after all).

Now the big question is what impact will there be if we used a real
surrogate key instead?  First of we
would not use String based comparators but would use Integer comparators for
both objectClasses and
attributeTypes assigned through some scheme.  Let's consider each scenario
separately.

Using Surrogate Keys for AttributeTypes
----------------------------------------------------------

If this is done then the server must manage a persistent mechanism to
translate alias names and OIDs to
the surrogate key consistently across restarts.  This table can be
immediately on start up loaded into memory
and written to disk on change like a write through cache.  We would still
have to store entries in their 'user
provided' form where non of the values of attributes or the names of the
attributes for that matter are normalized.
This is to ensure users get back entries as they put them into the server.
When preparing a filter for normalization
the attributes in the attribute value assertion (ava) would need to be
normalized into the surrogate key instead of
the OID.  The value would be processed in the original manner it was handled
in before.

When finding indices for attributes in an AVA we would look them up via the
surrogate key for the attributeType.
Then all operations would proceed as usual.  There is no space conservation
advantage here while incurring an
extra in memory lookup to transform the OID/alias into a surrogate key.
Plus there is the overhead in memory
of maintain this OID-alias to surrogate key mapping.

Conclusion: not worth doing for any reason at all.

Using Surrogate Keys for ObjectClasses
-----------------------------------------------------------

Again the entry is stored as-is in the master table with all the original
values for the objectClass attribute in the
entry.  The objectClass index instead of storing tuples like:

 OID : Entry ID
 (2.5.6.3, 96)

Will now use the surrogate key for the OID and look like:

SKey : Entry ID
( 563, 96 )

If 563 is the surrogate key assigned to the objectClass organization (with
OID 2.5.6.3).

BTW there is no need to normalize filters here so their attributes us the
SK.  The objectClass attribute
is just a very special case.

This will save some space overhead.  It might even result in a slightly
faster lookup within the btree because
there are less bytes to compare in most cases with the integer based
surrogate key than with the OID string's
bytes.  Actually now that I think about it this might be much faster since
the OID String needs a byte[]->String
transform to be properly compared.  The BTree uses a fast StringComparator
for this but still it will cost more
than using an IntegerComparator in all cases.

The space conservation on this index is high but the overall conservation in
the partition is not going to be
much.  The performance impact to search expressions based on the objectClass
attribute would be much
faster.  But keep in mind that it will only occur with this attribute yet
this is a common attribute use in most
search operations.  So it might be worth while using.  An experiment might
be in order here.


High Level Impact
--------------------------

For attributeTypes it's a very bad idea to use SKs for the attributeType
instead of the OID.  First of all
the normalization of the filter AVA's attribute to the SK will make the
filter unintelligible to anything but
the JdbmPartition implementation which knows how to handle these search
operations.  Every partition
implementation would then be tasked with doing this same thing to interpret
the filter's attributes.

For objectClasses the impact could be significant.  However the partition
implementation would need
to map objectClasses to assigned SKs and do it consistently across
restarts.  As it stands now the
partition is designed to not have to know anything other than about
attributeTypes, their syntaxes and
matchingRules to properly conduct CRUD operations on entries.  Having to do
this means changing the
jdbm implementation a tiny bit.  Technically the Partition interface need
not change since the
ObjectClassRegistry can be accessed from the init() stage when starting up
the partition via the
server configuration object.

Conclusion
----------------

The use of SKs for ATs is a bad idea in all aspects without much benefit.
The use of SKs for OCs
might have some space conservation but not major.  The performance advantage
is questionable and
requires solid performance metrics to determine the true value of such a
'one off' being added to the
server.  Is the complexity worth the performance boost is the question to
answer.

Alex

Mime
View raw message