Ole,

Changing the topic ...

On 4/6/07, Ole Ersoy <ole.ersoy@gmail.com> wrote:

Each Entry has a set of ObjectClasses associated with it.
Those object classes determine the set of AttributeTypes that
the entry can have.

What I'm wondering about is when I look up a value of an entry,
does ApacheDS pass say:

org.apache.tuscany.DASConfig.baseDN as the key to look up the value
of this attribute?

Or does ApacheDS create a proxy key?

Like this org.apache.tuscany.DASConfig.baseDN = 1 for example.

And then when it stores the attribute it knows that the
AttributeType org.apache.tuscany.DASConfig.baseDN corresponds to 1,
so rather than storing

org.apache.tuscany.DASConfig.baseDN

200M times, it stores the 1 instead.  So the 1 is the proxy
for

org.apache.tuscany.DASConfig.baseDN

and ApacheDS keeps a list of proxies like this for all the
AttributeTypes that it has.

Does that make sense?

Ok I understand now.  The server's default partition implementation based on B+Trees (using jdbm)
does something similar to this but not exactly.  Let me explain:

First somewhat related the partition assigns a surrogate key to all entries instead of using it's DN
as the PK.  This prevents certain issues that can result when using a BTree to index on the DN
like dealing with long key prefixes.  I think Emmanuel at some point had the idea of storing the
DN in reverse within the index to avoid these problems. 

Second for attributeTypes the JdbmPartition uses the OID as the PK for the attribute rather than
using the alias of the attribute.  When you supply the server a filter like (cn=   Ole     Ersoy)
(*NOTE* the case and the extra spaces) the server parses the filter into a AST (which is one node)
and normalizes the attributeType in this node's attribute value assertion: the cn=   Ole     Ersoy
becomes 2.5.4.3=ole ersoy.

cn                    => 2.5.4.3
'   Ole    Ersoy'  => 'ole ersoy'

This is done by the normalizationServer (an interceptor) before reaching the partition.  Then when
the partition receives this normalized filter it's search engine will check and see if an index exists
for 2.5.4.2.  If one does not exist then all the entries are pulled from a master table and a full
scan occurs where each entry's 2.5.4.3 attribute is looked up, the value is normalized and compared
against the value in the filter using the comparator associated with the EQUALITY matchingRule of
the attributeType 2.5.4.3.

If an index does exist you're a very lucky user :).  The search engine then looks into the 2.5.4.3
index using the key 'ole ersoy'.  Indices by the way store normalized values as keys (using the normalizer
of the EQUALITY matchingRule associated with the attributeaType the index is built on) and the
entry ID as values.  With a BTree this is almost a constant operation (log n).   Oh and the keys into the
index as well as values are sorted using the comparator of the ORDERING matchingRule for the
attributeType the index is built on.  So once the value is found the partition recovers the entry id and
pulls out the entry from the master table.  Then it advances to the next value that equals 'ole ersoy'
and does the same until all 'ole ersoy' index records have been returned.

So long answer this time shows that the server sort of uses a surrogate key for attributeTypes and that
is the OID for the attributeType.  Rather the server uses the OID as the PK would be a correct statement
since the OID is not a surrogate (derived/artificial) key.

Now for the objectClass attribute's values in entries:  By default the JdbmPartition comes with an index
on this very special attributeType.  If it did not we'd be hosed.  When adding new entries the ORDERING
matchingRule for the objectClass attributeType is used and this matchingRule will transform the values for
the objectClass attribute into the OID of the objectClasses for that entry. 

Emmanuel correct me if I am wrong here.

The entry's objectClass values when stored in the master table are not touched.  It is stored as is
so it can be returned as the user added it with case variance.  If we did not do this then the server would
return normalized values for the objectClass.  In fact no attributes of an entry in the master table are
normalized: not objectClass, not cn, not anything.  So what is stored in the master table is the entry as
it was supplied or modified.  The objectClass index however will contain for the index record key
normalized values of the objectClass values which are the OIDs of the objectClasses. 

Here too the server uses the OID as the PK (again this is really not a surrogate key after all).

Now the big question is what impact will there be if we used a real surrogate key instead?  First of we
would not use String based comparators but would use Integer comparators for both objectClasses and
attributeTypes assigned through some scheme.  Let's consider each scenario separately.

Using Surrogate Keys for AttributeTypes
----------------------------------------------------------

If this is done then the server must manage a persistent mechanism to translate alias names and OIDs to
the surrogate key consistently across restarts.  This table can be immediately on start up loaded into memory
and written to disk on change like a write through cache.  We would still have to store entries in their 'user
provided' form where non of the values of attributes or the names of the attributes for that matter are normalized.
This is to ensure users get back entries as they put them into the server.  When preparing a filter for normalization
the attributes in the attribute value assertion (ava) would need to be normalized into the surrogate key instead of
the OID.  The value would be processed in the original manner it was handled in before.

When finding indices for attributes in an AVA we would look them up via the surrogate key for the attributeType.
Then all operations would proceed as usual.  There is no space conservation advantage here while incurring an
extra in memory lookup to transform the OID/alias into a surrogate key.  Plus there is the overhead in memory
of maintain this OID-alias to surrogate key mapping.

Conclusion: not worth doing for any reason at all.

Using Surrogate Keys for ObjectClasses
-----------------------------------------------------------

Again the entry is stored as-is in the master table with all the original values for the objectClass attribute in the
entry.  The objectClass index instead of storing tuples like:
 
 OID : Entry ID
 ( 2.5.6.3, 96)

Will now use the surrogate key for the OID and look like:

SKey : Entry ID
( 563, 96 )

If 563 is the surrogate key assigned to the objectClass organization (with OID 2.5.6.3).

BTW there is no need to normalize filters here so their attributes us the SK.  The objectClass attribute
is just a very special case.

This will save some space overhead.  It might even result in a slightly faster lookup within the btree because
there are less bytes to compare in most cases with the integer based surrogate key than with the OID string's
bytes.  Actually now that I think about it this might be much faster since the OID String needs a byte[]->String
transform to be properly compared.  The BTree uses a fast StringComparator for this but still it will cost more
than using an IntegerComparator in all cases.

The space conservation on this index is high but the overall conservation in the partition is not going to be
much.  The performance impact to search expressions based on the objectClass attribute would be much
faster.  But keep in mind that it will only occur with this attribute yet this is a common attribute use in most
search operations.  So it might be worth while using.  An experiment might be in order here.


High Level Impact
--------------------------

For attributeTypes it's a very bad idea to use SKs for the attributeType instead of the OID.  First of all
the normalization of the filter AVA's attribute to the SK will make the filter unintelligible to anything but
the JdbmPartition implementation which knows how to handle these search operations.  Every partition
implementation would then be tasked with doing this same thing to interpret the filter's attributes.

For objectClasses the impact could be significant.  However the partition implementation would need
to map objectClasses to assigned SKs and do it consistently across restarts.  As it stands now the
partition is designed to not have to know anything other than about attributeTypes, their syntaxes and
matchingRules to properly conduct CRUD operations on entries.  Having to do this means changing the
jdbm implementation a tiny bit.  Technically the Partition interface need not change since the
ObjectClassRegistry can be accessed from the init() stage when starting up the partition via the
server configuration object.

Conclusion
----------------

The use of SKs for ATs is a bad idea in all aspects without much benefit.   The use of SKs for OCs
might have some space conservation but not major.  The performance advantage is questionable and
requires solid performance metrics to determine the true value of such a 'one off' being added to the
server.  Is the complexity worth the performance boost is the question to answer.

Alex