directory-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From conflue...@apache.org
Subject [CONF] Apache Directory Server v1.5 > Index and IndexEntry
Date Sat, 25 Jun 2011 14:48:00 GMT
<html>
<head>
    <base href="https://cwiki.apache.org/confluence">
            <link rel="stylesheet" href="/confluence/s/2042/9/1/_/styles/combined.css?spaceKey=DIRxSRVx11&amp;forWysiwyg=true"
type="text/css">
    </head>
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
    <h2><a href="https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry">Index
and IndexEntry</a></h2>
    <h4>Page <b>edited</b> by             <a href="https://cwiki.apache.org/confluence/display/~elecharny">Emmanuel
L├ęcharny</a>
    </h4>
        <br/>
                         <h4>Changes (1)</h4>
                                 
    
<div id="page-diffs">
                    <table class="diff" cellpadding="0" cellspacing="0">
    
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" >| entryCSN | String (CSN) | Forbidden
| Stores the relation between a CSN and the associated entry ID | <br> <br></td></tr>
            <tr><td class="diff-added-lines" style="background-color: #dfd;">Those
index are exposed in full details in the [Structure and Organization] chapter. <br>
<br></td></tr>
            <tr><td class="diff-unchanged" >{note} <br>The Alias, oneAlias
and subAlias indexes are probably useless. <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
    
            </table>
    </div>                            <h4>Full Content</h4>
                    <div class="notificationGreySide">
        <h1><a name="IndexandIndexEntry-IndexAbstraction"></a>Index Abstraction</h1>

<p>An Index is a bidirectional LUT (Look-Ups Table) which sorts keys (forward index)
and values (reverse index).  It can be built on any physical attribute or a computed characteristic
of an entry. </p>

<p>Index Tuples contain a value for the attribute and the entry identifier which represents
a pointer into the MasterTable.  This way lookups into the master table for entries with a
specific attribute value are fast.  For example, when conducting a search for all entries
where the <b>foo</b> attribute is equal to the String 'bar', the search engine
looks first to see if an index is available for <b>foo</b>.  If an Index on <b>foo</b>
exists then it is used to quickly list the identifiers of all entries with attribute <b>foo</b>
equal to 'bar'.  This lookup occurs in almost constant time and the Index can be walked to
only return those identifiers over the Tuples with key set to 'bar'. The following schema
expose the relations between all those elements :</p>


<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/foo-index.png?version=1&amp;modificationDate=1308747662866"
style="border: 1px solid black" /></span></p>


<p>Without an Index on <b>foo</b>, the following sequence of events must
occur on each entry in the master table:</p>

<ol>
	<li>the entry must be read from the MasterTable as a byte[]</li>
	<li>the entry must then be deserialized</li>
	<li>the <b>foo</b> attribute must be extracted</li>
	<li>each value of the <b>foo</b> attribute must be compared to see if it
equals 'bar'</li>
</ol>


<p>As you can see this is an expensive proposition. With very large directories search
may take a very long time or may return without the intended entries because time limits are
reached. Indices are critical for increasing search performance.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>Even if we have all the entries
in the cache, this is still an O&#40;n&#41; operation. We just spare the disk access
and the deserialization phase.</td></tr></table></div>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>LDAP amortizes the expense of
building indices at add, modify and delete times in return for faster search performance.
 LDAP servers for this reason are highly indexed and are better suited for read intensive
applications.</td></tr></table></div>

<h2><a name="IndexandIndexEntry-IndexForwardandReversetables"></a>Index
Forward and Reverse tables</h2>

<p>At the moment, an Index is implemented using 2 tables :</p>
<ul>
	<li>The Forward table, which maps the key to a list of entry's ID in the master table</li>
	<li>The Reverse table, which maps the entry's ID to the keys</li>
</ul>


<p>The forward table is used to retrieve the entry from the MasterTable, when we only
have a key. The key can be associate with more than one entry, so we will get back a cursor
on the associated entry's ID when we do a lookup.</p>

<p>The reverse index is used when we do a search on a combination of attributes : (&amp;(cn=test)(sn=acme)).
If we suppose that the selected primary index will be the <b>CN</b> index, then
we will pull a list of entry's ID. Now, using the <b>SN</b> reverse index, we
can avoid pulling each entry from the MasterTable, simply because we can check in the <b>SN</b>
reverse table that the selected IDs are present or not.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>This might be superfluous : it's
enough to pull the <b>SN</b> entry's ID for the <b>acme</b> value,
and do an intersection with the first set of entry's ID. The biggest advantage would be that
we will speed up the modification operations if we were to remove those reverse tables.</td></tr></table></div>

<h2><a name="IndexandIndexEntry-IndexCursor%27sGetIndexEntryObjects"></a>Index
Cursor's Get IndexEntry Objects</h2>

<p>Index Cursors, unlike Tables that return Tuples, return IndexEntry objects instead
on get() operations after positioning the Cursor. Indices always use a Long identifier into
the master table for forward LUT keys and reverse LUT values. So instead of returning Cursors
over two kinds of Tuples over these LUTs, their Cursors return IndexEntry objects which swap
Tuple key/value pairs where appropriate.  </p>

<p>For example, let's presume the <b>foo</b> index contains String values
for the foo attributeType.  The forward LUT will map Strings representing foo attribute values
like 'bar' to Long identifiers pointing to entries in the MasterTable.  The reverse LUT will
map Long identifiers pointing to entries in the MasterTable to Strings representing values
of the foo attribute in that entry.  The Tuples of these LUT's look like so:</p>

<ul>
	<li>Forward LUT =&gt; Tuple&lt;String,Long&gt;</li>
	<li>Reverse LUT =&gt; Tuple&lt;Long,String&gt;</li>
</ul>


<p>Index Cursors walk these LUT's using Table Cursors that return Tuples. Regardless
of whether the forward LUT or the reverse LUT is used, Tuples are transduced into IndexEntry
objects which always return the Long identifier into the master table on calls to getId()
and the attribute value String (or whatever - really a generic type - String is used here
for the foo attribute example) on calls to getValue().  This way Index Cursors do not care
if the underlying Tuple Cursor walks the forward LUT or the reverse LUT. </p>

<h2><a name="IndexandIndexEntry-Indexusedintheserver"></a>Index used in
the server</h2>

<p>We used two distinct kind of index in the server :</p>
<ul>
	<li>system indexes</li>
	<li>user indexes</li>
</ul>


<p>The <b>system</b> indexes are those we create at startup. An administrator
can modify some of their characteristics, but basically, they will always exist. The list
of existing System index is given in the following table :</p>

<div class='table-wrap'>
<table class='confluenceTable'><tbody>
<tr>
<th class='confluenceTh'> Index </th>
<th class='confluenceTh'> Key </th>
<th class='confluenceTh'> duplicates </th>
<th class='confluenceTh'> Description </th>
</tr>
<tr>
<td class='confluenceTd'> ObjectClass </td>
<td class='confluenceTd'> String </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> Associate an ObjectClass to a list of entries ID </td>
</tr>
<tr>
<td class='confluenceTd'> Rdn </td>
<td class='confluenceTd'> ParentIdAndRdn </td>
<td class='confluenceTd'> Forbidden </td>
<td class='confluenceTd'> Stores the relation between a child entry and its parent in
the DIT. The key is a compound element built using the entry's RDN and the entry ID. </td>
</tr>
<tr>
<td class='confluenceTd'> oneLevel </td>
<td class='confluenceTd'> Long (ID) </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> An index storing a relation between an entry and all its children
</td>
</tr>
<tr>
<td class='confluenceTd'> subLevel </td>
<td class='confluenceTd'> Long (ID) </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> An index storing a relation between an entry and all its children,
recursively </td>
</tr>
<tr>
<td class='confluenceTd'> presence </td>
<td class='confluenceTd'> String </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> An index storing a relation between an AttributeType name
and the list of entry's ID having this AttributeType </td>
</tr>
<tr>
<td class='confluenceTd'> alias </td>
<td class='confluenceTd'> String </td>
<td class='confluenceTd'> Forbidden </td>
<td class='confluenceTd'> Stores the relation between an Alias and the targetted entry
ID </td>
</tr>
<tr>
<td class='confluenceTd'> oneAlias </td>
<td class='confluenceTd'> Long (ID) </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> Stores the list of aliases under an alias </td>
</tr>
<tr>
<td class='confluenceTd'> subAlias </td>
<td class='confluenceTd'> Long (ID) </td>
<td class='confluenceTd'> Allowed </td>
<td class='confluenceTd'> Stores the list of aliases under an alias with no limit in
depth </td>
</tr>
<tr>
<td class='confluenceTd'> entryUUID </td>
<td class='confluenceTd'> String (UUID) </td>
<td class='confluenceTd'> Forbidden </td>
<td class='confluenceTd'> Stores the relation between an UUID and the associated entry
ID </td>
</tr>
<tr>
<td class='confluenceTd'> entryCSN </td>
<td class='confluenceTd'> String (CSN) </td>
<td class='confluenceTd'> Forbidden </td>
<td class='confluenceTd'> Stores the relation between a CSN and the associated entry
ID </td>
</tr>
</tbody></table>
</div>


<p>Those index are exposed in full details in the <a href="/confluence/display/DIRxSRVx11/Structure+and+Organization"
title="Structure and Organization">Structure and Organization</a> chapter.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>The Alias, oneAlias and subAlias
indexes are probably useless.</td></tr></table></div>


<h2><a name="IndexandIndexEntry-ExampleDirectoryInformationTree%28DIT%29"></a>Example
Directory Information Tree (DIT)</h2>

<p>We'll use this example DIT for the purpose of our discussions around indices. </p>

<table class="sectionMacro" border="0" cellpadding="5" cellspacing="0" width="100%"><tbody><tr>
<td class="confluenceTd" valign="top">
<p><a href="http://cwiki.apache.org/confluence/download/attachments/80384/example_ldif.png"
class="external-link" rel="nofollow"><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_tree.png?version=1&amp;modificationDate=1206064430000"
style="border: 0px solid black" /></span></a></p></td>
<td class="confluenceTd" valign="top">
<p>Here we have an example tree where the nodes are labeled with their user provided
relative distinguished names at the point of creation.  The numbers in the nodes represent
the order of creation and the unique identifiers of these entries representing the Tuple keys
in the master table. We'll use this example and modified versions of it throughout this document.
So let's take a look at how the MasterTable would look but presume the distinguished names
to the right represent the full serialized entry as an array of bytes.<br/>
<span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_master.png?version=1&amp;modificationDate=1206066414000"
style="border: 0px solid black" /></span></p></td></tr></tbody></table>

<h2><a name="IndexandIndexEntry-IndexNormalization%2CandMatching"></a>Index
Normalization, and Matching</h2>

<p>An Index can be built on any attribute. For example we can build indices on the ou,
objectClass, and cn attributes in our example.  Here's what the forward lookup tables of these
indices would contain:</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>Remember an Index is bidirectional!
 This means we can lookup all the entry identifiers representing the set of entries with value
'bar' for the attribute <b>foo</b> using the forward LUT in the Index.  We can
also use the reverse LUT to lookup an id, to see all the values of attribute <b>foo</b>
an entry may contain.</td></tr></table></div>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>The previous note may be void
in ADS 2.0 : we don't necessarily need a reverse LUT.</td></tr></table></div>


<table class="sectionMacro" border="0" cellpadding="5" cellspacing="0" width="100%"><tbody><tr>
<td class="confluenceTd" valign="top">
<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_user_indices.png?version=1&amp;modificationDate=1206155368000"
style="border: 0px solid black" /></span></p>

<div class='table-wrap'>
<table class='confluenceTable'><tbody>
<tr>
<th class='confluenceTh'>OID</th>
<th class='confluenceTh'>Alias</th>
</tr>
<tr>
<td class='confluenceTd'> 2.5.6.0 </td>
<td class='confluenceTd'> top </td>
</tr>
<tr>
<td class='confluenceTd'> 2.5.6.4 </td>
<td class='confluenceTd'> organization </td>
</tr>
<tr>
<td class='confluenceTd'> 2.5.6.5 </td>
<td class='confluenceTd'> organizationalUnit </td>
</tr>
<tr>
<td class='confluenceTd'> 2.5.6.6 </td>
<td class='confluenceTd'> person </td>
</tr>
<tr>
<td class='confluenceTd'> 2.5.6.7 </td>
<td class='confluenceTd'> organizationalPerson </td>
</tr>
</tbody></table>
</div>


</td>
<td class="confluenceTd" valign="top">
<p>The index keys contain attribute values, and the values of the index contain entry
identifiers which refer back into the master table. Notice the values of attributes in these
indices have been transformed.  This transformation is referred to as normalization or canonicalization.
 This means the value has been reduced to a canonical form from where variations like for
example white space and case, or aliases have been removed. Other forms of normalization besides
this textual normalization may exist.  All variations of an attribute's value considered to
be equal are reduce down to this common unambiguous form.</p>

<p>The transformation is governed by the <b>EqualityMatch</b> matchingRules
associated with the attributeType. You'll notice for example the <b>ou</b> and
<b>cn</b> attribute matchingRules result in transformation that remove whitespace
variance while preserving tokenization: meaning we take out whitespace by reducing multiple
whitespace characters into a single space.  Also the case variance is also eliminated: all
values are reduced to lowercase.  Note this is possible because both <b>ou</b>
and <b>cn</b> are case insensitive since they use a matchingRule which ignores
case: their equality matchingRule is caseIgnoreMatch.</p>

<p>Attribute matchingRules may not all be this trivial.  For example the attribute may
have a complex binary syntax and there might be some intricate operation required to reduce
the value into a canonical representation.  At other times normalization may involve transforming
ambiguous representations into a unique identifier that can never have variance.  The <b>objectClass</b>
Index is a good example of this.  The objectClass attribute lists the objectClasses associated
with an entry.  Unfortunately LDAP allows alias names for objectClasses as well as other entities
like attributeTypes (i.e. cn verses commonName).  These aliases are human readable representations
but they can have case variance as well and any alias can be used.  Most schema entities in
LDAP have an Object IDentifier or OID.  OID's are by default canonical.  As you can see this
is the canonical representation used in the <b>objectClass</b> index.  Incidentally,
users are even allowed to use OID's for the value of the objectClass attribute if they wanted
to.  </p>

<p>Notice Index Tuple keys are all sorted and so are the Tuple values of the same key.
 Indices use matchingRules as well to order both keys and values based on matchingRules in
both the forward and reverse LUTs.  </p>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>If you look at the objectClass
index you'll see that the OID (2.5.6.0) for the 'top' objectClass is not included in the index.
 This is because every entry will contain 'top' so there's no point to bloating the index
including it as a value. If we need to walk all the entries in the server we can walk the
MasterTable or one of the system indices like the RDN or presence indices described <a
href="/confluence/display/DIRxSRVx11/Structure+and+Organization" title="Structure and Organization">here</a>.</td></tr></table></div></td></tr></tbody></table>

<h2><a name="IndexandIndexEntry-MatchingRules%2CComparatorsandNormalizers"></a>MatchingRules,
Comparators and Normalizers</h2>

<p>LDAP defines 3 different kinds of matchingRules for attributeTypes: equality, ordering
and substring.  The equality matchingRule governs how attribute values are compared for equality.
 Ordering matchingRules control how attribute values are compared to see if one value is greater
or less than another.  The substring matching rule manages how embedded fields in larger values
can be matched for patterns.  Most LDAP servers have different kinds of indices you can create
for each of these kinds of matching techniques.  Sometimes there's more for example some servers
have an approximate match index, which uses the soundex algorithm or a derivative to determine
matches.</p>

<p>ApacheDS has only one type of index and makes some trade offs in space, time and
simplicity while applying some common sense.  ApacheDS functionally models a matchingRule
as a Java Comparator coupled with a Normalizer.  The Normalizer applies the transformation
required to produce a canonical form.  The Comparator performs checks to see if values are
equal, greater or less than one another.  ApacheDS Index objects use the equality matchingRule's
Normalizer to create index Tuples and the ordering matchingRule's Comparator (only if defined)
to determine insertion order into Index LUT BTrees.  If there exists no ordering matchingRule
for the attributeType then the Comparator for the equality matchingRule is used instead.</p>

<p>If a search uses an equality assertion with the index attribute lookups return the
answers we need. If a search filter uses a substring assertion then depending on the pattern
the substring matchingRules are used (only if defined for the attributeType) with a bit more
processing using regular expressions to match normalized values on index walks. When &gt;
or &lt; filter operators are used, and a ordering matchingRule is defined for the attributeType
then that matchingRule's comparator is used to compare values on Index walks with Cursors.
Also because of these choices ApacheDS indices are bi-directional and so ApacheDS pays a price
on write operations to maintain both forward and reverse LUTs.</p>

<p>ApacheDS does not support approximate match since this kind of matching is language
sensitive and just a pile of horse poop.  For this reason approximate match does not work
with internationalization. No server has yet to satisfy my search when I've tried to phonetically
find names using approximate match.</p>

<p>So ApacheDS made some choices to stay simple.  We applied a pattern of thinking where
trade offs were made to keep the most common operations fastest while the least common operations
may suffer slightly.  Others like approximate match are essentially ignored.  ApacheDS processes
approximate match based filter terms as if they were equality assertions.</p>


<h1><a name="IndexandIndexEntry-JDBMBaseIndexImplementation"></a>JDBM Base
Index Implementation</h1>

<p>The JdbmIndex uses two JdbmTables each backed by it's own JDBM BTree in the same
file managed by a JDBM RecordManager for the file.  These files end with the .db file extension
and use the first alias name of the attributeType for the file name.  The JdbmIndex leverages
these JdbmTables for preforming forward reverse lookup operations to assert values as well
as for generating Cursors over the Tuples of these Tables.</p>

<p>JdbmIndex instances require schema information to properly initialize their LUTs
with the right Comparators.  JdbmIndex uses operations that write Tuples into these two JdbmTables
normalized values before inserting them.  Attribute values inserted into these JdbmTables
are normalized for speed and efficiency.  If these values were not normalized, the Cursors
walking these JdbmTables and lookup operations would need to normalize then compare.  Search
operations are heavily dependent on Index Cursors and lookup operations and normalization
is an expensive process.  Why repeat normalization and pay on each search operation when you
can pay the price once on write operations.  After all, this is LDAP, and LDAP is read optimized.</p>

<h2><a name="IndexandIndexEntry-PossibleRepeatNormalizationInefficiency"></a>Possible
Repeat Normalization Inefficiency</h2>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>This paragraph may be totally
obsolete. To be checked.</td></tr></table></div>

<p>JdbmIndex tries to protect against lookups without normalization by normalizing lookup
parameters before consulting the two JdbmTable based LUTs.  It also maintains a cache of normalizations
it has performed recently.  The cache in particular may be a waste since Normalizers also
try to cache recent normalization operations, and these normalizers are reused by the system.
 So we may be needlessly double caching and wasting memory.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>I'm not sure we cache normalized
values at all. To be double checked.</td></tr></table></div>

<p>Another possible problem is that JdbmIndex instance may not need to normalize attribute
values.  There were several discussions about eagerly normalizing attribute values in entries
especially with the move to the ServerEntry API. If this happens then we'll potentially have
3 copies of the same normalization result in memory during an operation.  Furthermore this
removes the need to cache at all.  A normalization accounting review is required to make sure
we're not wasting extra cycles and memory while defining the correct barriers to protect against
non-normalized attribute values.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>ATM, all the values used in an
index <b>are</b> normalized. This is done early in the process.</td></tr></table></div>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td><b>Eager Normalization Optimizations</b><br
/>When partitions begin to expose access to Index information, the server now is privy
to information it can use to efficiently conduct eager attribute value normalization in ServerEntry
objects.  At the normalization barrier, which will most likely be in the NormalizationInterceptor,
the server can determine the target partition the entry is associated with.  Knowing the target
partition, the server can check which attributes are indexed by the partition and pre-normalize
them.  Note that the user provided value is never lost, the ServerEntry stores both normalized
and user provided forms. This way value normalization and caching requirements are lifted
from partition indices.</td></tr></table></div>

    </div>
        <div id="commentsSection" class="wiki-content pageSection">
        <div style="float: right;">
            <a href="https://cwiki.apache.org/confluence/users/viewnotifications.action"
class="grey">Change Notification Preferences</a>
        </div>
        <a href="https://cwiki.apache.org/confluence/display/DIRxSRVx11/Index+and+IndexEntry">View
Online</a>
        |
        <a href="https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=80384&revisedVersion=38&originalVersion=37">View
Changes</a>
            </div>
</div>
</div>
</div>
</div>
</body>
</html>

Mime
View raw message