directory-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
Subject [CONF] Apache Directory Server v1.5 > Index and IndexEntry
Date Thu, 23 Jun 2011 11:33:00 GMT
    <base href="">
            <link rel="stylesheet" href="/confluence/s/2042/9/1/_/styles/combined.css?spaceKey=DIRxSRVx11&amp;forWysiwyg=true"
<body style="background: white;" bgcolor="white" class="email-body">
<div id="pageContent">
<div id="notificationFormat">
<div class="wiki-content">
<div class="email">
    <h2><a href="">Index
and IndexEntry</a></h2>
    <h4>Page <b>edited</b> by             <a href="">Emmanuel
                         <h4>Changes (1)</h4>
<div id="page-diffs">
                    <table class="diff" cellpadding="0" cellspacing="0">
            <tr><td class="diff-snipped" >...<br></td></tr>
            <tr><td class="diff-unchanged" > <br>{note} <br></td></tr>
            <tr><td class="diff-changed-lines" >Even if we have all the entries
in the cache, this is still an <span class="diff-changed-words">O<span class="diff-added-chars"style="background-color:
#dfd;">\</span>(n<span class="diff-added-chars"style="background-color: #dfd;">\</span>)</span>
operation. We just spare the disk access and the deserialization phase. <br></td></tr>
            <tr><td class="diff-unchanged" >{note} <br> <br></td></tr>
            <tr><td class="diff-snipped" >...<br></td></tr>
    </div>                            <h4>Full Content</h4>
                    <div class="notificationGreySide">
        <h1><a name="IndexandIndexEntry-IndexAbstraction"></a>Index Abstraction</h1>

<p>An Index is a bidirectional LUT (Look-Ups Table) which sorts keys (forward index)
and values (reverse index).  It can be built on any physical attribute or a computed characteristic
of an entry. </p>

<p>Index Tuples contain a value for the attribute and the entry identifier which represents
a pointer into the MasterTable.  This way lookups into the master table for entries with a
specific attribute value are fast.  For example, when conducting a search for all entries
where the <b>foo</b> attribute is equal to the String 'bar', the search engine
looks first to see if an index is available for <b>foo</b>.  If an Index on <b>foo</b>
exists then it is used to quickly list the identifiers of all entries with attribute <b>foo</b>
equal to 'bar'.  This lookup occurs in almost constant time and the Index can be walked to
only return those identifiers over the Tuples with key set to 'bar'. The following schema
expose the relations between all those elements :</p>

<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/foo-index.png?version=1&amp;modificationDate=1308747662866"
style="border: 1px solid black" /></span></p>

<p>Without an Index on <b>foo</b>, the following sequence of events must
occur on each entry in the master table:</p>

	<li>the entry must be read from the MasterTable as a byte[]</li>
	<li>the entry must then be deserialized</li>
	<li>the <b>foo</b> attribute must be extracted</li>
	<li>each value of the <b>foo</b> attribute must be compared to see if it
equals 'bar'</li>

<p>As you can see this is an expensive proposition. With very large directories search
may take a very long time or may return without the intended entries because time limits are
reached. Indices are critical for increasing search performance.</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>Even if we have all the entries
in the cache, this is still an O&#40;n&#41; operation. We just spare the disk access
and the deserialization phase.</td></tr></table></div>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>LDAP amortizes the expense of
building indices at add, modify and delete times in return for faster search performance.
 LDAP servers for this reason are highly indexed and are better suited for read intensive

<h2><a name="IndexandIndexEntry-IndexCursor%27sGetIndexEntryObjects"></a>Index
Cursor's Get IndexEntry Objects</h2>

<p>Index Cursors, unlike Tables that return Tuples, return IndexEntry objects instead
on get() operations after positioning the Cursor. Indices always use a Long identifier into
the master table for forward LUT keys and reverse LUT values. So instead of returning Cursors
over two kinds of Tuples over these LUTs, their Cursors return IndexEntry objects which swap
Tuple key/value pairs where appropriate.  </p>

<p>For example, let's presume the <b>foo</b> index contains String values
for the foo attributeType.  The forward LUT will map Strings representing foo attribute values
like 'bar' to Long identifiers pointing to entries in the MasterTable.  The reverse LUT will
map Long identifiers pointing to entries in the MasterTable to Strings representing values
of the foo attribute in that entry.  The Tuples of these LUT's look like so:</p>

	<li>Forward LUT =&gt; Tuple&lt;String,Long&gt;</li>
	<li>Reverse LUT =&gt; Tuple&lt;Long,String&gt;</li>

<p>Index Cursors walk these LUT's using Table Cursors that return Tuples. Regardless
of whether the forward LUT or the reverse LUT is used, Tuples are transduced into IndexEntry
objects which always return the Long identifier into the master table on calls to getId()
and the attribute value String (or whatever - really a generic type - String is used here
for the foo attribute example) on calls to getValue().  This way Index Cursors do not care
if the underlying Tuple Cursor walks the forward LUT or the reverse LUT. </p>

<h2><a name="IndexandIndexEntry-ExampleDirectoryInformationTree%28DIT%29"></a>Example
Directory Information Tree (DIT)</h2>

<p>We'll use this example DIT for the purpose of our discussions around indices. </p>

<table class="sectionMacro" border="0" cellpadding="5" cellspacing="0" width="100%"><tbody><tr>
<td class="confluenceTd" valign="top">
<p><a href=""
class="external-link" rel="nofollow"><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_tree.png?version=1&amp;modificationDate=1206064430000"
style="border: 0px solid black" /></span></a></p></td>
<td class="confluenceTd" valign="top">
<p>Here we have an example tree where the nodes are labeled with their user provided
relative distinguished names at the point of creation.  The numbers in the nodes represent
the order of creation and the unique identifiers of these entries representing the Tuple keys
in the master table. We'll use this example and modified versions of it throughout this document.
So let's take a look at how the MasterTable would look but presume the distinguished names
to the right represent the full serialized entry as an array of bytes.<br/>
<span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_master.png?version=1&amp;modificationDate=1206066414000"
style="border: 0px solid black" /></span></p></td></tr></tbody></table>

<h2><a name="IndexandIndexEntry-IndexNormalization%2CandMatching"></a>Index
Normalization, and Matching</h2>

<p>An Index can be built on any attribute. For example we can build indices on the ou,
objectClass, and cn attributes in our example.  Here's what the forward lookup tables of these
indices would contain:</p>

<div class='panelMacro'><table class='noteMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/warning.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>Remember an Index is bidirectional!
 This means we can lookup all the entry identifiers representing the set of entries with value
'bar' for the attribute <b>foo</b> using the forward LUT in the Index.  We can
also use the reverse LUT to lookup an id, to see all the values of attribute <b>foo</b>
an entry may contain.</td></tr></table></div>

<table class="sectionMacro" border="0" cellpadding="5" cellspacing="0" width="100%"><tbody><tr>
<td class="confluenceTd" valign="top">
<p><span class="image-wrap" style=""><img src="/confluence/download/attachments/80384/example_user_indices.png?version=1&amp;modificationDate=1206155368000"
style="border: 0px solid black" /></span></p>

<div class='table-wrap'>
<table class='confluenceTable'><tbody>
<th class='confluenceTh'>OID</th>
<th class='confluenceTh'>Alias</th>
<td class='confluenceTd'> </td>
<td class='confluenceTd'> top </td>
<td class='confluenceTd'> </td>
<td class='confluenceTd'> organization </td>
<td class='confluenceTd'> </td>
<td class='confluenceTd'> organizationalUnit </td>
<td class='confluenceTd'> </td>
<td class='confluenceTd'> person </td>
<td class='confluenceTd'> </td>
<td class='confluenceTd'> organizationalPerson </td>

<td class="confluenceTd" valign="top">
<p>The index keys contain attribute values, and the values of the index contain entry
identifiers which refer back into the master table. Notice the values of attributes in these
indices have been transformed.  This transformation is referred to as normalization or canonicalization.
 This means the value has been reduced to a from where variations like for example white space
and case, or aliases have been removed. Other forms of normalization besides this textual
normalization may exist.  All variations of an attribute's value considered to be equal are
reduce down to this common unambiguous form.</p>

<p>The transformation is governed by the equality match matchingRules associated with
the attributeType. You'll notice for example the <b>ou</b> and <b>cn</b>
attribute matchingRules result in transformation that remove whitespace variance while preserving
tokenization: meaning we take out whitespace by reducing multiple whitespace characters into
a single space.  Also the case variance is also eliminated: all values are reduced to lowercase.
 Note this is possible because both <b>ou</b> and <b>cn</b> are case
insensitive since they use a matchingRule which ignores case: their equality matchingRule
is caseIgnoreMatch.</p>

<p>Attribute matchingRules may not all be this trivial.  For example the attribute may
have a complex binary syntax and there might be some intricate operation required to reduce
the value into a canonical representation.  At other times normalization may involve transforming
ambiguous representations into a unique identifier that can never have variance.  The <b>objectClass</b>
Index is a good example of this.  The objectClass attribute lists the objectClasses associated
with an entry.  Unfortunately LDAP allows alias names for objectClasses as well as other entities
like attributeTypes (i.e. cn verses commonName).  These aliases are human readable representations
but they can have case variance as well and any alias can be used.  Most schema entities in
LDAP have an Object IDentifier or OID.  OID's are be default canonical.  As you can see this
is the canonical representation used in the <b>objectClass</b> index.  Incidentally,
users are even allowed to plugin OID's for the value of the objectClass attribute if they
wanted to.  </p>

<p>Notice Index Tuple keys are all sorted and so are the Tuple values of the same key.
 Indices use matchingRules as well to order both keys and values based on matchingRules in
both the forward and reverse LUTs.  </p>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td>If you look at the objectClass
index you'll see that the OID ( for the 'top' objectClass is not included in the index.
 This is because every entry will contain 'top' so there's no point to bloating the index
including it as a value. If we need to walk all the entries in the server we can walk the
MasterTable or one of the system indices like the ndn or updn indices described <a href="/confluence/display/DIRxSRVx11/Structure+and+Organization"
title="Structure and Organization">here</a>.</td></tr></table></div></td></tr></tbody></table>

<h2><a name="IndexandIndexEntry-MatchingRules%2CComparatorsandNormalizers"></a>MatchingRules,
Comparators and Normalizers</h2>

<p>LDAP defines 3 different kinds of matchingRules for attributeTypes: equality, ordering
and substring.  The equality matchingRule governs how attribute values are compared for equality.
 Ordering matchingRules control how attribute values are compared to see if one value is greater
or less than another.  The substring matching rule manages how embedded fields in larger values
can be matched for patterns.  Most LDAP servers have different kinds of indices you can create
for each of these kinds of matching techniques.  Sometimes there's more for example some servers
have an approximate match index, which uses the soundex algorithm or a derivative to determine

<p>ApacheDS has only one type of index and makes some trade offs in space, time and
simplicity while applying some common sense.  ApacheDS functionally models a matchingRule
as a Java Comparator coupled with a Normalizer.  The Normalizer applies the transformation
required to produce a canonical form.  The Comparator performs checks to see if values are
equal, greater or less than one another.  ApacheDS Index objects use the equality matchingRule's
Normalizer to create index Tuples and the ordering matchingRule's Comparator (only if defined)
to determine insertion order into Index LUT BTrees.  If there exists no ordering matchingRule
for the attributeType then the Comparator for the equality matchingRule is used instead.</p>

<p>If a search uses an equality assertion with the index attribute lookups return the
answers we need. If a search filter uses a substring assertion then depending on the pattern
the substring matchingRules are used (only if defined for the attributeType) with a bit more
processing using regular expressions to match normalized values on index walks. When &gt;
or &lt; filter operators are used, and a ordering matchingRule is defined for the attributeType
then that matchingRule's comparator is used to compare values on Index walks with Cursors.
Also because of these choices ApacheDS indices are bi-directional and so ApacheDS pays a price
on write operations to maintain both forward and reverse LUTs.</p>

<p>ApacheDS does not support approximate match since this kind of matching is language
sensitive and just a pile of horse poop.  For this reason approximate match does not work
with internationalization. No server has yet to satisfy my search when I've tried to phonetically
find names using approximate match.</p>

<p>So ApacheDS made some choices to stay simple.  We applied a pattern of thinking where
trade offs were made to keep the most common operations fastest while the least common operations
may suffer slightly.  Others like approximate match are essentially ignored.  ApacheDS processes
approximate match based filter terms as if they were equality assertions.</p>

<h1><a name="IndexandIndexEntry-JDBMBaseIndexImplementation"></a>JDBM Base
Index Implementation</h1>

<p>The JdbmIndex uses two JdbmTables each backed by it's own JDBM BTree in the same
file managed by a JDBM RecordManager for the file.  These files end with the .db file extension
and use the first alias name of the attributeType for the file name.  The JdbmIndex leverages
these JdbmTables for preforming forward reverse lookup operations to assert values as well
as for generating Cursors over the Tuples of these Tables.</p>

<p>JdbmIndex instances require schema information to properly initialize their LUTs
with the right Comparators.  JdbmIndex operations that write Tuples into these two JdbmTables
normalized values before inserting them.  Attribute values inserted into these JdbmTables
are normalized for speed and efficiency.  If these values were not normalized, the Cursors
walking these JdbmTables and lookup operations would need to normalize then compare.  Search
operations are heavily dependent on Index Cursors and lookup operations and normalization
is an expensive process.  Why repeat normalization and pay on each search operation when you
can pay the price once on write operations.  After all, this is LDAP, and LDAP is read optimized.</p>

<h2><a name="IndexandIndexEntry-PossibleRepeatNormalizationInefficiency"></a>Possible
Repeat Normalization Inefficiency</h2>

<p>JdbmIndex tries to protect against lookups without normalization by normalizing lookup
parameters before consulting the two JdbmTable based LUTs.  It also maintains a cache of normalizations
it has performed recently.  The cache in particular may be a waste since Normalizers also
try to cache recent normalization operations, and these normalizers are reused by the system.
 So we may be needlessly double caching and wasting memory.</p>

<p>Another possible problem is that JdbmIndex instance may not need to normalize attribute
values.  There were several discussions about eagerly normalizing attribute values in entries
especially with the move to the ServerEntry API. If this happens then we'll potentially have
3 copies of the same normalization result in memory during an operation.  Furthermore this
removes the need to cache at all.  A normalization accounting review is required to make sure
we're not wasting extra cycles and memory while defining the correct barriers to protect against
non-normalized attribute values.</p>

<div class='panelMacro'><table class='tipMacro'><colgroup><col width='24'><col></colgroup><tr><td
valign='top'><img src="/confluence/images/icons/emoticons/check.gif" width="16" height="16"
align="absmiddle" alt="" border="0"></td><td><b>Eager Normalization Optimizations</b><br
/>When partitions begin to expose access to Index information, the server now is privy
to information it can use to efficiently conduct eager attribute value normalization in ServerEntry
objects.  At the normalization barrier, which will most likely be in the NormalizationInterceptor,
the server can determine the target partition the entry is associated with.  Knowing the target
partition, the server can check which attributes are indexed by the partition and pre-normalize
them.  Note that the user provided value is never lost, the ServerEntry stores both normalized
and user provided forms. This way value normalization and caching requirements are lifted
from partition indices.</td></tr></table></div>

        <div id="commentsSection" class="wiki-content pageSection">
        <div style="float: right;">
            <a href=""
class="grey">Change Notification Preferences</a>
        <a href="">View
        <a href="">View

View raw message