Return-Path: Delivered-To: apmail-directory-dev-archive@www.apache.org Received: (qmail 79105 invoked from network); 2 Feb 2008 07:19:40 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 2 Feb 2008 07:19:40 -0000 Received: (qmail 51361 invoked by uid 500); 2 Feb 2008 07:19:31 -0000 Delivered-To: apmail-directory-dev-archive@directory.apache.org Received: (qmail 51315 invoked by uid 500); 2 Feb 2008 07:19:30 -0000 Mailing-List: contact dev-help@directory.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Apache Directory Developers List" Delivered-To: mailing list dev@directory.apache.org Received: (qmail 51304 invoked by uid 99); 2 Feb 2008 07:19:30 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 01 Feb 2008 23:19:30 -0800 X-ASF-Spam-Status: No, hits=2.0 required=10.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of akarasulu@gmail.com designates 72.14.204.239 as permitted sender) Received: from [72.14.204.239] (HELO qb-out-0506.google.com) (72.14.204.239) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 02 Feb 2008 07:19:00 +0000 Received: by qb-out-0506.google.com with SMTP id o21so1301372qba.9 for ; Fri, 01 Feb 2008 23:19:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:sender:to:subject:in-reply-to:mime-version:content-type:references:x-google-sender-auth; bh=cg/pMIRjoaK4IoQtCX16YRktFsAZr1nR3Fg4ztrnshc=; b=i3SqMXkxW1tsp8skO9WAEv1vx9NQCFFFP+XldaF5+2WfoKy/ZV91wFmfYxXx+vbeoi09dNhsERyyXl3F2cc8Mkn3aPNWVDHdFgGicq0IIHq5RNuCF5naUuNMdikQAFRhdYDLhBoSN3kQAWrmNy4M2tH6TAxopg0mlsBOJFvCj+A= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:sender:to:subject:in-reply-to:mime-version:content-type:references:x-google-sender-auth; b=rTaW/HrIaGVYJnfKZ2SpjL6M7El0x4mrgcZ6jCRtcnxAUjSrkDIxRkkpj9OcFvXpzYd3hqMUuweDaDTDwIC8F7qacPvP/b9mPe29s4jO0EwmiffGJWV3WIs8BfF9JbILNzhEDOOiuYuWzkMQFjaet8EofsP9Ey3XPMEwQ4gmQ/8= Received: by 10.142.44.11 with SMTP id r11mr2683525wfr.102.1201936746631; Fri, 01 Feb 2008 23:19:06 -0800 (PST) Received: by 10.143.166.2 with HTTP; Fri, 1 Feb 2008 23:19:06 -0800 (PST) Message-ID: Date: Sat, 2 Feb 2008 02:19:06 -0500 From: "Alex Karasulu" Sender: akarasulu@gmail.com To: "Apache Directory Developers List" Subject: Re: [ApacheDS] Going to need to implement a splay tree In-Reply-To: <47A074DC.8090109@gmail.com> MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_Part_33833_18073604.1201936746625" References: <47A074DC.8090109@gmail.com> X-Google-Sender-Auth: 90d6bf0f897e279a X-Virus-Checked: Checked by ClamAV on apache.org ------=_Part_33833_18073604.1201936746625 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sorry for taking so long to get to this email. More inline ... On Jan 30, 2008 8:00 AM, Emmanuel Lecharny wrote: > Hi Alex, > > I will comment your mail extensively, with some ideas I have. > > Alex Karasulu wrote: > > Background: > > > > ApacheDS used JDBM a B+Tree implementation in it's default partition > > implementation. JDBM does not allow duplicate keys (same key with many > > values) and so we abstract this using the Table interface. Tables > > support duplicate keys. So to allow this JdbmTable which implements > > this interface uses a TreeSet or something called a BTreeRedirect to > > manage the values of duplicate keys. When some threshold of values is > > reached like say 50 for the same key, the JDBM table switches from > > using a TreeSet to using a BTreeRedirect. A BTreeRedirect is a pointer > > to another BTree in the same db file. The BTree pointed to contains > > the values of the duplicate key as it's keys. > > > > Problem: > > > > I was looking into the new Cursor interface and came to the > > realization that we can no longer use TreeSets efficiently or properly > > for that matter. To build a Cursor on this we would have to be able > > to traverse up and down as the user of the Cursor desired and there's > > simply no way to do this. > Considering that a treeset internally contains a NavigableMap, traverse > a TreeSet is possible. The problem is that you can't move backward as > soon as you have moved forward, without fetching a new reference of the > TreeSet. It's basically an enumaration. The question is : do we need to > move back and forth ? First NavigableMap was introduced in JDK 6 so it's not going to be an optio= n if we intend to stick to JDK 5. Secondly I think we're going to need to be able to advance to any position in the Cursor, this was one of the main reasons for creating this interface. This way you don't need to be going back to the partition every time to do something. We have decoupled the request to enumerate through indices or entries from the act of requesting an enumeration and the partition. This will help move the search functionality out of partitions and into the core. > > > > > Proposal: > > > > Implement an in memory linked splay tree with splay Cursor using the > > links for traversal. Splay trees are idea for recurring lookups which > > would be the case for us. Furthermore splay trees are great for > > caches. Splay trees reposition the most frequently accessed nodes > > close to the root of the tree making access times faster. They > > perform poorly though when inserting in an ordered fashion. I think > > this is acceptable for our use of this data structure for a TreeSet > > (which uses a red-black tree) and ideal for use in devising a better > > entry cache higher up in the server. > > > > Thoughts? > These thoughts below are unrelated to the problem but they answer other questions with some good insight. You're talking here about the way we designed the default partition for facilitating search with various system and user indices. Also this a secondary storage mechanism not one that we need for in memory above. What I am talking about is a simple binary searc= h tree implementation to replace TreeSets where we can add our own Cursor to enumerate through it properly. I'll answer these points below in another thread. Alex > There are many different things to consider. First, let's divide, then > conquer... > > 1) We have many different kind of data we want to manipulate. Let's list > them : > o Entries. They are stored into the MasterTable. An entry is a > serialized Attribute, associated with a unique identifier, which is > currently a Long (which allows our server to store up to 18 446 744 073 > 709 551 615 entries ... 18 quintillions !). Those long must be fetched > quickly, as we will always get an entry by its ID. Using a BTree for > that is time consuming, as fetching an entry will be done in O(log(N)). > We should use a Hash to speedup entries retrieval (of course, at the > risk that the worst case scenario can transform a search operation to > cost N read, but this is unlikely...). The only point to consider is > that this ID is incremental, so the hash function must be chosen > carefully. > > o DNs. This is a very specific case. DNs are unique, mono valued, > unordered elements. DNs are not attributes, too. It's unordered, as a DN > order depends on the RDN attribute's own order, which can be different > depending on the RDN's attribute. The consequence is that we should use > a Hash to store the DNs. > > o Attributes with single values. Typically, 'uid'. There are supposed to > be unique, too (ie, you can imagine that an uid can be found more than > once in a LDAP server (there is no reason why you should not be allowed > to store a user's UID in many places, if this user is registred in more > than once... For instance, I use the same uid on several computers, so I > may have to store these UID in different branches in my LDAP server). We > can use a Hash table to store such an attribute, because it's supposed > to be unique. If you have more than one reference, then the question > will be : how do I manage more than one value associated with this > attribute (cf next point). > Using a Hash table here means you know that this attribute is not only > holding a single value, but also it's unique within the whole server. > Dangerous ... > > o All other attributes (including single valued attributes which can be > found more than once) : HashMap is not really a good choice, except if > the admin *knows* that the number of values won't be enormous. And we > have specific cases where you have a Attribute <-> Value relation where > an attribute may have thousands (even millions !) of values. The > 'member' attribute is a good example. But we also have to consider the > ATtribute <-> entries relation. A search request like > (ObjectClass=3D'person') will use the ObjectClass index, and get all the > entries containing the 'person' value for tje ObjectClass attribute. > Potentially, thousands... The current implementation, as explained by > Alex, instead of storing a list of all the entry IDs linked to the > 'person' value, stores a TreeSet, which is managed in a separate file. > There is an indirection, a little bit like when you deal with a N-N > relation in a RDBMS context (when two tables are related with a N-N > relation, then you create a third table holding the relation between > both IDs). > Can we do better ? Because this solution is costly, and not really > effective : you have to deal with the structure duality (you hold either > a list or a reference to a treeset, depending on the number of > elements), and this make it more complex to manage cache (i'm not sure > at this point that we can cache such elements...) > Alex proposal (using Splay trees) won't solve the problem, IMHO. It's > just a better way to store data in memory, with extended navigation > possibilities, but it does not help when it comes to store this > structure on the disk. As the data are moved frequently, even on read, > this will increase the cost of searching so much that it will kill the > main advantage of LDAP vs RDBMS. So, yes, we can use Splay trees in > memory, but we can't store splay trees on disk. > > Some other points to take into consideration is also the fact that we > may run short of memory, but we don't want the server to stop running, > just because we got an OOM exception. One way to handle this would be to > use weak references for the cache.I have no idea what it may imply for a > Splay tree implementation, but this must be studied. > > Now, for those attributes, I have described another solution, based on > the fact that we use 'long' to store IDs : > http://cwiki.apache.org/confluence/display/DIRxSRVx11/Backend. I think > this might be interesting to dig those ideas a little bit to see if it > fits our need. > > > Note : Alex's need is more urgent, so I guess that we won't be able to > implement something very different than what we currently have. I just > wanted to present some of the ideas i'm playing with for years now... > (but I never had enough time to go any further than drafting some small > pages on confluence ;) > > > -- > -- > cordialement, regards, > Emmanuel L=E9charny > www.iktek.com > directory.apache.org > > > ------=_Part_33833_18073604.1201936746625 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Content-Disposition: inline Sorry for taking so long to get to this email.  More inline ...
On Jan 30, 2008 8:00 AM, Emmanuel Lecharny <= elecharny@gmail.com> wrote:
Hi Alex,

I will comment your mail extensively, with some ideas I hav= e.

Alex Karasulu wrote:
> Background:>
> ApacheDS used JDBM a B+Tree implementation in it's defaul= t partition
> implementation. JDBM does not allow duplicate keys (same key with many=
> values) and so we abstract this using the Table interface.  T= ables
> support duplicate keys.  So to allow this JdbmTable whic= h implements
> this interface uses a TreeSet or something called a BTreeRedirect to> manage the values of duplicate keys.  When some threshold of va= lues is
> reached like say 50 for the same key, the JDBM table switch= es from
> using a TreeSet to using a BTreeRedirect. A BTreeRedirect is a pointer=
> to another BTree in the same db file. The BTree pointed to contain= s
> the values of the duplicate key as it's keys.
>
> Problem:
>
> I was looking into the new Cursor interface a= nd came to the
> realization that we can no longer use TreeSets effic= iently or properly
> for that matter.  To build a Cursor on this= we would have to be able
> to traverse up and down as the user of the Cursor desired and there= 9;s
> simply no way to do this.
Considering that a treeset i= nternally contains a NavigableMap, traverse
a TreeSet is possible. The p= roblem is that you can't move backward as
soon as you have moved forward, without fetching a new reference of the
= TreeSet. It's basically an enumaration. The question is : do we need to=
move back and forth ?

First NavigableMap was intro= duced in JDK 6 so it's not going to be an option if we intend to stick = to JDK 5.  Secondly I think we're going to need to be able to adva= nce to any position in the Cursor, this was one of the main reasons for cre= ating this interface.  This way you don't need to be going back to= the partition every time to do something.  We have decoupled the requ= est to enumerate through indices or entries from the act of requesting an e= numeration and the partition.  This will help move the search function= ality out of partitions and into the core.
 

<= div class=3D"Ih2E3d">>
> Proposal:
>
> Implement an in= memory linked splay tree with splay Cursor using the
> links for traversal.  Splay trees are idea for recurring lookups = which
> would be the case for us.  Furthermore splay trees are g= reat for
> caches.  Splay trees reposition the most frequently a= ccessed nodes
> close to the root of the tree making access times faster.  They> perform poorly though when inserting in an ordered fashion. I think<= br>> this is acceptable for our use of this data structure for a TreeSet=
> (which uses a red-black tree) and ideal for use in devising a better> entry cache higher up in the server.
>
> Thoughts?
=

These thoughts below are unrelated to the problem but= they answer other questions with some good insight.  You're talki= ng here about the way we designed the default partition for facilitating se= arch with various system and user indices.  Also this a secondary stor= age mechanism not one that we need for in memory above.  What I am tal= king about is a simple binary search tree implementation to replace TreeSet= s where we can add our own Cursor to enumerate through it properly.

I'll answer these points below in another thread.
 
Alex=


There are many different things to consider. First, let's div= ide, then
conquer...

1) We have many different kind of data we wa= nt to manipulate. Let's list
them :
o Entries. They are stored in= to the MasterTable. An entry is a
serialized Attribute, associated with a unique identifier, which is
curr= ently a Long (which allows our server to store up to 18 446 744 073
709 = 551 615 entries ... 18 quintillions !). Those long must be fetched
quick= ly, as we will always get an entry by its ID. Using a BTree for
that is time consuming, as fetching an entry will be done in O(log(N)).
= We should use a Hash to speedup entries retrieval (of course, at the
ris= k that the worst case scenario can transform a search operation to
cost = N read, but this is unlikely...). The only point to consider is
that this ID is incremental, so the hash function must be chosen carefully.=

o DNs. This is a very specific case. DNs are unique, mono valued,unordered elements. DNs are not attributes, too. It's unordered, as a= DN
order depends on the RDN attribute's own order, which can be different<= br>depending on the RDN's attribute. The consequence is that we should = use
a Hash to store the DNs.

o Attributes with single values. Typ= ically, 'uid'. There are supposed to
be unique, too (ie, you can imagine that an uid can be found more than
o= nce in a LDAP server (there is no reason why you should not be allowed
t= o store a user's UID in many places, if this user is registred in more<= br> than once... For instance, I use the same uid on several computers, so Imay have to store these UID in different branches in my LDAP server). Wecan use a Hash table to store such an attribute, because it's suppose= d
to be unique. If you have more than one reference, then the question
wil= l be : how do I manage more than one value associated with this
attribut= e (cf next point).
Using a Hash table here means you know that this attr= ibute is not only
holding a single value, but also it's unique within the whole server.Dangerous ...

o All other attributes (including single valued attr= ibutes which can be
found more than once) : HashMap is not really a good= choice, except if
the admin *knows* that the number of values won't be enormous. And wehave  specific cases where you have a Attribute <-> Value rela= tion where
an attribute may have thousands (even millions !) of values. = The
'member' attribute is a good example. But we also have to consider = the
ATtribute <-> entries relation. A search request like
(Obje= ctClass=3D'person') will use the ObjectClass index, and get all the=
entries containing the 'person' value for tje ObjectClass attribute= .
Potentially, thousands... The current implementation, as explained by<= br>Alex, instead of storing a list of all the entry IDs linked to the
'person' value, stores a TreeSet, which is managed in a separate fi= le.
There is an indirection, a little bit like when you deal with a N-N<= br>relation in a RDBMS context (when two tables are related with a N-N
relation, then you create a third table holding the relation between
bot= h IDs).
Can we do better ? Because this solution is costly, and not real= ly
effective : you have to deal with the structure duality (you hold eit= her
a list or a reference to a treeset, depending on the number of
elements)= , and this make it more complex to manage cache (i'm not sure
at thi= s point that we can cache such elements...)
Alex proposal (using Splay t= rees) won't solve the problem, IMHO. It's
just a better way to store data in memory, with extended navigation
poss= ibilities, but it does not help  when it comes to store this
struct= ure on the disk. As the data are moved frequently, even on read,
this wi= ll increase the cost of searching so much that it will kill the
main advantage of LDAP vs RDBMS. So, yes, we can use Splay trees in
memo= ry, but we can't store splay trees on disk.

Some other points to= take into consideration is also the fact that we
may run short of memor= y, but we don't want the server to stop running,
just because we got an OOM exception. One way to handle this would be touse weak references for the cache.I have no idea what it may imply for aSplay tree implementation, but this must be studied.

Now, for thos= e attributes, I have described another solution, based on
the fact that we use 'long' to store IDs :
http= ://cwiki.apache.org/confluence/display/DIRxSRVx11/Backend. I think
t= his might be interesting to dig those ideas a little bit to see if it
fits our need.


Note : Alex's need is more urgent, so I guess= that we won't be able to
implement something very different than wh= at we currently have. I just
wanted to present some of the ideas i'm= playing with for years now...
(but I never had enough time to go any further than drafting some small
= pages on confluence ;)


--
--
cordi= alement, regards,
Emmanuel L=E9charny
www.iktek.com
directory.apache.= org



------=_Part_33833_18073604.1201936746625--