From dev-return-33372-apmail-directory-dev-archive=directory.apache.org@directory.apache.org Wed May 12 15:42:37 2010 Return-Path: Delivered-To: apmail-directory-dev-archive@www.apache.org Received: (qmail 24055 invoked from network); 12 May 2010 15:42:36 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 12 May 2010 15:42:36 -0000 Received: (qmail 65538 invoked by uid 500); 12 May 2010 15:42:36 -0000 Delivered-To: apmail-directory-dev-archive@directory.apache.org Received: (qmail 65359 invoked by uid 500); 12 May 2010 15:42:35 -0000 Mailing-List: contact dev-help@directory.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Apache Directory Developers List" Delivered-To: mailing list dev@directory.apache.org Received: (qmail 65351 invoked by uid 99); 12 May 2010 15:42:35 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 May 2010 15:42:35 +0000 X-ASF-Spam-Status: No, hits=-0.5 required=10.0 tests=AWL,FREEMAIL_FROM,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of elecharny@gmail.com designates 74.125.82.178 as permitted sender) Received: from [74.125.82.178] (HELO mail-wy0-f178.google.com) (74.125.82.178) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 12 May 2010 15:42:28 +0000 Received: by wyb40 with SMTP id 40so150343wyb.37 for ; Wed, 12 May 2010 08:42:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:message-id:date:from:reply-to :user-agent:mime-version:to:subject:content-type :content-transfer-encoding; bh=7q/B6x8F9HJN9/sgG+WYvtnbTBAXTwGiMjwktOSlbt0=; b=AExhOuRBHJ4OxR56m54p4IfsrMIoqvztI5lUAlWKu8DrNBUK4r0XTDffixEsrvYgI4 UXQbhby/mhZzMYRhJOYrFHjAnGFyKGnOwFXJBRd93APCADWrV0n2N2zgpYFQR2i025ey s9ai0H7dX5I1gPYGbqP2znVnhAAQwjsXAAWik= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=message-id:date:from:reply-to:user-agent:mime-version:to:subject :content-type:content-transfer-encoding; b=vQ7b46he2LF4Swth/WQiVpNHkSyxSAtvSOwlOZPvZavD6/V1A4Csaq006SixKEujKQ XuyjE167tjkz1FytUBVAKhUZszAETvP++wwB2EnqsEBsRdIcx0P7UBLP8kPlMy+RfmtX T2j0a7m4mCzNKE4iGyst57YTDyKtXF+9Anbww= Received: by 10.227.69.213 with SMTP id a21mr6907824wbj.220.1273678926416; Wed, 12 May 2010 08:42:06 -0700 (PDT) Received: from emmanuel-lecharnys-MacBook-Pro.local (lon92-10-78-226-4-211.fbx.proxad.net [78.226.4.211]) by mx.google.com with ESMTPS id l23sm1800992wbb.8.2010.05.12.08.42.04 (version=TLSv1/SSLv3 cipher=RC4-MD5); Wed, 12 May 2010 08:42:05 -0700 (PDT) Message-ID: <4BE6D7CB.7020502@gmail.com> Date: Sun, 09 May 2010 17:42:03 +0200 From: Emmanuel Lecharny Reply-To: elecharny@apache.org User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: Apache Directory Developers List Subject: About reverse index ... Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 8bit Hi guys, in an attempt to jump into the backend database, and due to the problem we have on the add operation, I looked at the current implementation (based on JDBM) and the way we use it. Right now, we have one master table containing all the entries, plus many indexes, some of them being system indexes (CSN, UUID, RDN, etc) and other being user defined. Each of these index are a composition of two tables : - a forward index : from a key, you get a link to entries in the Master Table (MT) - a reverse index : from an entry ID, you get a link to all the contained values As soon as you consider that a delete operation will need to update all the index the deleted entry uses, you see that you can benefit from having such reverse index : you don't have to grab the entry from the backend, as you already have it's ID, which is all what you need. Thus, a delete operation is just about doing : - get the entry ID - for each index, if we have an stored, then grab the associated values, and for each value, delete them from the forward index. - delete the entry from the MT Sounds like you have avoid a fetch from the MT, but you have to pay an heavy penalty for that : - first, has you have no idea about how many index are used for this entry (suppose that the entry does not contain the optional indexed 'cn' attribute), you still have to check in the index. - second, you have to maintain 2 tables for each index IMO, the cost is way to expensive compared to the basic approach : grab the entry. Now, it's not enough to say 'kill the reverse index !'. There is one more reason why we want to have this reverse index, the question is : does it brings a lot of benefit ? So why do we need this reverse index ? We discussed about it with Stefan, and here is what it is used for : Suppose you have a search request with a filter like (&(ObjectClass=XXX)(cn=YYY)). The search engine will evaluate the number of entries each of those filters node will get back. Suppose it's N for the OC filter, and M for the cn filter. Let's say that N > M. Now, we will loop on the M entries to check if they fit the OC filter. How do we do that ? The first approach would be to grab the entry, and check in memory of the filter (ObjectClass=XXX) match the entry. If not, we ditch the entry, otherwise, itas a valid candidate. The second option is to use the reverse index : we have the entry ID (we havdn't grabbed the entry from disk yet), and we can see if the OC table contains a reference for this entry ID. If not, we can move to the next entry ID. Otherwise, we can grab the entry and return it. Obviously there is some potential for a speedup. Now, let's consider the cases where we benefit from not grabbing the entry. 1) We grab all the entry using the smallest set selected with the filters pros : - Fast entry filtering, it's just a question on applying the filters on the entry - It can be used very late, as we have already grabbed the entry. The entry selection can not only be based on the filter, but can also be combined with virtual attributes which has been added on the grabbed entry - That means we can move the search engine out of the backend, and have it working on the top of the Interceptor chain cons : - We may read way more entries than necessary 2) we just get the Entry ID and use the reverse index to select the entries pros : - we don't grab the entries unless absolutely necessary cons : - we may have to check in many indexes (as many as we have exprNodes in the filter expression, assuming that each of them are indexed), which means lot of O(log(N)) operations on these index (assuming that all those index are in memory, otherwise, if we hit the disk, the benefit is null) - as soon as we have one single non indexed exprNode in the filter we have to check, then we will have to grab the entry anyway. The benefit is that we may save some disk access. - of course, we have to maintain all the reverse indexes. - if the exprNode is associated with an attribute not present in te entry, we do a lookup in the index for nothing - if the intersection between the two exprNode is big (ie Nb(ObjectClass=XXX) inter Nb(cn=YYY)), then the gin will be low, and if this itersection is small, then it's likely that the smallest set has been already selected as the main index to use to grab entries, thus leading to a small number of entries to grab. So here is what I suggest : - get rid of those reverse index - or, at least, make it optionnal thoughts ? -- Regards, Cordialement, Emmanuel Lécharny www.nextury.com