Mailing-List: contact dev-help@directory.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Apache Directory Developers List" <dev@directory.apache.org>
Received-SPF: pass (athena.apache.org: domain of pajbam@gmail.com designates
 74.125.82.178 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=sender:content-type:mime-version:subject:from:in-reply-to:date
         :content-transfer-encoding:message-id:references:to:x-mailer;
        b=ph3zSDH17zJzOhooWRErKxKpJyo2CyBaPlqdXVvz3IpoJkQ4JLBil+s7fJ2LM/s9t6
         xZ0Z/unEBuVlBfd6m15Ah2gc92Dbp+EJnsvLaroJQnZEfe6sdjtZzlCXg0ymOOZoS7NM
         /l6D9+OMPdv+J6GoIxUXt6jtUzWXlkL3lSuYY=
Sender: Pierre-Arnaud Marcelot <pajbam@gmail.com>
Content-Type: text/plain; charset=iso-8859-1
Mime-Version: 1.0 (Apple Message framework v1078)
Subject: Re: About reverse index ...
From: Pierre-Arnaud Marcelot <pa@marcelot.net>
In-Reply-To: <4BE6D7CB.7020502@gmail.com>
Date: Wed, 12 May 2010 18:20:25 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <19E7A62A-FA70-420B-8860-563FC58892C0@marcelot.net>
References: <4BE6D7CB.7020502@gmail.com>
To: "Apache Directory Developers List" <dev@directory.apache.org>


On 9 mai 2010, at 17:42, Emmanuel Lecharny wrote:

> Hi guys,
>=20
> in an attempt to jump into the backend database, and due to the =
problem we have on the add operation, I looked at the current =
implementation (based on JDBM) and the way we use it.
>=20
> Right now, we have one master table containing all the entries, plus =
many indexes, some of them being system indexes (CSN, UUID, RDN, etc) =
and other being user defined.
>=20
> Each of these index are a composition of two tables :
> - a forward index : from a  key, you get a link to entries in the =
Master Table (MT)
> - a reverse index : from an entry ID, you get a link to all the =
contained values
>=20
> As soon as you consider that a delete operation will need to update =
all the index the deleted entry uses, you see that you can benefit from =
having such reverse index : you don't have to grab the entry from the =
backend, as you already have it's ID, which is all what you need. Thus, =
a delete operation is just about doing :
> - get the entry ID
> - for each index, if we have an <Entry ID> stored, then grab the =
associated values, and for each value, delete them from the forward =
index.
> - delete the entry from the MT
>=20
> Sounds like you have avoid a fetch from the MT, but you have to pay an =
heavy penalty for that :
> - first, has you have no idea about how many index are used for this =
entry (suppose that the entry does not contain the optional indexed 'cn' =
attribute), you still have to check in the index.
> - second, you have to maintain 2 tables for each index
>=20
> IMO, the cost is way to expensive compared to the basic approach : =
grab the entry.
>=20
> Now, it's not enough to say 'kill the reverse index !'. There is one =
more reason why we want to have this reverse index, the question is : =
does it brings a lot of benefit ?
>=20
> So why do we need this reverse index ? We discussed about it with =
Stefan, and here is what it is used for : Suppose you have a search =
request with a filter like (&(ObjectClass=3DXXX)(cn=3DYYY)).
>=20
> The search engine will evaluate the number of entries each of those =
filters node will get back. Suppose it's N for the OC filter, and M for =
the cn filter. Let's say that N > M.
> Now, we will loop on the M entries to check if they fit the OC filter.
>=20
> How do we do that ? The first approach would be to grab the entry, and =
check in memory of the filter (ObjectClass=3DXXX) match the entry. If =
not, we ditch the entry, otherwise, itas a valid candidate.
>=20
> The second option is to use the reverse index : we have the entry ID =
(we havdn't grabbed the entry from disk yet), and we can see if the OC =
table contains a reference for this entry ID. If not, we can move to the =
next entry ID. Otherwise, we can grab the entry and return it.
>=20

I would add a third option where you only get the Entry IDs from the =
first filter and use these IDs to verify the presence of these entries =
in the BTree obtained from the forward index of the other filter.
At the end you only grab entries from the master table that are really =
matching the full filter.

> Obviously there is some potential for a speedup. Now, let's consider =
the cases where we benefit from not grabbing the entry.
>=20
> 1) We grab all the entry using the smallest set selected with the =
filters
> pros :
> - Fast entry filtering, it's just a question on applying the filters =
on the entry
> - It can be used very late, as we have already grabbed the entry. The =
entry selection can not only be based on the filter, but can also be =
combined with virtual attributes which has been added on the grabbed =
entry
> - That means we can move the search engine out of the backend, and =
have it working on the top of the Interceptor chain
>=20
> cons :
> - We may read way more entries than necessary
>=20
> 2) we just get the Entry ID and use the reverse index to select the =
entries
> pros :
> - we don't grab the entries unless absolutely necessary
>=20
> cons :
> - we may have to check in many indexes (as many as we have exprNodes =
in the filter expression, assuming that each of them are indexed), which =
means lot of O(log(N)) operations on these index (assuming that all =
those index are in memory, otherwise, if we hit the disk, the benefit is =
null)
> - as soon as we have one single non indexed exprNode in the filter we =
have to check, then we will have to grab the entry anyway. The benefit =
is that we may save some disk access.
> - of course, we have to maintain all the reverse indexes.
> - if the exprNode is associated with an attribute not present in te =
entry, we do a lookup in the index for nothing
> - if the intersection between the two exprNode is big (ie =
Nb(ObjectClass=3DXXX) inter Nb(cn=3DYYY)), then the gin will be low, and =
if this itersection is small, then it's likely that the smallest set has =
been already selected as the main index to use to grab entries, thus =
leading to a small number of entries to grab.
>=20
>=20
> So here is what I suggest :
> - get rid of those reverse index
> - or, at least, make it optionnal
>=20
> thoughts ?

I think we can live without them.
If the implementation without them is too slow, then we could bring them =
back.

My 2 cents,
Pierre-Arnaud=20

>=20
> --=20
> Regards,
> Cordialement,
> Emmanuel L=E9charny
> www.nextury.com
>=20
>=20