directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Karasulu <akaras...@apache.org>
Subject Re: HBase partition integration in trunks ?
Date Tue, 16 Aug 2011 13:38:01 GMT
On Tue, Aug 16, 2011 at 4:33 PM, Emmanuel L├ęcharny <elecharny@apache.org>wrote:

> On 8/16/11 3:14 PM, Kiran Ayyagari wrote:
>
>> On Tue, Aug 16, 2011 at 6:27 PM, Emmanuel Lecharny<elecharny@gmail.com>
>>  wrote:
>>
>>> On 8/16/11 2:22 PM, Kiran Ayyagari wrote:
>>>
>>>> On Tue, Aug 16, 2011 at 1:23 PM, Emmanuel L├ęcharny<elecharny@apache.org
>>>> >
>>>>  wrote:
>>>>
>>>>> On 8/15/11 5:59 PM, Stefan Seelmann wrote:
>>>>>
>>>>>> Now I have to update the parts that are a bit special, let me explain:
>>>>>> In HBase partition I didn't use one-level and sub-level indices,
but
>>>>>> use the RDN index table instead. I also extended the search engine
in
>>>>>> that way that one-level and sub-level cursors get the search filter
in
>>>>>> order to perform filtering within the store instead of returning
all
>>>>>> candidates and evaluate them.
>>>>>>
>>>>> Some toughts about this one-level/sub-level index.
>>>>>
>>>>> Using the Rdn index makes perfect sense : we have the Rdn ->    parent
>>>>> relation
>>>>> plus the parent ->    children relation in this index, so there is
no
>>>>> need
>>>>> to
>>>>> have a one level index (all the children are already listed in the RDN
>>>>> index
>>>>> for a specific entry). I'm a bit more concerned about the sub-level
>>>>> processing : we have to recurse on all the children to get all the
>>>>> candidates. That's fine, we can easily implement that (and you already
>>>>> did),
>>>>> but what concerns me is that we don't have the count of all the
>>>>> entries,
>>>>> we
>>>>> will have to compute them. This count is necessary in the search engine
>>>>> to
>>>>> select the index we will use to walk the entries.
>>>>>
>>>>> One solution would be to store two more elements in the ParentIdAndRdn
>>>>> data
>>>>> structure : the number of children directly below the RDN, and the
>>>>> number
>>>>> of
>>>>> children and descendant. That would probably solve the issue I'm
>>>>> mentioning.
>>>>> Of course, that also means we wil have to update all the RDN hierarchy
>>>>> from
>>>>> top to bottom (but affecting only the RDN part of the entry DN) each
>>>>> time
>>>>> we
>>>>> add/move/delete an entry. Note that we already do that for the oneLevel
>>>>> and
>>>>> Sublevel index.
>>>>>
>>>>>  Just to make a point:
>>>> I think, in the case of achieving SubLevel index evaluation with RDN
>>>> index it becomes a costly and complex operation
>>>> (recursive scanning and updating) where as with the current sublevel
>>>> index it takes O(1) to fetch all the sublevel children of
>>>> an entry.
>>>> Not sure if HBase has any features to solve in an efficient manner
>>>>
>>> If you think about what is done when we add an entry in the current code
>>> base :
>>> - add the entry in the MasterTable
>>> - add the RDN/parent into the RdnIndex
>>> - update the one-level index with the newly added entry reference,
>>> increasing the number of children for the parent
>>> - for each RDN in the parent, update the sub-level index with the newly
>>> added entry reference, increasing the number of children for the parent
>>>
>>> If we compare what we would do if we remove the one-level/sub-level index
>>> :
>>> - add the entry in the MasterTable
>>> - add the RDN/parent into the RdnIndex
>>> - for each RDN in the parent, update the rdn index with the newly added
>>> entry reference, increasing the number of children for the parent
>>>
>>>  this count needs to be updated in more than one ParentAndRdn entry(for
>> sublevel indexing feature only) depending on the level
>> at which the child entry was added(cause each entry in the ancestor
>> chain is a valid base to perform sublevel search)
>> and also when it comes to fetch all the sublevel entries we need to
>> scan recursively.
>>
> Right, and this is why I added a 'for each RDN ' in the last operation.
> Btw, this is exactly the same kind of cost we already have to pay to update
> the sub-level index.


Yep the costs are not very different in terms of processing but the IO
advantage as you mentioned is huge. I think having most of the Rdn index in
cache this will fly big time.

Also just having numbers with some simple tests will show us the real facts.
But this approach sound healthy.

Best Regards,
-- Alex

Mime
View raw message