directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: [Mavibot] Value storage
Date Mon, 16 Sep 2013 11:55:11 GMT
Let's get back to what we need, from the BTree POV...

We have a BTree storing Values associated with Keys. A given Key can be
assoicated to one or many values, but most of the time, we will have
only one value, and if we have many values, we will not have that many.
But occasionally, we may have thousands of values associated with a
single key.

What happens if we have thousands of values for a given key ? It's
become very costly to manipulate the values, and as we have to order the
values to be able to find out if a value is already present in a decent
delay (ie, O(log(n) if the values are ordered, compared to O(n) if they
are not). For this very reason, we would like to use a decent way to
store the values, and this is why we do use a sub-btree (BTree<V, V>).

In ApacheDS and JDBM, we have a threshold number which is used to know
when to create a sub-btree in this case : for instance, if we have more
than 500 values, we will use a sub-BTree, otherwise we use an array of
values. The reason is that up to a point, it's faster to manipulate an
array in memory than a BTree (the tiping poing though is not obvious :
depending on the value's size, this threshold number will vary).

So we want to store :
- a single value most of the case
- an array of values when we have more than 1 value, but below N values
(N to be evaulated)
- a sub-BTree when we have more than N values.

Now, we can simplify this scheme in two ways :
- we can ditch the single value, and use an array instead, and switch to
a sub-BTree at some point
- we can decide not to use an array at all, and use a sub-BTree when we
ave more than one value
- we can also always use a sub-btree when we have an index allowing
muliple values.

So far, we decided to not go with option 3, as it puts a huge strain on
the BTree performances (the very first version was doing exactly taht,
and using something different improved the performances by 15%).

ATM, we are experimenting option 2, but we may later go for option 1.

That beng said, the BTree operations which manipulate the values are the
following :

- getValue(K)
- getValues(K)
- insert(K, V)
- contains(K, V)
- delete(K, V)

As we can see, we have two methods which return values, one whichs
return one single value, and the other which returns all the values. The
question here is why do we have two methods? That forces the users to
know when to use one or the other, ie the users must know if the BTree
allows multiple values.

The suggestion would be to make the getValues() method to return an
iterator. That's fine, but what about the semantic of the getValue()
method, when we have more than one value stored ? Should it returns the
very first value ?

IMO, I think we should have the getValues() to return an iterator, and
the getValue() method to returns either the single value or the first
value if there are more than one.


wdyt ?


Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit :
> I would go with a simple Iterator.
>
> Regards,
> Pierre-Arnaud
>
> On 16 sept. 2013, at 11:51, Emmanuel Lécharny <elecharny@gmail.com> wrote:
>
>> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
>>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny <elecharny@gmail.com>wrote:
>>>
>>>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
>>>>> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny <elecharny@gmail.com
>>>>> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>>
>>>>>>    /**
>>>>>>     * @return The array of stored values.
>>>>>>     */
>>>>>>    V[] getValues();
>>>>>>
>>>>>> shouldn't this be returning a BTree<V,V>? cause we don't support
an
>>>> array
>>>>> as a holder of
>>>>> multiple values
>>>> The fact that it's stored as a BTree is implementation dependent, I'm
>>>> not sure the user of the API should know about it. What the user wants
>>>> is to get back the stored values, and this is convenient to get back an
>>>> array.
>>>>
>>>> IMO, it is not convenient, we cannot copy all the values into an array
>>> and return, and returning an iterator is not going to help either, both will
>>> severely affect the search performance
>>> If needed, we can wrap it an immutable structure and return the BTree
>>> to prevent direct updates, but I wish to differ this until we see the impact
>>> on partition implementation.
>> Except that from the BTree interface, the one the users see, you have :
>>
>>    public V get( K key ) throws IOException, KeyNotFoundException
>>    public DuplicateKeyVal<V> getValues( K key ) throws IOException,
>> KeyNotFoundException
>>
>> so the users has no clue about the underlying data structure.
>>
>> Now, I'm not sure if it makes any sense to expose a getValues() method
>> returning an array. May be an iterator is enough..
>>
>> -- 
>> Regards,
>> Cordialement,
>> Emmanuel Lécharny
>> www.iktek.com 
>>


-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com 


Mime
View raw message