directory-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kiran Ayyagari <kayyag...@apache.org>
Subject Re: [Mavibot] Value storage
Date Mon, 16 Sep 2013 17:59:17 GMT
On Mon, Sep 16, 2013 at 5:25 PM, Emmanuel Lécharny <elecharny@gmail.com>wrote:

> Let's get back to what we need, from the BTree POV...
>
> We have a BTree storing Values associated with Keys. A given Key can be
> assoicated to one or many values, but most of the time, we will have
> only one value, and if we have many values, we will not have that many.
> But occasionally, we may have thousands of values associated with a
> single key.
>
> What happens if we have thousands of values for a given key ? It's
> become very costly to manipulate the values, and as we have to order the
> values to be able to find out if a value is already present in a decent
> delay (ie, O(log(n) if the values are ordered, compared to O(n) if they
> are not). For this very reason, we would like to use a decent way to
> store the values, and this is why we do use a sub-btree (BTree<V, V>).
>
> In ApacheDS and JDBM, we have a threshold number which is used to know
> when to create a sub-btree in this case : for instance, if we have more
> than 500 values, we will use a sub-BTree, otherwise we use an array of
> values. The reason is that up to a point, it's faster to manipulate an
> array in memory than a BTree (the tiping poing though is not obvious :
> depending on the value's size, this threshold number will vary).
>
> So we want to store :
> - a single value most of the case
> - an array of values when we have more than 1 value, but below N values
> (N to be evaulated)
> - a sub-BTree when we have more than N values.
>
> Now, we can simplify this scheme in two ways :
> - we can ditch the single value, and use an array instead, and switch to
> a sub-BTree at some point
> - we can decide not to use an array at all, and use a sub-BTree when we
> ave more than one value
> - we can also always use a sub-btree when we have an index allowing
> muliple values.
>
> So far, we decided to not go with option 3, as it puts a huge strain on
> the BTree performances (the very first version was doing exactly taht,
> and using something different improved the performances by 15%).
>
> ATM, we are experimenting option 2, but we may later go for option 1.
>
> That beng said, the BTree operations which manipulate the values are the
> following :
>
> - getValue(K)
> - getValues(K)
> - insert(K, V)
> - contains(K, V)
> - delete(K, V)
>
> As we can see, we have two methods which return values, one whichs
> return one single value, and the other which returns all the values. The
> question here is why do we have two methods? That forces the users to
> know when to use one or the other, ie the users must know if the BTree
> allows multiple values.
>
> The suggestion would be to make the getValues() method to return an
> iterator. That's fine, but what about the semantic of the getValue()
> method, when we have more than one value stored ? Should it returns the
> very first value ?
>
yeap, this is what it does now, the first value will be returned if there
are more than
one value

>
> IMO, I think we should have the getValues() to return an iterator, and
> the getValue() method to returns either the single value or the first
> value if there are more than one.
>
> I still prefer to get a BTree(that gives us better control if we want to
search for a
value in BTree)
otherwise why not return a cursor? (personally, the last preference)

>
> wdyt ?
>
>
> Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit :
> > I would go with a simple Iterator.
> >
> > Regards,
> > Pierre-Arnaud
> >
> > On 16 sept. 2013, at 11:51, Emmanuel Lécharny <elecharny@gmail.com>
> wrote:
> >
> >> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit :
> >>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny <
> elecharny@gmail.com>wrote:
> >>>
> >>>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit :
> >>>>> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny <
> elecharny@gmail.com
> >>>>> wrote:
> >>>>>
> >>>>>> Hi guys,
> >>>>>>
> >>>>>>
> >>>>>>    /**
> >>>>>>     * @return The array of stored values.
> >>>>>>     */
> >>>>>>    V[] getValues();
> >>>>>>
> >>>>>> shouldn't this be returning a BTree<V,V>? cause we don't
support an
> >>>> array
> >>>>> as a holder of
> >>>>> multiple values
> >>>> The fact that it's stored as a BTree is implementation dependent, I'm
> >>>> not sure the user of the API should know about it. What the user wants
> >>>> is to get back the stored values, and this is convenient to get back
> an
> >>>> array.
> >>>>
> >>>> IMO, it is not convenient, we cannot copy all the values into an array
> >>> and return, and returning an iterator is not going to help either,
> both will
> >>> severely affect the search performance
> >>> If needed, we can wrap it an immutable structure and return the BTree
> >>> to prevent direct updates, but I wish to differ this until we see the
> impact
> >>> on partition implementation.
> >> Except that from the BTree interface, the one the users see, you have :
> >>
> >>    public V get( K key ) throws IOException, KeyNotFoundException
> >>    public DuplicateKeyVal<V> getValues( K key ) throws IOException,
> >> KeyNotFoundException
> >>
> >> so the users has no clue about the underlying data structure.
> >>
> >> Now, I'm not sure if it makes any sense to expose a getValues() method
> >> returning an array. May be an iterator is enough..
> >>
> >> --
> >> Regards,
> >> Cordialement,
> >> Emmanuel Lécharny
> >> www.iktek.com
> >>
>
>
> --
> Regards,
> Cordialement,
> Emmanuel Lécharny
> www.iktek.com
>
>


-- 
Kiran Ayyagari
http://keydap.com

Mime
View raw message