Let's get back to what we need, from the BTree POV... We have a BTree storing Values associated with Keys. A given Key can be assoicated to one or many values, but most of the time, we will have only one value, and if we have many values, we will not have that many. But occasionally, we may have thousands of values associated with a single key. What happens if we have thousands of values for a given key ? It's become very costly to manipulate the values, and as we have to order the values to be able to find out if a value is already present in a decent delay (ie, O(log(n) if the values are ordered, compared to O(n) if they are not). For this very reason, we would like to use a decent way to store the values, and this is why we do use a sub-btree (BTree). In ApacheDS and JDBM, we have a threshold number which is used to know when to create a sub-btree in this case : for instance, if we have more than 500 values, we will use a sub-BTree, otherwise we use an array of values. The reason is that up to a point, it's faster to manipulate an array in memory than a BTree (the tiping poing though is not obvious : depending on the value's size, this threshold number will vary). So we want to store : - a single value most of the case - an array of values when we have more than 1 value, but below N values (N to be evaulated) - a sub-BTree when we have more than N values. Now, we can simplify this scheme in two ways : - we can ditch the single value, and use an array instead, and switch to a sub-BTree at some point - we can decide not to use an array at all, and use a sub-BTree when we ave more than one value - we can also always use a sub-btree when we have an index allowing muliple values. So far, we decided to not go with option 3, as it puts a huge strain on the BTree performances (the very first version was doing exactly taht, and using something different improved the performances by 15%). ATM, we are experimenting option 2, but we may later go for option 1. That beng said, the BTree operations which manipulate the values are the following : - getValue(K) - getValues(K) - insert(K, V) - contains(K, V) - delete(K, V) As we can see, we have two methods which return values, one whichs return one single value, and the other which returns all the values. The question here is why do we have two methods? That forces the users to know when to use one or the other, ie the users must know if the BTree allows multiple values. The suggestion would be to make the getValues() method to return an iterator. That's fine, but what about the semantic of the getValue() method, when we have more than one value stored ? Should it returns the very first value ? IMO, I think we should have the getValues() to return an iterator, and the getValue() method to returns either the single value or the first value if there are more than one. wdyt ? Le 9/16/13 12:10 PM, Pierre-Arnaud Marcelot a écrit : > I would go with a simple Iterator. > > Regards, > Pierre-Arnaud > > On 16 sept. 2013, at 11:51, Emmanuel Lécharny wrote: > >> Le 9/16/13 11:41 AM, Kiran Ayyagari a écrit : >>> On Mon, Sep 16, 2013 at 3:00 PM, Emmanuel Lécharny wrote: >>> >>>> Le 9/14/13 9:45 AM, Kiran Ayyagari a écrit : >>>>> On Sat, Sep 14, 2013 at 1:00 PM, Emmanuel Lécharny >>>> wrote: >>>>> >>>>>> Hi guys, >>>>>> >>>>>> >>>>>> /** >>>>>> * @return The array of stored values. >>>>>> */ >>>>>> V[] getValues(); >>>>>> >>>>>> shouldn't this be returning a BTree? cause we don't support an >>>> array >>>>> as a holder of >>>>> multiple values >>>> The fact that it's stored as a BTree is implementation dependent, I'm >>>> not sure the user of the API should know about it. What the user wants >>>> is to get back the stored values, and this is convenient to get back an >>>> array. >>>> >>>> IMO, it is not convenient, we cannot copy all the values into an array >>> and return, and returning an iterator is not going to help either, both will >>> severely affect the search performance >>> If needed, we can wrap it an immutable structure and return the BTree >>> to prevent direct updates, but I wish to differ this until we see the impact >>> on partition implementation. >> Except that from the BTree interface, the one the users see, you have : >> >> public V get( K key ) throws IOException, KeyNotFoundException >> public DuplicateKeyVal getValues( K key ) throws IOException, >> KeyNotFoundException >> >> so the users has no clue about the underlying data structure. >> >> Now, I'm not sure if it makes any sense to expose a getValues() method >> returning an array. May be an iterator is enough.. >> >> -- >> Regards, >> Cordialement, >> Emmanuel Lécharny >> www.iktek.com >> -- Regards, Cordialement, Emmanuel Lécharny www.iktek.com