directory-dev mailing list archives

From Alex Karasulu <akaras...@apache.org>
Subject Re: Add perf issues
Date Mon, 10 May 2010 08:43:59 GMT
On Mon, May 10, 2010 at 11:31 AM, Emmanuel Lécharny <elecharny@apache.org> wrote:

> On 5/10/10 10:03 AM, Alex Karasulu wrote:
>
>>
>> I'm starting to look into the switch-over to using secondary BTrees for
>> duplicates after the threshold is reached, to make sure it is working
>> properly.  I remember having a hard time writing a test case for this
>> switch-over before, because it is an implementation detail internal to
>> the index implementation and is not noticeable to the outside world
>> (callers).
>>
>>
> What we should also do is determine the best value for this numDuplicate
> threshold. It will depend on many factors, and we may want to be able to
> run a quick test on a target computer to set it to the best value.
>
> For instance, I can imagine that the value will be different if the value
> is a Long, a String or a RDN. Right now, the default value is 512, and I
> don't think it fits all cases.
>
> It would be very convenient to have a CLI tool that you can run before
> starting the server; run on a specific computer/FS, it would give you the
> correct configuration. It will also depend on the underlying partition
> (Jdbm, Oracle, etc.)
>
> Just thinking out loud here ...
>
>
That's an interesting idea with respect to this command line tool. Note,
however, that the best value depends not only on the computer or the file
system but mostly on the data set. Consider a scenario where the number of
duplicates is large for every key but never quite reaches the threshold.
This causes more memory usage, more serialization (CPU utilization) and
greater I/O overhead, so here it might be beneficial to lower the
threshold and avoid these frequent, expensive operations.
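
To make the cost concrete, here is a minimal sketch of the kind of
dual-representation entry we are talking about (hypothetical names, plain
java.util stand-ins for the jdbm BTrees, not the actual partition code).
The expensive path is re-serializing the whole inline set on every add
while the count sits under the threshold:

import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch - not the real jdbm index code.  Duplicates for
// a key live in an inline sorted set until their count exceeds the
// threshold, then they are moved into a sub-BTree (modeled here by a
// TreeMap).  Callers never see which representation is in use.
class DupsEntry<V extends Comparable<V>>
{
    static final int DEFAULT_THRESHOLD = 512; // the current default

    private final int threshold;
    private TreeSet<V> inline = new TreeSet<V>(); // rewritten whole on each update
    private TreeMap<V, V> subTree;                // stand-in for a secondary BTree

    DupsEntry( int threshold )
    {
        this.threshold = threshold;
    }

    boolean usesSubTree()
    {
        return subTree != null;
    }

    void add( V value )
    {
        if ( subTree != null )
        {
            subTree.put( value, value ); // cheap: only one node changes
            return;
        }

        inline.add( value );

        if ( inline.size() > threshold )
        {
            // The switch-over: invisible to callers, which is exactly
            // why it was hard to write a test case for it.
            subTree = new TreeMap<V, V>();

            for ( V v : inline )
            {
                subTree.put( v, v );
            }

            inline = null;
        }
        // else: in the real store the whole inline set is re-serialized
        // to disk here - the repeated cost when counts hover just under
        // the threshold.
    }
}

A side benefit of factoring it out like this is that the switch-over can
be asserted in a unit test directly (usesSubTree()), instead of through
the index's public interface.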

On the other hand, consider a data set where, out of a massive number of
keys, only a few are heavily loaded with duplicates. In that case a larger
threshold is more beneficial.
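
A back-of-the-envelope illustration (made-up numbers, just to show the
shape of the trade-off): with the threshold at 512, a key that slowly
accumulates 500 duplicates rewrites its inline set 500 times, serializing
roughly 250 values on average each pass, on the order of 125,000 value
serializations for that single key. Drop the threshold to 64 and the same
key pays about 2,000 serializations before switching to a sub-BTree. If
every key behaves like that, the lower threshold clearly wins; if only a
handful out of millions do, the lower threshold drags every moderately
loaded key into a sub-BTree as well, and the larger threshold wins.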

The point is that we will never know in advance. In the future, however,
an adaptive algorithm could be used to adjust this factor to the
circumstances presented by the data set, and that might be our best
approach.
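
Purely hypothetical sketch of what I mean (invented names and heuristics,
nothing that exists in the code today): track how often inline duplicate
sets are rewritten while sitting close to the threshold, and periodically
nudge the threshold up or down.

// Hypothetical sketch - invented names and heuristics, not an actual
// ApacheDS API.
class AdaptiveDupsThreshold
{
    private static final int MIN = 64;
    private static final int MAX = 8192;

    private int threshold = 512;   // start from the current default
    private long nearMissRewrites; // rewrites at more than 3/4 of threshold
    private long opsSinceAdjust;

    void onInlineRewrite( int dupCount )
    {
        opsSinceAdjust++;

        if ( dupCount * 4 > threshold * 3 )
        {
            nearMissRewrites++;
        }
    }

    // Called periodically, e.g. every few thousand index operations.
    int adjust()
    {
        if ( nearMissRewrites * 10 > opsSinceAdjust )
        {
            // Many keys hover just under the threshold: lower it so
            // they switch to cheap sub-BTree inserts sooner.
            threshold = Math.max( MIN, threshold / 2 );
        }
        else if ( nearMissRewrites == 0 )
        {
            // Nothing is close to switching: raise the threshold to
            // spare moderately loaded keys the sub-BTree overhead.
            threshold = Math.min( MAX, threshold * 2 );
        }

        nearMissRewrites = 0;
        opsSinceAdjust = 0;
        return threshold;
    }
}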

Regards,
-- 
Alex Karasulu
My Blog :: http://www.jroller.com/akarasulu/
Apache Directory Server :: http://directory.apache.org
Apache MINA :: http://mina.apache.org
To set up a meeting with me: http://tungle.me/AlexKarasulu
