On Mon, May 10, 2010 at 11:31 AM, Emmanuel Lécharny <email@example.com> wrote:
What we should also do is determine the best value for this numDuplicate setting. It will depend on many factors, and we may want to be able to run a quick test on a target computer to set it to the best value.
On 5/10/10 10:03 AM, Alex Karasulu wrote:
I'm starting to look into the switchover to using secondary BTrees for
duplicates after the threshold is reached, to make sure it is working
properly. I remember having a hard time writing a test case for this
before, because the switchover is an implementation detail internal to
the index implementation and is not observable from the outside world
(callers).
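In other words, something along these lines: a minimal sketch with made-up names (DupsSketch and numDupLimit here are illustrative, and a plain TreeSet stands in for the secondary BTree). The real index code is more involved, but the shape of the switchover is the same:

// Small duplicate sets stay inline; once the count crosses the limit,
// they migrate to a secondary structure (a real BTree in practice).
// All names here are illustrative, not the actual server classes.
import java.util.TreeSet;

class DupsSketch<V extends Comparable<V>>
{
    private final int numDupLimit;                  // e.g. 512 by default
    private TreeSet<V> inline = new TreeSet<V>();   // small dup sets live here
    private TreeSet<V> secondary;                   // stand-in for a secondary BTree

    DupsSketch( int numDupLimit ) { this.numDupLimit = numDupLimit; }

    void add( V value )
    {
        if ( secondary != null )                    // already switched over
        {
            secondary.add( value );
            return;
        }
        inline.add( value );
        if ( inline.size() > numDupLimit )          // threshold crossed: migrate
        {
            secondary = new TreeSet<V>( inline );
            inline = null;
        }
    }

    // The switchover is invisible to callers, which is what makes it hard
    // to test; an accessor like this is one way a test could observe it.
    boolean usesSecondaryBTree() { return secondary != null; }
}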
For instance, I can imagine that the value will be different if the indexed value is a Long, a String or an RDN. Right now the default value is 512, and I don't think it fits all cases.
It would be very convenient to have a command-line tool that you can run before starting the server and that, on a specific computer/FS, will give you the correct configuration. It will also depend on the underlying partition (Jdbm, Oracle, etc.).
Just thinking out loud here ...
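A rough sketch of what such a tool could look like, purely as an illustration: it reuses the hypothetical DupsSketch class from the sketch above as a stand-in for the real partition, so its timings would only become meaningful once the loop exercises the actual JDBM tables on the target machine:

// Hypothetical tuning CLI: time a synthetic duplicate-heavy load at
// several candidate limits and report the cost of each. Reuses the
// DupsSketch class sketched earlier; not real server code.
public class TuneNumDupLimit
{
    public static void main( String[] args )
    {
        int keys = 1000, dupsPerKey = 800;          // synthetic data-set shape
        for ( int limit : new int[] { 64, 128, 256, 512, 1024 } )
        {
            long start = System.nanoTime();
            for ( int k = 0; k < keys; k++ )
            {
                DupsSketch<Long> index = new DupsSketch<Long>( limit );
                for ( long v = 0; v < dupsPerKey; v++ )
                {
                    index.add( v );
                }
            }
            long ms = ( System.nanoTime() - start ) / 1000000;
            System.out.println( "numDupLimit=" + limit + " -> " + ms + " ms" );
        }
    }
}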
That's an interesting idea with respect to this command line tool. Note, however, that the best value depends not only on the computer or the file system but mostly on the data set. Consider a scenario where the number of duplicates is large for every key, yet just under the threshold: nothing switches over, so each update keeps rewriting big inline sets, causing more memory usage, more serialization (CPU utilization) and greater IO overhead. In this case it might be beneficial to lower the threshold to reduce these frequent expensive operations.
On the other hand, consider a data set where, out of a massive number of keys, only a few are heavily loaded with duplicates. In this case a larger threshold is more beneficial.
The point is that we can never know this in advance. However, in the future an adaptive algorithm could adjust this factor based on the circumstances presented by the data set, and that might be our best approach.
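Purely as an illustration of that direction (nothing here is a concrete design, and all names are made up), such a policy could periodically sample per-key duplicate counts and move the threshold toward a high percentile:

// Illustrative adaptive policy: track duplicate counts per key and
// periodically set the limit near the 90th percentile, so only the
// genuinely heavy keys pay the secondary-BTree migration cost.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class AdaptiveDupLimit
{
    private int limit = 512;                        // current threshold
    private final List<Integer> samples = new ArrayList<Integer>();

    void observe( int dupCountForKey )
    {
        samples.add( dupCountForKey );
        if ( samples.size() >= 10000 )              // recompute periodically
        {
            Collections.sort( samples );
            int p90 = samples.get( ( int ) ( samples.size() * 0.9 ) );
            limit = Math.max( 64, p90 );            // clamp to a sane floor
            samples.clear();
        }
    }

    int currentLimit() { return limit; }
}

The percentile choice is where the trade-off between the two scenarios above would be encoded: a lower percentile migrates more keys to secondary BTrees, a higher one keeps more of them inline.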