directory-dev mailing list archives

From Alex Karasulu <aok...@bellsouth.net>
Subject Ready for GA? (Was Re: [ApacheDS] [Performance] Using indices to boost search performance)
Date Mon, 11 Sep 2006 18:06:13 GMT
I think that with this last optimization for greater capacity, the 
server is ready to handle serious datasets and can be made generally 
available.

Thoughts?

Alex


Alex Karasulu wrote:
> Hi all,
> 
> 
> As promised, here are some more statistics as we fill a single 
> partition up to 2 million entries.
> 
> 
> ---------------
> 750,000 Entries
> ---------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        7.26
> (2)    yes      yes        5.04
> (3)    no       no         565 (9m25s)
> 
> #1-#2 = 2.22
> 
> search results = 1039 entries
> 
> NOTE:
> -----
> I started changing settings a little here on.  I did this because the 
> number of entries returned with cache was way more than what we had 
> allocated.  I changed the memory settings and cache setting from the 
> defaults.  Here's what I did:
> 
>  o upped memory from 384 MB to 1024 MB
>  o increased the initials index cache from 100 to 5000
>  o increased the entry cache from 100 to 5000
>  o increased the system index caches (except the alias indices) from 
>    100 to 1000
> 
> I turned things up a bit so I could load the database as fast as 
> possible, since we're now adding 250K entries per run.
> 
> Let's do the 750K again:
> 
> ---------------
> 750,000 Entries
> ---------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        8.22
> (2)    yes      yes        2.61
> (3)    no       no         stopped
> 
> #1-#2 = 5.61
> 
> search results = 1039 entries
> 
> 
> -----------------
> 1,000,000 Entries
> -----------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        10.22
> (2)    yes      yes         3.59
> 
> #1-#2 = 6.63
> search results = 1373 entries
> 
> 
> 
> -----------------
> 1,250,000 Entries
> -----------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        12.71
> (2)    yes      yes         4.50
> 
> #1-#2 = 8.21
> search results = 1738 entries
> 
> 
> -----------------
> 1,500,000 Entries
> -----------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        14.74
> (2)    yes      yes        11.84
> 
> 
> #1-#2 = 2.9
> search results = 2123 entries
> 
> Now let's search this 1.5 million entry partition by limiting the number 
> of returned entries to 1000.
> 
> no cache: 8.10
> w/ cache: 3.04
> 
> #1-#2 = 5.06
> 
> Very similar to the values we got for 0.75 million entries at 1039 
> entries returned.
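> 
> (For the curious: the 1000 entry cap is just a count limit on the 
> client's search controls.  With JNDI, which I use here purely for 
> illustration, it's a one-liner; 'ctx' below is assumed to be an 
> already-bound DirContext.)
> 
>     // Cap the search at 1000 returned entries (illustrative snippet).
>     SearchControls controls = new SearchControls();
>     controls.setSearchScope( SearchControls.SUBTREE_SCOPE );
>     controls.setCountLimit( 1000 ); // stop after 1000 entries
>     NamingEnumeration<SearchResult> results = ctx.search(
>         "ou=users,dc=example,dc=com", "(initials=aa)", controls );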
> 
> For the last value I'll load the partition up to 2 million entries and 
> run the same tests again:
> 
> -----------------
> 2,000,000 Entries
> -----------------
> 
>     [cached] [indexed] [time (seconds)]
> 
> (1)    no       yes        18.58
> (2)    yes      yes        15.38
> 
> 
> #1-#2 = 3.2
> search results = 2851 entries
> 
> Now let's search this 2 million entry partition by limiting the number 
> of returned entries to 1000.
> 
> no cache:   8.04
> w/ cache:   2.55
> #1-#2 = 5.49
> 
> Wow, performance for returning 1000 entries is not diminishing as we 
> scale from 1 million to 2 million entries, for both cached and 
> non-cached returns.
> 
> Looking at these results, I think we can easily accommodate 10 million 
> entries per partition, perhaps with a capacity limit on the order of 
> 100 million entries per partition.  Note that you can have as many 
> partitions as you like in a single server.
> 
> Alex
> 
> 
> Alex Karasulu wrote:
>> Hi again,
>>
>> I've completed my optimization on indices under high partition 
>> capacities.  The new results are inline, side by side with the old 
>> results.  If you're not interested in the exact optimization that 
>> took place, skip down to the results.
>>
>>
>> The problem:
>> ------------
>>
>> Up to now, duplicate keys (keys with many values) were stored using a 
>> TreeSet within the B+Trees of indices.  An index usually maps some 
>> attribute value to an entry's ID in the master table.  The ID is a 
>> sequentially assigned, unique value for the entry.
>>
>> So if 100 people have initials AA, there would be 100 values in the 
>> TreeSet for the key 'AA' in the initials index.  This set would be 
>> serialized and deserialized to and from disk.  At these low numbers 
>> there's really very little impact.  However, certain indices were 
>> seriously impacted, like the objectClass index or the hierarchy-based 
>> system index.  Just imagine having 1 million inetOrgPerson entries: 
>> the objectClass index would then have 1 million values in the TreeSet 
>> for the key 'inetOrgPerson'.  This seriously hurts performance from 
>> both a space and a time perspective, and with large numbers of 
>> entries it essentially negates the benefit of using B+Trees in the 
>> first place.
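>>
>> To make the shape of the problem concrete, here's a simplified 
>> in-memory picture using plain java.util types (the real code lives in 
>> the JDBM-backed index, so treat this purely as an illustration):
>>
>>     // imports: java.util.TreeMap, java.util.TreeSet
>>     // Each key maps to a TreeSet of entry IDs; the whole set is
>>     // serialized and deserialized every time the key is touched.
>>     TreeMap<String, TreeSet<Long>> objectClassIndex =
>>         new TreeMap<String, TreeSet<Long>>();
>>
>>     // With 1 million inetOrgPerson entries, ONE key holds 1 million IDs.
>>     TreeSet<Long> ids = new TreeSet<Long>();
>>     for ( long id = 1; id <= 1000000; id++ )
>>     {
>>         ids.add( id );
>>     }
>>     objectClassIndex.put( "inetOrgPerson", ids );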
>>
>>
>> Solution:
>> ---------
>>
>> What I did was add a configurable threshold parameter: once a key 
>> accumulates more duplicate values than the threshold, instead of 
>> using a TreeSet to store the duplicates, another B+Tree is created 
>> and used within the same index database file.  The default value, 
>> which these tests were conducted with, is 1000.  So after 1000 
>> duplicate values, the server switches from an in-memory TreeSet to an 
>> on-disk B+Tree for storing that key's values.
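>>
>> In rough pseudo-Java the switch looks like this.  TreeMap stands in 
>> for the on-disk B+Tree, and all the names are illustrative rather 
>> than the actual JDBM partition code:
>>
>>     // All names illustrative; TreeMap stands in for the on-disk B+Tree.
>>     static final int DUP_THRESHOLD = 1000; // the configurable threshold
>>
>>     // key -> TreeSet<Long> while duplicates are few, or a
>>     // TreeMap<Long, Object> acting as the per-key B+Tree once they
>>     // cross the threshold.
>>     static TreeMap<String, Object> index = new TreeMap<String, Object>();
>>
>>     static void add( String key, Long id )
>>     {
>>         Object values = index.get( key );
>>         if ( values == null )
>>         {
>>             values = new TreeSet<Long>();
>>             index.put( key, values );
>>         }
>>
>>         if ( values instanceof TreeSet )
>>         {
>>             TreeSet<Long> set = ( TreeSet<Long> ) values;
>>             set.add( id );
>>
>>             if ( set.size() > DUP_THRESHOLD )
>>             {
>>                 // Migrate: this key's values now live in their own
>>                 // B+Tree inside the same index database file.
>>                 TreeMap<Long, Object> btree = new TreeMap<Long, Object>();
>>                 for ( Long v : set )
>>                 {
>>                     btree.put( v, null );
>>                 }
>>                 index.put( key, btree );
>>             }
>>         }
>>         else
>>         {
>>             ( ( TreeMap<Long, Object> ) values ).put( id, null );
>>         }
>>     }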
>>
>>
>> Alex Karasulu wrote:
>>> Hi all,
>>>
>>> I've been testing the search performance boost gained from indexing
>>> attributes before starting development on an optimization to improve
>>> index performance and in memory size.  I thought I'd share these
>>> dramatic results of my pre-optimization tests with you since they
>>> clearly show the benefits of indices.
>>>
>>> Before I do this let me list the characteristics of the hardware used
>>> and my configuration settings:
>>>
>>>
>>> -------------
>>> Machine Setup
>>> -------------
>>>
>>> CPU: Dual Athlon MP 1900
>>> OS: Linux 2.6.15
>>> Mem: 2 GB, 266 MHz
>>>
>>>
>>> --------------
>>> ApacheDS Setup
>>> --------------
>>>
>>> ApacheDS: stock RC4 (pre-release image of the upcoming RC4), with 
>>>   modifications:
>>>   - Using 1024 MB memory
>>>   - Indexed st and initials
>>>
>>>
>>> ----------
>>> Data Setup
>>> ----------
>>>
>>> I wrote a simple tool to generate random values for decent-sized 
>>> entries.  The data looks something like this for a user entry:
>>>
>>>     dn: uid=user.1,ou=users,dc=example,dc=com
>>>     uid: user.1
>>>     initials: yq
>>>     description: cFocJATNuhlXisDCqGtY
>>>     pager: FYyimqyZRW
>>>     cn: HSGMzajYKmicUTe
>>>     postalcode: WiXXA
>>>     st: xy
>>>     street: kpCCqmrsCzkpdtHXWMfY
>>>     l: KqmAXFYTrI
>>>     objectclass: person
>>>     objectclass: organizationalPerson
>>>     objectclass: inetOrgPerson
>>>     sn: nuymgOwpm
>>>     homephone: PERamkCtsv
>>>     mobile: vkIviOGNTC
>>>     telephonenumber: 7248889026
>>>     mail: pYvEoOjSnEymcWD
>>>     givenname: IVHJZB
>>>     postaladdress: crObexKoUTIFdzNHcZMr
>>>     employeenumber: 1
>>>     userpassword:: cGFzc3dvcmQ=
>>>
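>>> The generator itself is nothing fancy; something along these lines 
>>> produces the random values above (a minimal sketch, not the actual 
>>> tool):
>>>
>>>     import java.util.Random;
>>>
>>>     public class RandomValues
>>>     {
>>>         private static final Random RAND = new Random();
>>>
>>>         static String randomString( int length, boolean mixedCase )
>>>         {
>>>             StringBuilder buf = new StringBuilder( length );
>>>             for ( int i = 0; i < length; i++ )
>>>             {
>>>                 char c = ( char ) ( 'a' + RAND.nextInt( 26 ) );
>>>                 if ( mixedCase && RAND.nextBoolean() )
>>>                 {
>>>                     c = Character.toUpperCase( c );
>>>                 }
>>>                 buf.append( c );
>>>             }
>>>             return buf.toString();
>>>         }
>>>
>>>         public static void main( String[] args )
>>>         {
>>>             // e.g. initials: two random lowercase letters
>>>             System.out.println( "initials: " + randomString( 2, false ) );
>>>             System.out.println( "cn: " + randomString( 15, true ) );
>>>         }
>>>     }
>>>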
>>> I started loading a partition with these entries, 100,000 at a time, 
>>> then performed the following searches for all entries with initials 
>>> 'aa':
>>>
>>> (1) index on initials but no cached entries
>>> (2) index on initials with cached entries
>>> (3) no index without cached entries
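>>>
>>> A client along these lines reproduces the timed searches.  My actual 
>>> harness differs, but the shape is the same; 10389, 
>>> uid=admin,ou=system and 'secret' are the stock ApacheDS defaults:
>>>
>>>     import java.util.Hashtable;
>>>
>>>     import javax.naming.Context;
>>>     import javax.naming.NamingEnumeration;
>>>     import javax.naming.directory.DirContext;
>>>     import javax.naming.directory.InitialDirContext;
>>>     import javax.naming.directory.SearchControls;
>>>     import javax.naming.directory.SearchResult;
>>>
>>>     public class InitialsSearch
>>>     {
>>>         public static void main( String[] args ) throws Exception
>>>         {
>>>             Hashtable<String, String> env = new Hashtable<String, String>();
>>>             env.put( Context.INITIAL_CONTEXT_FACTORY,
>>>                 "com.sun.jndi.ldap.LdapCtxFactory" );
>>>             env.put( Context.PROVIDER_URL, "ldap://localhost:10389" );
>>>             env.put( Context.SECURITY_PRINCIPAL, "uid=admin,ou=system" );
>>>             env.put( Context.SECURITY_CREDENTIALS, "secret" );
>>>
>>>             DirContext ctx = new InitialDirContext( env );
>>>             SearchControls controls = new SearchControls();
>>>             controls.setSearchScope( SearchControls.SUBTREE_SCOPE );
>>>
>>>             long start = System.currentTimeMillis();
>>>             NamingEnumeration<SearchResult> results = ctx.search(
>>>                 "ou=users,dc=example,dc=com", "(initials=aa)", controls );
>>>
>>>             int count = 0;
>>>             while ( results.hasMore() )
>>>             {
>>>                 results.next();
>>>                 count++;
>>>             }
>>>
>>>             System.out.println( count + " entries in "
>>>                 + ( System.currentTimeMillis() - start ) / 1000.0 + " s" );
>>>         }
>>>     }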
>>>
>>> Here are the results at the various capacities:
>>>
>>> ---------------
>>> 100,000 Entries
>>> ---------------
>>>
>>>     [cached] [indexed] [time (seconds)]
>>>
>>> (1)    no       yes        3.30
>>                              2.37
>>> (2)    yes      yes        0.72
>>                              0.76
>>> (3)    no       no         30.63
>>                              42.08
>>
>>> search results = 153 entries
>>                    146
>>
>> Notice that if #2 is the total time with cached entries, we can 
>> conclude that #1 - #2 is a decent measure of the cost of pulling from 
>> disk.  So let's see what that number looks like:
>>
>> Old: 2.58
>> New: 1.61
>>
>>>
>>>
>>> ---------------
>>> 200,000 Entries
>>> ---------------
>>>
>>>     [cached] [indexed] [time (seconds)]
>>>
>>> (1)    no       yes        6.04
>>                              3.29
>>> (2)    yes      yes        1.44
>>                              1.49
>>> (3)    no       no         82
>>                              83
>>>
>>> search results = 302 entries
>>                    297
>> #1 - #2 values:
>>
>> Old: 4.60
>> New: 1.80
>>
>>>
>>>
>>> ---------------
>>> 300,000 Entries
>>> ---------------
>>>
>>>     [cached] [indexed] [time (seconds)]
>>>
>>> (1)    no       yes        7.54
>>                              4.11
>>> (2)    yes      yes        1.95
>>                              2.22
>>> (3)    no       no         146
>>                              128
>>>
>>> search results = 451 entries
>>                    435
>>
>> #1 - #2 values:
>>
>> Old: 5.59
>> New: 1.89
>>
>>>
>>>
>>> ---------------
>>> 400,000 Entries
>>> ---------------
>>>
>>>     [cached] [indexed] [time (seconds)]
>>>
>>> (1)    no       yes        9.24
>>                              4.87
>>> (2)    yes      yes        3.80
>>                              2.69
>>> (3)    no       no         196
>>                              168
>>>
>>> search results = 586 entries
>>                    577
>>
>> #1 - #2 values:
>>
>> Old: 5.44
>> New: 2.18
>>
>>>
>>>
>>> ---------------
>>> 500,000 Entries
>>> ---------------
>>>
>>>     [cached] [indexed] [time (seconds)]
>>>
>>> (1)    no       yes        11.96
>>                               5.06
>>> (2)    yes      yes        3.21
>>                              3.21
>>> (3)    no       no         224
>>                              209
>>>
>>> search results = 748 entries
>>                    706
>>
>> #1 - #2 values:
>>
>> Old: 8.75
>> New: 1.85
>>
>>
>> Besides these numbers, I've noticed that adds are now faster and 
>> total garbage collection is lower.  These are eyeballed figures: for 
>> example, add times had grown to 28 minutes per 100K entries for the 
>> entries between 400K and 500K; that has dropped down to 18 minutes.
>>
>> Now it would be interesting to see what happens to these trends once 
>> the initials index converts from a TreeSet to a B+Tree for the values 
>> of the key 'aa'.  So, just to see, we'll keep loading up the 
>> partition until there are over 1000 entries with initials 'aa'.
>>
>> I'll report on this tomorrow as the server loads these entries.
>>
>> I guess the conclusion here is that the optimization had a 
>> significant effect and is worth the additional complexity it 
>> introduced into the JDBM partition implementation.  Here are the 
>> access times (#1 - #2) as a function of the number of entries:
>>
>>
>> Retrieval Times (s) | 100K | 200K | 300K | 400K | 500K
>> ------------------------------------------------------
>> Before:             | 2.58 | 4.60 | 5.59 | 5.44 | 8.75
>> After:              | 1.61 | 1.80 | 1.89 | 2.18 | 1.85
>>
>>
>> Before the optimization there seems to be linear growth in access 
>> time, whereas after the optimization the growth rate is almost 
>> unmeasurable above the noise.
>>
>> -Alex
>>
>>
> 
> 

