directory-dev mailing list archives

From Emmanuel Lécharny <elecha...@gmail.com>
Subject Re: JDBM + MVCC LRUCache concern, take 2
Date Thu, 26 Apr 2012 07:48:22 GMT
On 4/26/12 2:08 AM, Selcuk AYA wrote:
> On Wed, Apr 25, 2012 at 4:45 PM, Emmanuel Lécharny<elecharny@gmail.com>  wrote:
>> On 4/5/12 1:09 AM, Emmanuel Lécharny wrote:
>>> On 4/5/12 12:43 AM, Selcuk AYA wrote:
>>>> On Wed, Apr 4, 2012 at 3:22 PM, Emmanuel Lécharny<elecharny@gmail.com>
>>>>   wrote:
>>>>> It's systematic, and I guess that the fact we now pound the RdnIndex
>>>>> table way more often than before (just because we no longer call the
>>>>> OneLevelIndex) causes the cache to get filled and not released fast
>>>>> enough.
>>>> Do we hold a cursor open while this code gets stuck? I would think we
>>>> hold a cursor open and modify quite a few jdbm btree pages for
>>>> this kind of behavior to happen.
>>>
>>> I'll check that.
>>>>> As we don't set any size for the cache, its default size is 1024. For
>>>>> some of the tests, this might not be enough, as we load a lot of entries
>>>>> (typically the schema elements) plus many others that get added and
>>>>> removed while running tests in revert mode.
>>>>>
>>>>> If I increase the default size to 65536, the tests are passing.
>>>>>
>>>>> Ok, now, I have to admit I haven't - yet - looked at the LRUCache code,
>>>>> and my analysis is just based on what I saw by quickly looking at the
>>>>> code, the stack traces I have added and a few blind guesses.
>>>>> However, I think we have a serious issue here. As far as I can tell, the
>>>>> code itself is probably not responsible for this behaviour, but the way
>>>>> we use it is.
>>>>>
>>>>> Did I miss something? Is there anything we can do - except increase
>>>>> the cache size - to get the tests passing fine?
>>>>>
>>>>> I'm more concerned about what could occur in real life, when some users
>>>>> load the server up to a point where it just stops responding...
>>>> To avoid this issue, we can let the writers allocate more cache
>>>> pages (rather than keeping the cache size fixed) so that they do not
>>>> loop waiting for a replaceable cache page. However, I would again suggest
>>>> making sure we do not leave a cursor open by mistake. If we leave a
>>>> cursor open and keep allocating new cache pages for writes, we will have
>>>> other problems.
>>> Yeah, I can see how it may affect the tests. I'll definitely investigate
>>> this first, before going any further in another direction.
>>>
>>> ATM, I'm using an uncommitted version of JDBM where the default cache size
>>> has been changed.
>>>
>>> Thanks a lot Selcuk !
>>
>> So I still have the LRUCache size issue, after having removed the SubLevel
>> index. Once I increased the size to 1 << 16, tests are passing.
>>
>> The failing tests are the SearchAuthorizationIT class' tests.
>>
>> What happens is that when I add an entry, I update many elements in the
>> RdnIndex, as I have to modify the nbDescendant in all its parents. As those
>> tests are injecting a lot of entries, they do a lot of modifications in
>> the RdnIndex.
>>
>> I checked that all the cursors are correctly closed.
>>
>> Any clue ?
> Can you provide the following details:
> - on which jdbm table are you having the problem (rdn index, main table)?
> - approximately how many modifications are you doing on this table
> while you are holding a cursor open (even if the cursor is held open
> legally)? Knowing this number would help a lot.
I'll add some logs to get those numbers.

> - is this the same problem you had before, or did closing the cursors in
> the previous case solve your problem?
Yes, absolutely. But I know for sure that increasing the cache size 
solved the issue, and that closing the cursors also solved the issue 
(the two are not correlated: each one fixed the issue on its own). 
However, those fixes might well hide some other issue.

I have also conducted some tests where I do some concurrent searches and 
modifications, without any problem.

Keep in mind that the tests are quite specific, as we run them on a 
direct connection to the server (which is way more stressful for the 
server, as we can do 25K searches/s), and we do a lot of concurrent 
operations (as tests are run in parallel).
>
> This kind of problem can currently occur if a thread is holding a
> cursor on one table (not necessarily illegally) and the "same" thread
> is modifying the "same" table with many add/delete/update operations.
> I am wondering whether we have a use case like this now. If we have,
> then I can change the code to account for it.
I wonder if this is not what happens. The RdnIndex is used to store the 
following relationships :
forward index : ParentIdAndRdn -> entryID
reverse index : entryId -> ParentIdAndRdn

When we add or delete an entry, in order to use the rdnIndex to replace 
the subLevel index, we update all the ParentIdAndRdn elements up to the 
partition root to update the nbChildren/nbDescendant fields in each of 
them. That means we may have many entries in both tables being modified, 
for each addition.

In the tests I'm running, we typically add entries like :
ou=0, ou=0, ou=0, ou=tests, ou=system.

For every entry like this one, we will update 5 ParentIdAndRdn elements, 
doing a drop and an add (sadly, we can't simply replace the element, but 
that's another story).
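
To put a rough number on this, here is a small standalone sketch of the arithmetic. It is not code from the server: the class and method names are made up for illustration, the DN parsing is deliberately naive (it ignores escaped commas), and it assumes one drop plus one add per ParentIdAndRdn element in a single index table.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: not part of ApacheDS or JDBM.
public class RdnUpdateCount {
    /** Splits a DN into its RDNs. Naive: does not handle escaped commas. */
    static List<String> rdns(String dn) {
        return Arrays.asList(dn.trim().split("\\s*,\\s*"));
    }

    /** Assumes one drop plus one add per ParentIdAndRdn element touched. */
    static int modificationsPerAdd(String dn) {
        return rdns(dn).size() * 2;
    }

    public static void main(String[] args) {
        String dn = "ou=0, ou=0, ou=0, ou=tests, ou=system";
        System.out.println(rdns(dn).size() + " elements, "
                + modificationsPerAdd(dn) + " index modifications");
        // prints "5 elements, 10 index modifications"
    }
}
```

If both the forward and the reverse table are touched for each element, the figure doubles.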

As we revert the modifications at the end, we may have hundreds of 
modifications done. Something I don't get though is that the cursors are 
supposed to be carefully closed. I'll recheck that today.

I will also implement your suggestion (counting the cursor 
opening/closing to compare the result at the end).
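
Something along these lines is what I have in mind for the counting. It is only a sketch: CursorAudit and its methods are hypothetical names, and the calls to onOpen()/onClose() would have to be wired by hand into wherever the cursors are actually created and closed.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical audit helper, for illustration only.
public class CursorAudit {
    private final AtomicLong opened = new AtomicLong();
    private final AtomicLong closed = new AtomicLong();

    /** To be called wherever a cursor is opened. */
    public void onOpen() { opened.incrementAndGet(); }

    /** To be called wherever a cursor is closed. */
    public void onClose() { closed.incrementAndGet(); }

    /** Number of cursors opened but not yet closed. */
    public long leaked() { return opened.get() - closed.get(); }

    public static void main(String[] args) {
        CursorAudit audit = new CursorAudit();
        audit.onOpen();
        audit.onClose();
        audit.onOpen(); // this one is never closed
        System.out.println("cursors left open: " + audit.leaked());
        // prints "cursors left open: 1"
    }
}
```

A non-zero result at the end of a test run would point at a cursor that was left open somewhere.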

I'll investigate more today.

Still I have a question about this cache : if it's a cache, why do we 
block waiting for some slot to be available, instead of trashing the 
oldest entry in the cache ? I mean, caches are never supposed to block, 
either they provide the required element, or they fetch it from the slow 
storage, no ?
Otherwise, if this is used to manage some temporary elements, until they 
are flushed to disk, and if we can't discard those elements otherwise we 
lose some critical information, then it's not really a LRUCache, and we 
should find a better name... (just thinking out loud here, I'm not very 
familiar with all the concepts implemented in JDBM...)
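
For reference, this is the kind of cache I mean by "never blocks": a plain evicting LRU built on java.util.LinkedHashMap. It is only a sketch of the textbook behaviour, not a description of how the JDBM LRUCache works, and the EvictingLru name is made up. It silently drops the least recently used entry, which is exactly what would be wrong if the entries hold data that has not been flushed yet.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of a plain evicting LRU; NOT the JDBM LRUCache.
public class EvictingLru<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public EvictingLru(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order
        this.capacity = capacity;
    }

    /** Drops the least recently used entry once capacity is exceeded. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        EvictingLru<Integer, String> cache = new EvictingLru<>(2);
        cache.put(1, "a");
        cache.put(2, "b");
        cache.get(1);      // 1 becomes the most recently used
        cache.put(3, "c"); // evicts 2 without ever blocking
        System.out.println(cache.keySet()); // prints "[1, 3]"
    }
}
```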

Thanks Selcuk !

-- 
Regards,
Cordialement,
Emmanuel Lécharny
www.iktek.com

