I sent an Email a while back asking about a problem with the nested mutexes and cond vars and got back a reply but I've now stumbled accross a case that bites unnested mutexes.

If the caller of apr_cond_notify() (or anyone else for that matter) explicitly locks the mutex while the caller to await is asleep the APR mutex is trashed because the apr owner field has been trashed.

This is legal in pthreads (and in fact using nested mutexes is) and is a common idiom to wakeup a waiting process but guarentee it doesn't run until some later point when you're ready.

A simple case that gets you lossage is:
"signaler"

apr_thread_mutex_lock(mymutex)
apr_thread_cond_signal(mymutex)
/* Do other stuff you want to happen before client runs */
apr_thread_mutex_unlock(mymutex)

"waiter"

apr_thread_mutex_lock(mymutex)
/* Do stuff that has to happen safely before I sleep but after I get the mutex */
apr_thread_cond_wait(mycond, mymutex)
apr_thread_mutex_unlock() <== Fails because mutex->owner is NULL

This sort of thing is most interesting where it's a nested mutex inside library code that may or may not be called with the mutex held - you can have a utility function that wakes up the another thread but maybe the caller also held the mutex to do other bits under its protection.

The key issue is that pthread_cond_signal/pthread_cond_await/timedwait guarentee that the thread atomically sleeps and unlocks the mutex and then atomically wakes up and reacquires the mutex in the same state it left it.  APR doesn't do this - the simple call through to the pthread_cond_xx functions breaks the atomicity/integrity of the mutex/cond_var pair and requires some massaging to emulate.

The general issue is that when the waiter is "asleep" the state of the APR 1/2 of the mutex indicates that the mutex is held by the awaiting thread when in fact the mutex is not held by anyone.  This is general issue where you share a mutex between a condition variable and protecting "other" external bits.
--
Mark

I do not believe nested mutexes and condition variables are
compatible.

Do you really need nested mutexes?

-aaron


On Thursday, August 21, 2003, at 02:16  PM, Mark Gooderum wrote:

I've discovered what I think it a bug in the pthreads based cond_wait() code when used with nested mutexes.



I've discovered what I think it a bug in the pthreads based cond_wait() code when used with nested mutexes.

The "proper" (per my experience) way to use a condition variable in the pthreads world is a squence like:

Waiter            Notifier
----------        -------------
mutex_lock
cond_wait
(pthreads atomically surrenders lock and waits)
                   mutex_lock
                   cond_signal
                   mutex_unlock
(pthread atomically requires the mutex and wakes up)
mutex_unlock

But if the waiter code is for instance in a library and is called with the nested mutex already locked the lock/cond_wait/unlock cycle always clears the mutex.  So a cycle like:

Waiter            Notifier
----------        -------------
mutex_lock
mutex_lock
cond_wait
(pthreads atomically surrenders lock and waits)
                   mutex_lock
                   cond_signal
                   mutex_unlock
(pthread atomically requires the mutex and wakes up)
mutex_unlock
mutex_unlock


is broken - Details follow - what happens is the mutex_lock/unlock cycle in the notifier trashes the APR portions of the mutex.
APR_DECLARE(apr_status_t) apr_thread_cond_wait(apr_thread_cond_t *cond,
                                               apr_thread_mutex_t *mutex)
{
    apr_status_t rv;

    rv = pthread_cond_wait(cond->cond, &mutex->mutex);
#ifdef PTHREAD_SETS_ERRNO
    if (rv) {
        rv = errno;
    }
#endif
}
The owner and owner_ref fields of the APR mutex structure are left alone.  The caller does a mutex_lock which works and destroys the owner/owner_ref fields:
APR_DECLARE(apr_status_t) apr_thread_mutex_lock(apr_thread_mutex_t *mutex)
{
    apr_status_t rv;

    if (mutex->nested) {
---->This doesn't hit because owner is the other waiting thread:
        if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {  
            mutex->owner_ref++;
            return APR_SUCCESS;
        }
--->So we succeed in "stealing" the mutex
        rv = pthread_mutex_lock(&mutex->mutex);
        if (rv) {
#ifdef PTHREAD_SETS_ERRNO
            rv = errno;
#endif
            return rv;
        }

        mutex->owner = apr_os_thread_current();
        mutex->owner_ref = 1;
        return rv;
    }
    else {
        rv = pthread_mutex_lock(&mutex->mutex);
#ifdef PTHREAD_SETS_ERRNO
        if (rv) {
            rv = errno;
        }
#endif
        return rv;
    }
}
When the notifier than calls mutex_unlock() the pthread mutex is released and the mutex owner/owner_ref fields zeroed out.

So now the waiter thread gets the pthread mutex inside pthread_cond_wait() and wakes up.  The call returns to the caller.  The caller then trys to release the mutex - this call succeeds because we always release the mutex.  But the mutex is now fully released  - so if the caller is counting on holding the mutex with nesting he's hosed.
APR_DECLARE(apr_status_t) apr_thread_mutex_unlock(apr_thread_mutex_t *mutex)
{
    apr_status_t status;

    if (mutex->nested) {
        ---->This fails so it just releases the pthread_mutex (incorrectly if my original nest count was > 1)
        if (apr_os_thread_equal(mutex->owner, apr_os_thread_current())) {
            mutex->owner_ref--;
            if (mutex->owner_ref > 0)
                return APR_SUCCESS;
        }
        status = pthread_mutex_unlock(&mutex->mutex);
        if (status) {
#ifdef PTHREAD_SETS_ERRNO
            status = errno;
#endif
            return status;
        }

        memset(&mutex->owner, 0, sizeof mutex->owner);
        mutex->owner_ref = 0;
        return status;
    }
    else {
        status = pthread_mutex_unlock(&mutex->mutex);
#ifdef PTHREAD_SETS_ERRNO
        if (status) {
            status = errno;
        }
#endif
        return status;
    }
}
The fix seems to be to save and restore the owner/owner_ref fields around the pthread_cond_wait() call.  The same thing applies to the timed version of the wait but I used the shorter version here for clarity.  P.S.. - I'm not subscribed to the dev list so please CC me on replies, thanks.
--
Mark Gooderum
Vernier Networks, Inc.
mark@verniernetworks.com