httpd-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Lescohier <daniel.lescoh...@cbsi.com>
Subject Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
Date Tue, 08 Jan 2013 18:50:14 GMT
It looks like apr_atomic is not in good-enough shape to be used for
lock-free algorithms.  It's probably good enough for the use-cases that the
event mpm uses it for, because the event mpm is not using it for protecting
memory besides the counters it's atomically changing.  However, the time
caches would use the cache key to protect other memory (the cache value),
so apr_atomic is not good enough, because it's inconsistent on the
different platforms with respect to memory barriers.

For example, with x86, the add32 and cas32 functions act as a full memory
barrier on all memory.  However, for powerpc, the add32 and cas32 only acts
as a memory barrier on the individual item being changed, it doesn't act as
a memory barrier on all memory.  One would have to add lightweight sync
assembler instructions to the add32 and cas32 function implementations for
powerpc if you want the same semantics as x86.

The problem is that the apr_atomic API is not explicit with the memory
ordering semantics.  I agree with you that in the future the apr_atomic API
should be changed to be more like the C11 atomic API, which allows explicit
choice of memory ordering.

But that leaves us with the problem of the current time caching code, which
is not thread-safe.  We should fix it using the existing APR APIs, and not
use the apr_atomic API which is broken for this use-case.

I found that there is a lock-free method available to us with the current
APR API.  We can use apr_thread_mutex_trylock.  In glibc, that just
attempts to cmpxchg the userspace futex, and if it fails, it doesn't do a
futex system call and a wait on the caller, so it's lock free.  On Windows,
TryEnterCriticalSection probably works the same way.

So, the pseudo-code would be something like:

For reading from cache:

    apr_uint32_t key = cache_element->key;
    /* Above is done speculatively */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean cache is uninitialized, so don't use cache
*/
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second. */
        if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS) {
            key = cache_element->key;
            apr_thread_mutex_unlock(cache_element->mutex);
            if (key == seconds) {
                xt->tm_usec = (int)apr_time_usec(t);
                return APR_SUCCESS;
            }
        }
    }
    /* calculate the value if reached here */

Write to cache code:

        if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS)
        {  /* We won the race to update this cache_element. */
            cache_element->key = ~seconds;
            /* Above marks cache_element as invalid by using ~seconds,
             * for the speculative reads not using the mutex. */
            cache_element->xt = *xt;
            /* Finished copying, now update key with our key */
            cache_element->key = seconds;
            apr_thread_mutex_unlock(cache_element->mutex);
        }

The above is lock-free, because process scheduling decisions by the OS
won't cause the current thread to block, it'll keep on doing work without
blocking as long as this thread is scheduled.  It would be fast, though
perhaps not quite as fast than if we had an atomic api we could use
effectively.  With an atomic API, the read-from-time-cache code would only
be reading from cache-lines, not writing to cache-lines with the futex's
cmpxchg instruction.  You might also be able to use a lighter-weight memory
barrier.  But the mutex is lock-free and fast, and it's something that'd
work with today's APIs; and the existing code in production is not correct.

BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc, because
the readers/writers count variables are protected with a "low-level lock"
futex that is acquired with lll_lock, not lll_trylock.  So, if a thread
acquires the low-level lock, but is unscheduled by the OS before it
finishes updating the reader/writers count variables and unlocks the
"low-level lock", that would block other threads from acquiring the
low-level lock, and other threads would block before being able to access
the readers/writers count variables.  If it had used lll_trylock instead, I
think it would've been lock-free.  However, I don't think rwlock would give
us any particular advantage than a regular mutex for the above algorithm.

So using the simpler mutex is better and is lock-free.  I think the
important part is to have a lock-free algorithm: if we lose a little bit of
performance by not being able to use the lightest lock-free algorithm, I
think that is OK.

On Mon, Jan 7, 2013 at 5:50 PM, Stefan Fritsch <sf@sfritsch.de> wrote:

> On Sunday 06 January 2013, Daniel Lescohier wrote:
> > I'm not sure we should include memory barriers inside the
> > apr_atomic_read32 and apr_atomic_set32 functions, because you
> > don't necessarily need a full memory barrier on each read or set.
> > Instead, perhaps we should add three functions to APR:
> >
> >    1. apr_atomic_rmb(): read memory barrier.
> >    2. apr_atomic_wmb(): write memory barrier.
> >    3. apr_atomic_mb(): full memory barrier.
>
> Introducing new functions into APR may not be the perfect solution for
> this problems because there will be some reluctance to bump the
> minimum apr version required for httpd, especially for 2.2.x. Also,
> writing such functions for all supported architectures and compilers
> is quite some effort.
>
> In any case, if we introduce new atomic and/or barrier functions, I
> would be in favor of something that aligns with a subset of the C11
> atomic API.
>
> An alternative that would work without changes to apr would be per-
> thread caches for cached_explode(). This would at least not be less
> efficient than the current code is with mpm-prefork. Or one could do
> that only if compiled with an apr that does not have the new barrier
> functions...
>

Mime
View raw message