httpd-dev mailing list archives

From Daniel Lescohier <daniel.lescoh...@cbsi.com>
Subject Re: thread-safety of time conversion caches in util_time.c and mod_log_config.c
Date Tue, 08 Jan 2013 22:29:31 GMT
Some timing tests:

> gcc -I/usr/include/apr-1 -lrt -lapr-1 -lpthread test.c
> ./a.out
apr_time_exp_lt: 108 nanosecs per call
localtime_r: 85 nanosecs per call
apr_time_exp_gmt: 59 nanosecs per call
gmtime_r: 35 nanosecs per call
pthread_mutex_trylock/unlock/read cache: 19 nanosecs per call
lfence/read cache: 8 nanosecs per call
read cache: 4 nanosecs per call
> cat test.c
#include <time.h>
#include <stdio.h>
#include <assert.h>
#include <apr_time.h>
#include <pthread.h>


int main (int argc, char *argv[]) {
    struct timespec old, new;
    struct tm tmv;
    int i, iterations=10000000;
    long nanosecs;
#define CALC_NS(new, old) \
    ((new.tv_sec - old.tv_sec)*1000000000 + new.tv_nsec - old.tv_nsec)
    apr_time_t t = apr_time_now();
    apr_time_exp_t xt;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    unsigned int seconds=1;
    typedef struct { char s[44]; unsigned int key;} data_t;
    static volatile data_t cache_data;
    data_t my_copy;

    cache_data.key = 1;

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
    for (i=0; i < iterations; ++i) {
        apr_time_exp_lt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_lt: %d nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
    for (i=0; i < iterations; ++i) {
        localtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("localtime_r: %d nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
    for (i=0; i < iterations; ++i) {
        apr_time_exp_gmt(&xt, t);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("apr_time_exp_gmt: %d nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
    for (i=0; i < iterations; ++i) {
        gmtime_r(&old.tv_sec, &tmv);
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("gmtime_r: %d nanosecs per call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
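    /* Cache read validated by re-reading the key under pthread_mutex_trylock */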
    for (i=0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds==key && key!=0) {
            my_copy = cache_data;
            if (pthread_mutex_trylock(&mutex)==0) {
                key = cache_data.key;
                pthread_mutex_unlock(&mutex);
            }
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("pthread_mutex_trylock/unlock/read cache: %d nanosecs per
call\n", nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
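    /* Cache read validated by re-reading the key after an lfence (load barrier) */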
    for (i=0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds==key && key!=0) {
            my_copy = cache_data;
            asm volatile("lfence":::"memory");
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("lfence/read cache: %d nanosecs per call\n",
nanosecs/iterations);

    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &old)==0);
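    /* Cache read with a plain re-read of the key and no barrier: the lower bound */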
    for (i=0; i < iterations; ++i) {
        int key = cache_data.key;
        if (seconds==key && key!=0) {
            my_copy = cache_data;
            key = cache_data.key;
        }
    }
    assert(clock_gettime(CLOCK_MONOTONIC_RAW, &new)==0);
    nanosecs = CALC_NS(new, old);
    printf("read cache: %d nanosecs per call\n", nanosecs/iterations);
    return 0;
}


On Tue, Jan 8, 2013 at 1:50 PM, Daniel Lescohier
<daniel.lescohier@cbsi.com> wrote:

> It looks like apr_atomic is not in good-enough shape to be used for
> lock-free algorithms.  It's probably good enough for the use-cases that the
> event mpm uses it for, because the event mpm is not using it for protecting
> memory besides the counters it's atomically changing.  However, the time
> caches would use the cache key to protect other memory (the cache value),
> so apr_atomic is not good enough, because it's inconsistent on the
> different platforms with respect to memory barriers.
>
> For example, on x86 the add32 and cas32 functions act as a full memory
> barrier on all memory.  On powerpc, however, add32 and cas32 only act as a
> barrier on the individual item being changed; they don't act as a memory
> barrier on all memory.  One would have to add lightweight sync assembler
> instructions to the powerpc add32 and cas32 implementations to get the
> same semantics as x86.
>
> The problem is that the apr_atomic API is not explicit about memory-ordering
> semantics.  I agree with you that in the future the apr_atomic API should be
> changed to be more like the C11 atomic API, which allows an explicit choice
> of memory ordering.
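[As an aside, here's roughly what explicit ordering looks like with C11
<stdatomic.h>.  This is only an illustration of the API shape; the
counter/flag names are made up, and apr_atomic has nothing equivalent today.

#include <stdatomic.h>

static atomic_uint hits;    /* plain statistics counter */
static atomic_uint ready;   /* publication flag */
static int shared_value;    /* data published under the flag */

/* A pure counter needs no ordering guarantees at all. */
static void count_hit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

/* Publishing data requires a release store on the writer side... */
static void publish(int v)
{
    shared_value = v;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* ...paired with an acquire load on the reader side. */
static int consume(int *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        *out = shared_value;  /* sees the value stored before the release */
        return 1;
    }
    return 0;
}
]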
>
> But that leaves us with the problem of the current time caching code,
> which is not thread-safe.  We should fix it using the existing APR APIs,
> and not use the apr_atomic API which is broken for this use-case.
>
> I found that there is a lock-free method available to us with the current
> APR API: apr_thread_mutex_trylock.  In glibc, that just attempts to cmpxchg
> the userspace futex; if that fails, it doesn't make a futex system call or
> put the caller to sleep, so it's lock-free.  On Windows,
> TryEnterCriticalSection probably works the same way.
>
> So, the pseudo-code would be something like:
>
> For reading from cache:
>
>     apr_uint32_t key = cache_element->key;
>     /* Above is done speculatively */
>
>     if (seconds == key && seconds != 0) {
>         /* seconds == 0 may mean the cache is uninitialized,
>          * so don't use the cache */
>         *xt = cache_element->xt;
>         /* After copying xt, make sure cache_element was not marked invalid
>          * by another thread beginning an update, and that cache_element
>          * really contained data for our second. */
>         if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS)
>         {
>             key = cache_element->key;
>             apr_thread_mutex_unlock(cache_element->mutex);
>             if (key == seconds) {
>                 xt->tm_usec = (int)apr_time_usec(t);
>                 return APR_SUCCESS;
>             }
>         }
>     }
>     /* calculate the value if reached here */
>
> Write to cache code:
>
>         if (apr_thread_mutex_trylock(cache_element->mutex) == APR_SUCCESS)
>         {  /* We won the race to update this cache_element. */
>             cache_element->key = ~seconds;
>             /* Above marks cache_element as invalid by using ~seconds,
>              * for the speculative reads not using the mutex. */
>             cache_element->xt = *xt;
>             /* Finished copying, now update key with our key */
>             cache_element->key = seconds;
>             apr_thread_mutex_unlock(cache_element->mutex);
>         }
>
> The above is lock-free because scheduling decisions by the OS can't cause
> the current thread to block; it keeps doing work, without blocking, for as
> long as it is scheduled.  It would be fast, though perhaps not quite as fast
> as if we had an atomic API we could use effectively.  With an atomic API,
> the read-from-time-cache code would only be reading from cache lines, not
> writing to them with the futex's cmpxchg instruction, and you might also be
> able to use a lighter-weight memory barrier.  But the mutex approach is
> lock-free and fast, it works with today's APIs, and the existing code in
> production is not correct.
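[To make that concrete: a minimal sketch of that lighter-weight read path
using C11 <stdatomic.h>.  The cache_element layout and struct exploded are
hypothetical stand-ins for the real util_time.c types; this is essentially
the lfence case from my timings above, expressed portably.

#include <stdatomic.h>

struct exploded { int tm_sec, tm_min, tm_hour, tm_mday, tm_mon, tm_year, tm_usec; };

struct cache_element {
    atomic_uint key;        /* second this entry describes; ~seconds while updating */
    struct exploded xt;
};

/* Read path: two loads of the key bracket the copy.  A racing writer flips
 * the key to ~seconds before touching xt (as in the pseudo-code above), so
 * the re-check rejects any torn copy.  The reader performs no stores, so it
 * only touches the cache line in shared mode. */
static int cache_read(struct cache_element *e, unsigned int seconds,
                      struct exploded *out)
{
    unsigned int key = atomic_load_explicit(&e->key, memory_order_acquire);
    if (key != seconds || key == 0)
        return 0;                              /* miss or uninitialized */
    *out = e->xt;                              /* speculative copy */
    atomic_thread_fence(memory_order_acquire); /* order the copy before the re-check */
    key = atomic_load_explicit(&e->key, memory_order_relaxed);
    return key == seconds;                     /* valid only if the key was stable */
}
]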
>
> BTW, I looked at rwlock_trylock, and that isn't lock-free in glibc, because
> the readers/writers count variables are protected by a "low-level lock"
> futex that is acquired with lll_lock, not lll_trylock.  So if a thread
> acquires the low-level lock but is descheduled by the OS before it finishes
> updating the readers/writers counts and releases the low-level lock, other
> threads are blocked from acquiring the low-level lock and therefore from
> accessing those counts.  If it had used lll_trylock instead, I think it
> would have been lock-free.  In any case, I don't think an rwlock would give
> us any particular advantage over a regular mutex for the above algorithm.
>
> So using the simpler mutex is better and is lock-free.  I think the
> important part is to have a lock-free algorithm: if we lose a little bit of
> performance by not being able to use the lightest lock-free algorithm, I
> think that is OK.
>
>
> On Mon, Jan 7, 2013 at 5:50 PM, Stefan Fritsch <sf@sfritsch.de> wrote:
>
>> On Sunday 06 January 2013, Daniel Lescohier wrote:
>> > I'm not sure we should include memory barriers inside the
>> > apr_atomic_read32 and apr_atomic_set32 functions, because you
>> > don't necessarily need a full memory barrier on each read or set.
>> > Instead, perhaps we should add three functions to APR:
>> >
>> >    1. apr_atomic_rmb(): read memory barrier.
>> >    2. apr_atomic_wmb(): write memory barrier.
>> >    3. apr_atomic_mb(): full memory barrier.
>>
>> Introducing new functions into APR may not be the perfect solution for
>> this problem, because there will be some reluctance to bump the minimum
>> APR version required for httpd, especially for 2.2.x.  Also, writing such
>> functions for all supported architectures and compilers is quite some
>> effort.
>>
>> In any case, if we introduce new atomic and/or barrier functions, I
>> would be in favor of something that aligns with a subset of the C11
>> atomic API.
>>
>> An alternative that would work without changes to APR would be per-thread
>> caches for cached_explode().  This would at least be no less efficient
>> than the current code is with mpm-prefork.  Or one could do that only if
>> compiled with an APR that does not have the new barrier functions...
>>
>
>
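
[Regarding the per-thread cache alternative mentioned above, a minimal
sketch assuming C11 _Thread_local; cached_explode_gmt() is a made-up name
standing in for the cached_explode() path in util_time.c, and the cost is
one cached second plus one apr_time_exp_t per thread.

#include <apr_time.h>

/* One cache entry per thread, so no synchronization is needed at all.
 * Older compilers would use __thread or an apr_threadkey_t instead. */
static _Thread_local apr_time_t cached_sec;
static _Thread_local apr_time_exp_t cached_xt;

static apr_status_t cached_explode_gmt(apr_time_exp_t *xt, apr_time_t t)
{
    apr_time_t sec = apr_time_sec(t);

    if (sec != cached_sec || cached_sec == 0) {   /* miss (0 = never filled) */
        apr_status_t rv = apr_time_exp_gmt(&cached_xt, t);
        if (rv != APR_SUCCESS)
            return rv;
        cached_sec = sec;
    }
    *xt = cached_xt;
    xt->tm_usec = (int)apr_time_usec(t);
    return APR_SUCCESS;
}
]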
