couchdb-dev mailing list archives

From "Adam Kocoloski (JIRA)" <>
Subject [jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers
Date Fri, 14 May 2010 01:27:42 GMT


Adam Kocoloski updated COUCHDB-761:

    Priority: Blocker  (was: Major)

Thanks for filing this ticket, Randall.  I'm bumping it to Blocker.

I'd discourage fully async logging.  I've tried it in the past; it's far too easy to overwhelm
the error_logger process with debug messages.  Eventually the error_logger mailbox exhausts
the available memory and the VM dies a horrible death.

Infinite timeouts are a viable option in my opinion.  Another option is to spawn a function
to log the message:

Pro:
- doesn't block the original process

Cons:
- spends extra CPU cycles copying data to the new process heap
- potential to exhaust the process limit

Personally, I don't think it's worth the risk.  Here's what I'd propose:

1) Reimplement debug_on(), info_on() to use ets table lookups.  This is pretty easy because
the log level is already stored in couch_config.

2) If the log level is enabled, use an infinite timeout to log the message.

This way we can suppress the LOG_DEBUG messages without slowing down request processing by
more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger
mailbox small.
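
A minimal sketch of the two-step proposal above, not the actual couch_log code. The module name, table name, logger process name and message shape are all hypothetical, and the level is written into the table directly rather than read from couch_config. Only ets:new/2, ets:insert/2, ets:lookup/2 and gen_server:call/3 are real library calls.

```erlang
-module(log_level_sketch).
-export([init/1, debug_on/0, info_on/0, log/3]).

%% Hypothetical names, chosen for illustration only.
-define(TABLE, log_level_sketch).
-define(LOGGER, log_sketch_server).
-define(LEVEL_DEBUG, 1).
-define(LEVEL_INFO, 2).
-define(LEVEL_ERROR, 3).

%% Called once by the process that owns the table. Publishes the
%% configured level so that any process can read it without a call.
init(Level) ->
    _ = ets:new(?TABLE, [named_table, protected, set]),
    true = ets:insert(?TABLE, {level, level_integer(Level)}),
    ok.

%% Step 1: the level checks are plain ets lookups, so they cannot time out
%% and never leave a late reply in the caller's mailbox.
debug_on() -> level() =< ?LEVEL_DEBUG.
info_on() -> level() =< ?LEVEL_INFO.

level() ->
    [{level, Current}] = ets:lookup(?TABLE, level),
    Current.

%% Step 2: only enabled levels reach the logger process, and that call
%% uses an infinite timeout. Disabled levels return immediately.
log(Level, Format, Args) ->
    case level() =< level_integer(Level) of
        true ->
            gen_server:call(?LOGGER, {log, Level, Format, Args}, infinity);
        false ->
            ok
    end.

level_integer(debug) -> ?LEVEL_DEBUG;
level_integer(info) -> ?LEVEL_INFO;
level_integer(error) -> ?LEVEL_ERROR.
```

With the level set to info, log(debug, ...) returns ok without ever sending a message to the logger, which is what keeps its mailbox small.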

> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>                 Key: COUCHDB-761
>                 URL:
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11
>            Reporter: Randall Leeds
>            Priority: Blocker
>             Fix For: 0.10.3, 0.11.1, 1.0
> Several users have reported seeing crash reports stemming from a function_clause match
> on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply
> to a gen_server:call that timed out, with the #Ref being the tag on the response. After it
> came up again today in IRC, kocolosk quickly discovered that the problem appears to be in
> The logging macros (?LOG_*)  call couch_log/*_on which calls get_level_integer/0. When
> this call times out the timeout is eaten and a late reply arrives to the calling process later,
> triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
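
To illustrate the failure mode described in the ticket, here is a rough sketch that mimics a tagged call/reply exchange by hand instead of using gen_server, so that its behaviour does not depend on the OTP release. Every name in it is hypothetical. It assumes the server needs about 200 ms to answer while the caller waits only 50 ms and swallows the timeout.

```erlang
-module(late_reply_sketch).
-export([demo/0]).

%% Hypothetical slow server: answers a single request with {Ref, Reply}
%% after a delay that is longer than the caller's timeout.
slow_server() ->
    receive
        {call, From, Ref, get_level} ->
            timer:sleep(200),
            From ! {Ref, 2}
    end.

%% Hand-rolled call whose timeout is swallowed and replaced by a default,
%% so the caller never learns that the request is still outstanding.
call_eating_timeout(Server, Timeout) ->
    Ref = make_ref(),
    Server ! {call, self(), Ref, get_level},
    receive
        {Ref, Level} -> Level
    after Timeout ->
        0
    end.

demo() ->
    Server = spawn(fun slow_server/0),
    Default = call_eating_timeout(Server, 50),
    %% The reply still arrives afterwards as a stray {Ref, Integer}
    %% message. A gen_server whose handle_info/2 has no clause for that
    %% shape would crash with function_clause here.
    Late = receive
               {Ref, Int} when is_reference(Ref), is_integer(Int) ->
                   {late_reply, Int}
           after 1000 ->
               none
           end,
    {Default, Late}.
```

Under those timing assumptions demo() returns {0, {late_reply, 2}}: the caller proceeds with the default value, and the stray message turns up in its mailbox afterwards.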

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
