incubator-couchdb-dev mailing list archives

From "Adam Kocoloski (JIRA)" <j...@apache.org>
Subject [jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers
Date Fri, 14 May 2010 01:27:42 GMT

     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Kocoloski updated COUCHDB-761:
-----------------------------------

    Priority: Blocker  (was: Major)

Thanks for filing this ticket, Randall.  I'm bumping it to Blocker.

I'd discourage fully async logging.  I've tried it in the past; it's far too easy to overwhelm
the error_logger process with debug messages.  Eventually the error_logger mailbox exhausts
the available memory and the VM dies a horrible death.

Infinite timeouts are a viable option in my opinion.  Another option is to spawn a function
to log the message:

Pros:
- doesn't block the original process

Cons:
- spends extra CPU cycles copying data to new process heap
- potential to exhaust process limit

Personally, I don't think it's worth the risk.  Here's what I'd propose:

1) Reimplement debug_on(), info_on() to use ets table lookups.  This is pretty easy because
the log level is already stored in couch_config.

2) If the log level is enabled, use an infinite timeout to log the message.

This way we can suppress the LOG_DEBUG messages without slowing down request processing by
more than a few µs, we fix the crashes implicated in this ticket, and we keep the error_logger
mailbox small.
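
To make the two steps concrete, here is a minimal sketch of the idea. It is not the couch_log source: the module name log_sketch, the integer level values, and the use of a plain gen_server as the log sink are all assumptions made for illustration. The level lives in a named ets table owned by the server, so debug_on()/info_on() never send a message, and the only call that can block uses an infinite timeout.

```erlang
-module(log_sketch).
-behaviour(gen_server).

%% Hypothetical module; names and level numbers are illustrative only.
-export([start_link/1, debug_on/0, info_on/0, debug/2, info/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(LEVEL_DEBUG, 1).
-define(LEVEL_INFO, 2).

start_link(Level) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Level, []).

%% Step 1: level checks are ets lookups, with no message passing involved.
debug_on() -> get_level_integer() =< ?LEVEL_DEBUG.
info_on() -> get_level_integer() =< ?LEVEL_INFO.

debug(Format, Args) -> log(debug_on(), Format, Args).
info(Format, Args) -> log(info_on(), Format, Args).

%% Step 2: only enabled levels reach the server, and the call never
%% times out, so no reply can arrive after the caller has given up.
log(true, Format, Args) ->
    gen_server:call(?MODULE, {log, Format, Args}, infinity);
log(false, _Format, _Args) ->
    ok.

%% Raises badarg if the server (and therefore the table) is not running.
get_level_integer() ->
    [{level, Level}] = ets:lookup(?MODULE, level),
    Level.

init(Level) ->
    %% protected: any process may read, only the owner may write.
    _ = ets:new(?MODULE, [named_table, set, protected]),
    true = ets:insert(?MODULE, {level, Level}),
    {ok, nostate}.

handle_call({log, Format, Args}, _From, State) ->
    io:format(Format ++ "~n", Args),
    {reply, ok, State}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Msg, State) -> {noreply, State}.
```

With the server started as log_sketch:start_link(2), a call to log_sketch:debug("x ~p", [1]) returns ok after a single ets read and sends nothing, while log_sketch:info("x ~p", [1]) blocks until the server has written the line. Because the call is synchronous, a slow logger applies backpressure to callers instead of letting its mailbox grow.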

> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
>                 Key: COUCHDB-761
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-761
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11
>            Reporter: Randall Leeds
>            Priority: Blocker
>             Fix For: 0.10.3, 0.11.1, 1.0
>
>
> Several users have reported seeing crash reports stemming from a function_clause match
> on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply
> to a gen_server:call that timed out, with the #Ref being the tag on the response. After it
> came up again today in IRC, kocolosk quickly discovered that the problem appears to be in
> couch_log.erl.
> The logging macros (?LOG_*) call couch_log/*_on which calls get_level_integer/0. When
> this call times out the timeout is eaten and a late reply arrives to the calling process later,
> triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.
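
The failure mode described in the ticket can be reproduced in isolation. The sketch below is an illustration under assumptions, not the couch_log code: it uses a hand-rolled call/3 with the same {Ref, Reply} reply shape instead of gen_server itself, and it assumes the timeout is "eaten" by a catch around the call. All names (late_reply_sketch, call/3, slow_server/0) are hypothetical.

```erlang
-module(late_reply_sketch).
-export([demo/0]).

%% Hand-rolled request/reply with a tagged reply, in the style of a
%% timed call. Exits with 'timeout' if no reply arrives in time.
call(Pid, Request, Timeout) ->
    Ref = make_ref(),
    Pid ! {call, {self(), Ref}, Request},
    receive
        {Ref, Reply} -> Reply
    after Timeout ->
        exit(timeout)
    end.

%% A server that answers one request, but only after 200 ms.
slow_server() ->
    receive
        {call, {From, Ref}, get_level_integer} ->
            timer:sleep(200),
            From ! {Ref, 2}
    end.

demo() ->
    Server = spawn(fun slow_server/0),
    %% The catch swallows the timeout exit, so the caller carries on
    %% as if nothing happened.
    {'EXIT', timeout} = (catch call(Server, get_level_integer, 50)),
    %% The reply is still sent, and lands in the caller's mailbox as a
    %% bare {Ref, Integer} tuple. In a gen_server this message would be
    %% passed to handle_info/2.
    receive
        {Ref, Level} when is_reference(Ref) ->
            {late_reply, Level}
    after 1000 ->
        no_late_reply
    end.
```

Here demo() returns {late_reply, 2}: the caller has already moved on when the tagged reply shows up. A gen_server whose handle_info/2 has no clause for such a tuple would fail with function_clause, which matches the crash reports above.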

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

