incubator-couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Randall Leeds (JIRA)" <j...@apache.org>
Subject [jira] Updated: (COUCHDB-761) Timeouts in couch_log are masked, crashes callers
Date Sat, 22 May 2010 00:38:17 GMT

     [ https://issues.apache.org/jira/browse/COUCHDB-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Randall Leeds updated COUCHDB-761:
----------------------------------

    Attachment: improved-sync-logging-v2.patch

Here's the second version of the patch.

Changes:
1) I took Adam's advice and simplified the message format for messages sent by the ?LOG_*
macros.
2) I re-ordered the handle_event clauses. The first few handle the new log macro message format,
the next few handle messages coming from sasl
3) I took this opportunity to add a config direction log:include_sasl which defaults to true.
When include_sasl is false the couch_log process will ignore sasl messages. I thought about
making this dependent on log level somehow but decided I prefer to give people options and
have sensible defaults.

What this accomplishes:
1) synchronous logging - log messages are never missed since we use sync_notify now instead
of the notify call error_logger uses
2) log macros for log levels that are disabled should have less performance impact. I don't
know and haven't tested how an ets lookup compares to a gen_server call but at least responses
can be fast even when the log server is busy formatting a long message
3) eliminating the gen:call for couch_log:log_level_integer/0 means no more unexpected {#Ref<...>,
LogLevel} messages arriving after a timeout due to a slow or busy log server process
4) sasl logging can be disabled in production environments where the entire state of crashing
gen_servers is probably waaay TMI when we probably ?LOG_ERROR a more useful message anyway

Open for re-review, comments, questions.


> Timeouts in couch_log are masked, crashes callers
> -------------------------------------------------
>
>                 Key: COUCHDB-761
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-761
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Database Core
>    Affects Versions: 0.10.1, 0.10.2, 0.11
>            Reporter: Randall Leeds
>            Priority: Blocker
>             Fix For: 0.10.3, 0.11.1, 1.0
>
>         Attachments: improved-sync-logging-v2.patch, improved-sync-logging.patch
>
>
> Several users have reported seeing crash reports stemming from a function_clause match
on handle_info in various gen_servers. The offending message looks like {#Ref<>, <integer>}.
> After months of banter and sleuthing, I determined that the likely cause was a late reply
to a gen_server:call that timed out, with the #Ref being the tag on the response. After it
came up again today in IRC, kocolosk quickly discovered that the problem appears to be in
couch_log.erl.
> The logging macros (?LOG_*)  call couch_log/*_on which calls get_level_integer/0. When
this call times out the timeout is eaten and a late reply arrives to the calling process later,
triggering the crash.
> Suggestions on how to fix this welcome. Ideas so far are async logging or infinite timeout.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message