mesos-issues mailing list archives

From "Artem Harutyunyan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog
Date Fri, 08 Jul 2016 16:28:11 GMT

     [ https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Harutyunyan updated MESOS-4233:
-------------------------------------
    Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint
29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33,
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37, Mesosphere
Sprint 38, Mesosphere Sprint 39  (was: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere
Sprint 28, Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint
32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36,
Mesosphere Sprint 37, Mesosphere Sprint 38)

> Logging is too verbose for sysadmins / syslog
> ---------------------------------------------
>
>                 Key: MESOS-4233
>                 URL: https://issues.apache.org/jira/browse/MESOS-4233
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Cody Maloney
>            Assignee: Kapil Arya
>              Labels: mesosphere
>         Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 10 seconds
it will print tens of thousands of log lines, overwhelming syslog (there is a max rate at
which a process can send stuff over a unix socket) and not giving useful information to a
sysadmin who cares about just the high-level activity and when something goes wrong.
> Note mesos also blocks when writing to its log locations, so when writing a lot of log
messages, it can fill up the write buffer in the kernel and be suspended until the syslog agent
catches up reading from the socket (GLOG does a blocking fwrite to stderr). GLOG also has a big
mutex around logging, so only one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went from internal
component x to internal component y", from a sysadmin perspective I only care about the high
level actions taken (launched task for framework x, sent offer to framework y, got task failed
from host z). Note those are what I'd expect at the "INFO" level. At the "WARNING" level I'd
expect very little to be logged / almost nothing in normal operation. Just things like "WARN:
Replicated log write took longer than expected". WARN would also get things like backtraces
on crashes and abnormal exits / aborts.
> When trying to launch 3k+ tasks inside a second, mesos logging currently overwhelms syslog
with 100k+ messages, many of which are thousands of bytes. Sysadmins expect to be able to
use syslog to monitor basic events in their system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to be reduced
significantly (stderr gets picked up and forwarded to syslog / central aggregation).
> What I would like is to be able to set the stderr logging level to be different / independent
from the file logging level (syslog giving the "sysadmin" aggregated overview, files useful
for debugging in depth what happened in a cluster). A lot of what mesos currently logs at
info is really debugging info / should show up at the debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect are attached,
and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382644
 1315 master.cpp:3248] Launching task envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework
5178f46d-71d6-422f-922c-5bbe82dff9cc-0000 (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: I1215 22:58:29.382925
 1315 master.hpp:176] Adding task envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources
cpus(*):0.0001; mem(*):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines. Successful ones are part of normal
operation and maybe should be logged at info / debug levels, but not shown to a sysadmin (just
show when things fail, and maybe aggregate counters to indicate the volume of work)
>  - No log messages should be really big / more than 1k characters (that would prevent the
giant port list attached, and make such cases easily discoverable / bug filable / fixable)
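
The request above for a stderr logging level that is independent of the file logging level can be illustrated with a minimal sketch. It uses Python's standard `logging` module purely as an illustration; it is not Mesos or GLOG code, and the names `build_logger`, `stderr_sink` and `file_sink` are hypothetical. In-memory streams stand in for stderr and the log file so the result can be inspected.

```python
import io
import logging


def build_logger(name, stderr_stream, file_stream):
    """Attach two handlers with independent thresholds to one logger.

    Hypothetical helper: the "stderr" handler (what would reach syslog)
    only passes WARNING and above, while the "file" handler keeps INFO
    and above for in-depth debugging.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # let the handlers decide what to drop
    logger.propagate = False         # keep records out of the root logger

    stderr_handler = logging.StreamHandler(stderr_stream)
    stderr_handler.setLevel(logging.WARNING)

    file_handler = logging.StreamHandler(file_stream)
    file_handler.setLevel(logging.INFO)

    logger.addHandler(stderr_handler)
    logger.addHandler(file_handler)
    return logger


# In-memory stand-ins for stderr and a log file (assumptions for the sketch).
stderr_sink = io.StringIO()
file_sink = io.StringIO()

log = build_logger("sketch", stderr_sink, file_sink)
log.info("Launching task %s", "envy.1")
log.warning("Replicated log write took longer than expected")
```

With this setup the INFO line about the task launch reaches only the file sink, while the WARNING line reaches both, which is the split between "sysadmin overview" and "debugging detail" that the description asks for.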



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
