kafka-dev mailing list archives

From Guozhang Wang <wangg...@gmail.com>
Subject Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics
Date Tue, 03 Apr 2018 18:40:44 GMT
1. If we can indeed gather all the context information from the log4j
entries, I'd suggest we change to thread-level (I'm not sure whether that is
doable, so John's WIP PR, if he already has one, can help us decide).

2. We can consider adding the API in TopologyTestDriver for general testing
purposes; that being said, I think Matthias has a good point that this
alone should not be a driving motivation for us to keep this metric at
task-level if 1) is true.
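Point 1's trade-off (thread-level metric plus context-rich logs) is easier to weigh with a concrete log shape in mind. A hedged mock-up in Python — the field names are assumptions for illustration, not the wording in John's actual WIP PR:

```python
# Hedged sketch: one possible shape for the skipped-record WARN entry
# discussed above. The field names (reason, task, processor) are
# assumptions, not the final KIP wording.

def skipped_record_warning(reason, topic, partition, offset, task_id, processor):
    """Build a log message carrying enough context to narrow down a skip
    without needing task-level metrics."""
    return (
        f"Skipping record. reason=[{reason}] topic=[{topic}] "
        f"partition=[{partition}] offset=[{offset}] "
        f"task=[{task_id}] processor=[{processor}]"
    )

msg = skipped_record_warning(
    reason="deserialization-error",
    topic="input-topic", partition=3, offset=42,
    task_id="0_3", processor="KSTREAM-SOURCE-0000000000",
)
print(msg)
```

With an entry like this in the logs, the task and processor can be recovered from the log line even if the metric itself is only recorded per thread.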



Guozhang


On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <matthias@confluent.io>
wrote:

> Thanks Guozhang, that was my intent.
>
> @John: yes, we should not nail down the exact log message. It's just to
> point out the trade-off. If we can get the required information in the
> logs, we might not need task level metrics.
>
>
> -Matthias
>
> On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > I think Matthias' comment is that we can still record the metrics on the
> > thread-level, while having the WARN log entry include sufficient context
> > information so that users can still easily narrow down the investigation
> > scope.
> >
> >
> > Guozhang
> >
> > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <john@confluent.io> wrote:
> >
> >> I agree we should add as much information as is reasonable to the log.
> >> For example, see this WIP PR I started for this KIP:
> >>
> >> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
> >>
> >> and
> >>
> >> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
> >>
> >> I'm not sure if we should nail down the log messages in the KIP or in
> >> the PR discussion. What say you?
> >>
> >> Thanks,
> >> -John
> >>
> >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matthias@confluent.io>
> >> wrote:
> >>
> >>> Thanks for sharing your thoughts. As I mentioned originally, I am not
> >>> sure about the right log level either. Your arguments are convincing --
> >>> thus, I am fine with keeping WARN level.
> >>>
> >>> The task vs thread level argument is an interesting one. However, I am
> >>> wondering if we should add this information to the corresponding WARN
> >>> logs that we write anyway. For this case, we can also log the
> >>> corresponding operator (and other information like topic name, etc., if
> >>> needed). WDYT about this?
> >>>
> >>>
> >>> -Matthias
> >>>
> >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> >>>> Regarding logging: I'm inclined to keep logging at WARN level since
> >>>> skipped records are not expected in normal execution (for all reasons
> >>>> that we are aware of), and hence when errors happen users should be
> >>>> alerted by metrics and look into the log files; so to me, if it is
> >>>> really spamming the log files, that is also a good alert for users.
> >>>> Besides, for deserialization errors we already log at WARN level for
> >>>> this reason.
> >>>>
> >>>> Regarding the metrics levels: I was pondering on that as well. What
> >>>> made me think about and agree on task-level rather than thread-level is
> >>>> that, for some reasons like window retention, skips may possibly be
> >>>> happening on only a subset of input partitions; since tasks are
> >>>> correlated with partitions, the task-level metrics can help users
> >>>> narrow down the specific input data partitions.
> >>>>
> >>>>
> >>>> Guozhang
> >>>>
> >>>>
> >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <john@confluent.io>
> >>>> wrote:
> >>>>
> >>>>> Hi Matthias,
> >>>>>
> >>>>> No worries! Thanks for the reply.
> >>>>>
> >>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
> >>>>> write a quick test exercising the current behavior and discovered that
> >>>>> the metrics weren't available. It seemed like they should be, so I
> >>>>> tacked it on to this KIP. If you feel it's inappropriate, I can pull
> >>>>> it back out.
> >>>>>
> >>>>> 2) I was also concerned about that, but I figured it would come up in
> >>>>> discussion if I just went ahead and proposed it. And here we are!
> >>>>>
> >>>>> Here's my thought: maybe there are two classes of skips: "controlled"
> >>>>> and "uncontrolled", where "controlled" means, as an app author, I
> >>>>> deliberately filter out some events, and "uncontrolled" means that I
> >>>>> simply don't account for some feature of the data, and the framework
> >>>>> skips them (as opposed to crashing).
> >>>>>
> >>>>> In this breakdown, the skips I'm adding metrics for are all
> >>>>> uncontrolled skips (and we hope to measure all the uncontrolled
> >>>>> skips). Our skips are well documented, so it wouldn't be terrible to
> >>>>> have an application in which you know you expect to have tons of
> >>>>> uncontrolled skips, but it's not great either, since you may also have
> >>>>> some *unexpected* uncontrolled skips. It'll be difficult to notice,
> >>>>> since you're probably not alerting on the metric and filtering out the
> >>>>> logs (whatever their level).
> >>>>>
> >>>>> I'd recommend any app author, as an alternative, to convert all
> >>>>> expected skips to controlled ones, by updating the topology to filter
> >>>>> those records out.
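The "convert expected skips to controlled ones" recommendation amounts to filtering out, up front, the records the framework would otherwise drop (for example, null keys ahead of an aggregation). A hedged, Kafka-free sketch of that idea — it mirrors a Streams `stream.filter((k, v) -> k != null)` step, written in plain Python for illustration:

```python
# Hedged sketch of turning "uncontrolled" skips into "controlled" ones:
# instead of letting the framework silently drop records with null keys
# (an uncontrolled skip), the app filters them out deliberately, so the
# skip count is visible to the app itself.

def controlled_filter(records):
    """Keep only records an aggregation could process (non-null keys);
    return the kept records and the count of deliberate skips."""
    kept = [(k, v) for k, v in records if k is not None]
    skipped = len(records) - len(kept)
    return kept, skipped

records = [("a", 1), (None, 2), ("b", 3)]
kept, skipped = controlled_filter(records)
print(kept, skipped)
```

After this, any remaining skip reported by the framework really is unexpected, which is what makes the WARN log and metric worth alerting on.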
> >>>>>
> >>>>> Following from my recommendation, as a library author, I'm inclined
> >>>>> to mark those logs WARN, since in my opinion, they should be
> >>>>> concerning to the app authors. I'd definitely want to show, rather
> >>>>> than hide, them by default, so I would pick INFO at least.
> >>>>>
> >>>>> That said, logging is always a tricky issue for lower-level libraries
> >>>>> that run inside user code, since we don't have all the information we
> >>>>> need to make the right call.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On your last note, yeah, I got that impression from Guozhang as well.
> >>>>> Thanks for the clarification.
> >>>>>
> >>>>> -John
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matthias@confluent.io>
> >>>>> wrote:
> >>>>>
> >>>>>> John,
> >>>>>>
> >>>>>> sorry for my late reply and thanks for updating the KIP.
> >>>>>>
> >>>>>> I like your approach of "metrics are for monitoring, logs are for
> >>>>>> debugging" -- however:
> >>>>>>
> >>>>>> 1) I don't see a connection between this and the task-level metrics
> >>>>>> that you propose to get the metrics in `TopologyTestDriver`. I don't
> >>>>>> think people would monitor the `TopologyTestDriver`, and thus I am
> >>>>>> wondering why it is important to include the metrics there.
> >>>>>> Thread-level metrics might be easier to monitor though (i.e., fewer
> >>>>>> different metrics to monitor).
> >>>>>>
> >>>>>> 2) I am a little worried about WARN level logging and that it might
> >>>>>> be too chatty -- as you pointed out, it's about debugging, thus DEBUG
> >>>>>> level might be better. Not 100% sure about this to be honest. What is
> >>>>>> the general assumption about the frequency of skipped records? I
> >>>>>> could imagine cases for which skipped records are quite frequent and
> >>>>>> thus WARN level logs might "flood" the logs.
> >>>>>>
> >>>>>> One final remark:
> >>>>>>
> >>>>>>> More generally, I would like to establish a pattern in which we
> >>>>>>> could add new values for the "reason" tags without needing a KIP to
> >>>>>>> do so.
> >>>>>>
> >>>>>> From my understanding, this is not feasible. Changing metrics is
> >>>>>> always considered a public API change, and we need a KIP for any
> >>>>>> change. As we moved away from tagging, it doesn't matter for the KIP
> >>>>>> anymore -- just wanted to point it out.
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> >>>>>>> Allrighty! The KIP is updated.
> >>>>>>>
> >>>>>>> Thanks again, all, for the feedback.
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <john@confluent.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hey Guozhang and Bill,
> >>>>>>>>
> >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
> >>>>>>>> like to put it in the task instead of the thread so that it'll show
> >>>>>>>> up in the TopologyTestDriver metrics as well.
> >>>>>>>>
> >>>>>>>> I'm leaning toward keeping the scope where it is right now, but if
> >>>>>>>> others want to advocate for tossing in some more metrics, we can go
> >>>>>>>> that route.
> >>>>>>>>
> >>>>>>>> Thanks all,
> >>>>>>>> -John
> >>>>>>>>
> >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbejeck@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> >>>>>>>>>
> >>>>>>>>> I'm on the fence with providing single-level metrics, but I think
> >>>>>>>>> we'll have that discussion outside of this KIP.
> >>>>>>>>>
> >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >>>>>>>>>> per-task, or per-processor-node) with no "reason"
> >>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >>>>>>>>>> and reason of the skipped record
> >>>>>>>>>
> >>>>>>>>> I'm +1 on both of these suggestions.
> >>>>>>>>>
> >>>>>>>>> Finally, we have had requests in the past for some metrics around
> >>>>>>>>> when a persistent store removes an expired window. Would adding
> >>>>>>>>> that to our metrics stretch the scope of this KIP too much?
> >>>>>>>>>
> >>>>>>>>> Thanks again, and overall I'm +1 on this KIP.
> >>>>>>>>>
> >>>>>>>>> Bill
> >>>>>>>>>
> >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The proposal sounds good to me. About "maintain only one level
> >>>>>>>>>> of metrics", maybe we can discuss that separately from this KIP
> >>>>>>>>>> since that would be a larger scope of discussion. I agree that if
> >>>>>>>>>> we are going to maintain only one level of metrics it should be
> >>>>>>>>>> the lowest level, and we would let users do the roll-ups
> >>>>>>>>>> themselves, but I'm still not fully convinced that we should just
> >>>>>>>>>> provide single-level metrics, because 1) I think for different
> >>>>>>>>>> metrics people may be interested to investigate into different
> >>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
> >>>>>>>>>> lowest task-level metrics, while for process-rate / skip-rate
> >>>>>>>>>> they can be as low as processor-node metrics, and 2) user-side
> >>>>>>>>>> rolling up may not be very straightforward. But for 2), if
> >>>>>>>>>> someone can provide an efficient and easy implementation of that,
> >>>>>>>>>> I can be persuaded :)
> >>>>>>>>>>
> >>>>>>>>>> For now I'm thinking we can add the metric at the thread level;
> >>>>>>>>>> either finer grained ones with a "reason" tag plus an aggregated
> >>>>>>>>>> one without the tag, or just a single aggregated metric without
> >>>>>>>>>> the tag, looks good to me.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
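The user-side roll-up being debated here can be pictured as summing the finest-grained skip counts over their tags. A hedged sketch — the (thread, task) keying is an illustrative assumption, not Kafka's actual MetricName structure:

```python
# Hedged sketch of a user-side roll-up: given lowest-level (per-task)
# skipped-record counts, sum over the task tag to recover the
# thread-level view. A plain dict stands in for the metrics registry.
from collections import defaultdict

def roll_up_to_thread(task_metrics):
    """task_metrics: {(thread_id, task_id): skipped_count} ->
    {thread_id: total_skipped}"""
    totals = defaultdict(float)
    for (thread_id, _task_id), value in task_metrics.items():
        totals[thread_id] += value
    return dict(totals)

task_metrics = {
    ("thread-1", "0_0"): 3.0,
    ("thread-1", "0_1"): 2.0,
    ("thread-2", "1_0"): 5.0,
}
print(roll_up_to_thread(task_metrics))
```

The aggregation itself is trivial; the friction Guozhang alludes to is that users must discover the full tag space first, which is harder in tag-flattening reporters like JMX.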
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <john@confluent.io>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide
> >>>>>>>>>>> a list in the KIP. I was also thinking we'd better start a
> >>>>>>>>>>> documentation page with the metrics listed.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
> >>>>>>>>>>> the metric front, really I'm just pushing for us to maintain
> >>>>>>>>>>> only one level of metrics. If that's more or less granular
> >>>>>>>>>>> (i.e., maybe we don't have a metric per reason and log the
> >>>>>>>>>>> reason instead), that's fine by me. I just don't think it
> >>>>>>>>>>> provides a lot of extra value per complexity (interface and
> >>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
> >>>>>>>>>>> addition to lower-level metrics.
> >>>>>>>>>>>
> >>>>>>>>>>> How about this instead:
> >>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> >>>>>>>>>>> per-task, or per-processor-node) with no "reason"
> >>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
> >>>>>>>>>>> and reason of the skipped record
> >>>>>>>>>>>
> >>>>>>>>>>> If you like that, I can update the KIP.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> -John
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> >>>>>>>>>>>>> coarser metrics being easier than finer ones. All the metric
> >>>>>>>>>>>>> alerting systems I have used make it equally easy to alert on
> >>>>>>>>>>>>> metrics by-tag or over tags. So my experience doesn't say that
> >>>>>>>>>>>>> this is a use case. Were you thinking of an alerting system
> >>>>>>>>>>>>> that makes such a pre-aggregation valuable?
> >>>>>>>>>>>>
> >>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded
> >>>>>>>>>>>> directly as part of the object name, and if users want to
> >>>>>>>>>>>> monitor them they need to know these values beforehand. That is
> >>>>>>>>>>>> also why I think we do want to list all the possible values of
> >>>>>>>>>>>> the reason tags in the KIP, since
> >>>>>>>>>>>>
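Guozhang's point — that the JMX reporter flattens every tag into the MBean object name, so alert rules must spell out concrete tag values — can be illustrated Kafka-free. The object-name shape below follows the general JMX `domain:key=value,...` convention; the specific group and tag names are illustrative assumptions:

```python
# Hedged illustration of why tag values must be known beforehand for JMX
# monitoring: each tag becomes a key property of the MBean object name,
# so a monitoring rule has to name concrete tag values up front. The
# metric group and tag names here are assumptions, not Kafka's exact ones.

def parse_object_name(name):
    """Split 'domain:key=value,key=value,...' into (domain, tag dict)."""
    domain, _, props = name.partition(":")
    tags = dict(p.split("=", 1) for p in props.split(","))
    return domain, tags

name = ("kafka.streams:type=stream-task-metrics,"
        "client-id=app-1-StreamThread-1,task-id=0_1,reason=deserialization")
domain, tags = parse_object_name(name)
print(domain, tags["task-id"], tags["reason"])
```

If a new "reason" value appeared without a KIP, existing JMX-based alert rules keyed on the old object names would simply never see it — which is why listing the values matters.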
> >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>>>>>>>>> kind of scenario that would lead me as an operator to run with
> >>>>>>>>>>>>> DEBUG on all the time, since I wouldn't be sure, having seen a
> >>>>>>>>>>>>> skipped record once, that it would ever happen again. The
> >>>>>>>>>>>>> solution is to capture all the available information about the
> >>>>>>>>>>>>> reason and location of skips all the time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> That is a good point. I think we can either expose all levels
> >>>>>>>>>>>> of metrics by default, or only expose the lowest-level metrics
> >>>>>>>>>>>> and get rid of the other levels to let users do the roll-ups
> >>>>>>>>>>>> themselves (which will be a much larger scope for discussion),
> >>>>>>>>>>>> or we can encourage users to not purely depend on metrics for
> >>>>>>>>>>>> such troubleshooting: that is to say, users would only be
> >>>>>>>>>>>> alerted based on metrics, and we can log an info / warn log4j
> >>>>>>>>>>>> entry each time we are about to skip a record, all over the
> >>>>>>>>>>>> places, so that upon being notified users can look into the
> >>>>>>>>>>>> logs to find the details on where / when it happens. WDYT?
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <john@confluent.io>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hey Guozhang,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the review.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1.
> >>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
> >>>>>>>>>>>>> values. I can list all possible values of the "reason" tag,
> >>>>>>>>>>>>> but I'm thinking this level of detail may not be KIP-worthy;
> >>>>>>>>>>>>> maybe the code and documentation review would be sufficient.
> >>>>>>>>>>>>> If you all disagree and would like it included in the KIP, I
> >>>>>>>>>>>>> can certainly do that.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
> >>>>>>>>>>>>> keeping the same name but eliminating the tags for the
> >>>>>>>>>>>>> dimensions that were rolled up.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 2.
> >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> >>>>>>>>>>>>> becomes a factor in choosing whether to (by default) update
> >>>>>>>>>>>>> one coarse metric at the thread level or one granular metric
> >>>>>>>>>>>>> at the processor-node level, since it's just one metric being
> >>>>>>>>>>>>> updated either way. I do agree that if we were to update the
> >>>>>>>>>>>>> granular metrics and multiple roll-ups, then we should
> >>>>>>>>>>>>> consider the efficiency.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics
> >>>>>>>>>>>>> for all nodes regardless of whether they can or do skip
> >>>>>>>>>>>>> records. Perhaps we can lazily register the metrics.
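The lazy-registration idea John floats here — only create a node's skip metric on its first skip, so nodes that never skip never appear — can be sketched as follows; the registry is a plain dict standing in for Kafka's Metrics class:

```python
# Hedged sketch of lazily registering per-node skip metrics: a counter
# is created only the first time a node actually skips a record, so
# nodes that can never skip add no entries to the registry.

class LazySkipMetrics:
    def __init__(self):
        self.registry = {}  # node_name -> skip count, created on demand

    def record_skip(self, node_name):
        # computeIfAbsent-style lazy registration plus increment
        self.registry[node_name] = self.registry.get(node_name, 0) + 1

m = LazySkipMetrics()
m.record_skip("KSTREAM-AGGREGATE-0000000002")
m.record_skip("KSTREAM-AGGREGATE-0000000002")
print(m.registry)
```

The trade-off is that monitoring systems polling the registry see a metric appear only after the first skip, so alert rules cannot assume the metric exists from startup.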
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
> >>>>>>>>>>>>> kind of scenario that would lead me as an operator to run with
> >>>>>>>>>>>>> DEBUG on all the time, since I wouldn't be sure, having seen a
> >>>>>>>>>>>>> skipped record once, that it would ever happen again. The
> >>>>>>>>>>>>> solution is to capture all the available information about the
> >>>>>>>>>>>>> reason and location of skips all the time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
> >>>>>>>>>>>>> coarser metrics being easier than finer ones. All the metric
> >>>>>>>>>>>>> alerting systems I have used make it equally easy to alert on
> >>>>>>>>>>>>> metrics by-tag or over tags. So my experience doesn't say that
> >>>>>>>>>>>>> this is a use case. Were you thinking of an alerting system
> >>>>>>>>>>>>> that makes such a pre-aggregation valuable?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks again,
> >>>>>>>>>>>>> -John
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello John,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
> >>>>>>>>>>>>>> tag? In the JIRA ticket I left some potential reasons, but
> >>>>>>>>>>>>>> I'm not clear if you're going to categorize each of them as a
> >>>>>>>>>>>>>> separate reason, or if there are any additional ones you have
> >>>>>>>>>>>>>> in mind.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also I'm wondering if we should add another metric that does
> >>>>>>>>>>>>>> not have the reason tag but aggregates among all possible
> >>>>>>>>>>>>>> reasons? This is for users to easily set their alerting
> >>>>>>>>>>>>>> notifications in their monitoring systems (otherwise they
> >>>>>>>>>>>>>> have to write one notification rule per reason).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually
> >>>>>>>>>>>>>> "per-thread, per-task, per-processor-node", and today we only
> >>>>>>>>>>>>>> set the per-thread metrics as INFO while leaving the lower
> >>>>>>>>>>>>>> two layers as DEBUG. I agree with your argument that we are
> >>>>>>>>>>>>>> missing the per-client roll-up metrics today, but I'm
> >>>>>>>>>>>>>> convinced that the right way to approach it would be
> >>>>>>>>>>>>>> "just providing the lowest-level metrics only".
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Note the recording implementation of these three levels is
> >>>>>>>>>>>>>> different internally today: we did not just do the rolling up
> >>>>>>>>>>>>>> to generate the higher-level metrics from the lower-level
> >>>>>>>>>>>>>> ones, but we record them separately, which means that, if we
> >>>>>>>>>>>>>> turn on multiple levels of metrics, we may duplicate
> >>>>>>>>>>>>>> collecting some metrics. One can argue that is not the best
> >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> >>>>>>>>>>>>>> reporting, but by only enabling thread-level metrics as INFO
> >>>>>>>>>>>>>> today, that implementation could be more efficient than only
> >>>>>>>>>>>>>> collecting the metrics at the lowest level and then doing the
> >>>>>>>>>>>>>> roll-up calculations outside of the metrics classes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> >>>>>>>>>>>>>> records; AFAIK we will only skip records at the source, sink,
> >>>>>>>>>>>>>> window and aggregation processor nodes, so adding a metric
> >>>>>>>>>>>>>> per processor looks like an overkill to me as well. On the
> >>>>>>>>>>>>>> other hand, from the user's perspective the "reason" tag may
> >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> >>>>>>>>>>>>>> topology records are being dropped on the floor. So I think
> >>>>>>>>>>>>>> the "per-thread, per-task" level metrics should be sufficient
> >>>>>>>>>>>>>> for them to troubleshoot in DEBUG mode, and we can add
> >>>>>>>>>>>>>> another "per-thread" level metric as INFO which is turned on
> >>>>>>>>>>>>>> by default. So under normal execution users still only need
> >>>>>>>>>>>>>> INFO level metrics for alerting (e.g. set alerts on all
> >>>>>>>>>>>>>> skipped-records metrics being non-zero), and then upon
> >>>>>>>>>>>>>> troubleshooting they can turn on DEBUG metrics to look into
> >>>>>>>>>>>>>> which task is actually causing the skipped records.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Guozhang
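The alerting flow Guozhang describes (alert on any non-zero skipped-records metric at INFO, then drill into DEBUG) reduces to a simple predicate over a metrics snapshot; the metric names below are illustrative assumptions, not Kafka's exact names:

```python
# Hedged sketch of the alert rule described above: fire when any
# skipped-records metric in a snapshot is non-zero. The snapshot is a
# plain name -> value dict standing in for a metrics poll.

def skipped_record_alerts(snapshot):
    """snapshot: {metric_name: value} -> sorted names that should alert."""
    return sorted(
        name for name, value in snapshot.items()
        if "skipped-records" in name and value > 0
    )

snapshot = {
    "thread-1.skipped-records-total": 0.0,
    "thread-2.skipped-records-total": 7.0,
    "thread-2.process-total": 100.0,
}
print(skipped_record_alerts(snapshot))
```

Because the rule matches on the metric name pattern rather than on a "reason" tag, it keeps working even if the set of reasons changes, which is the appeal of a single aggregated metric.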
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matthias@confluent.io>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks for the KIP John.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Reading the material on the related JIRAs, I am wondering
> >>>>>>>>>>>>>>> what `reason` tags you want to introduce? Can you elaborate?
> >>>>>>>>>>>>>>> The KIP should list those IMHO.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> >>>>>>>>>>>>>>>> simultaneously
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
> >>>>>>>>>>>>>>> atm that we should put the fine grained metrics at INFO
> >>>>>>>>>>>>>>> level and remove the roll-up at thread level.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> >>>>>>>>>>>>>>>> top-level view
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
> >>>>>>>>>>>>>>> Offering a built-in `KafkaStreams` "client" roll-up out of
> >>>>>>>>>>>>>>> the box might be a better solution. In the past we did not
> >>>>>>>>>>>>>>> offer this due to performance concerns, but we could allow
> >>>>>>>>>>>>>>> an "opt-in" mechanism. If you disagree, can you provide some
> >>>>>>>>>>>>>>> reasoning and add them to the "Rejected alternatives"
> >>>>>>>>>>>>>>> section.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing
> >>>>>>>>>>>>>>> top-level view, but instead of going more fine grained, we
> >>>>>>>>>>>>>>> should consider adding this top-level view and add/keep the
> >>>>>>>>>>>>>>> fine grained metrics at DEBUG level only.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
> >>>>>>>>>>>>>>> old metrics directly as you suggested.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> >>>>>>>>>>>>>>>> Looks good to me.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <john@confluent.io>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hello all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
> >>>>>>>>>>>>>>>>> skipped records in Streams.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please find the details here:
> >>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Please let me know what you think!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>> -John
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> --
> >>>>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>
>


-- 
-- Guozhang
