kafka-dev mailing list archives

From John Roesler <j...@confluent.io>
Subject Re: [DISCUSS] KIP-471: Expose RocksDB Metrics in Kafka Streams
Date Tue, 16 Jul 2019 15:37:13 GMT
Hi Patrik,

Ah, I guess that was an important distinction! Thanks for clarifying.
If there's a method to call on the RocksDB instance, then you probably
want to consider just registering the metric as a Gauge that delegates
to RocksDB to get the value, as opposed to the normal pattern of using
a Sensor with sampled stats metrics.

Thanks,
-John
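In case it helps, the gauge pattern looks roughly like this. This is a minimal, self-contained sketch with stand-in types: the `Gauge` interface and `FakeDb` class below are hypothetical simplifications (the real Kafka `org.apache.kafka.common.metrics.Gauge` takes the metric config and current time as arguments, and the real handle would be a RocksDB instance).

```java
import java.util.HashMap;
import java.util.Map;

public class GaugeSketch {
    // Stand-in for a metrics gauge: re-reads the value on every query
    // instead of recording samples into a sensor.
    interface Gauge<T> { T value(); }

    // Hypothetical stand-in for a RocksDB handle exposing getProperty().
    static class FakeDb {
        private final Map<String, String> props = new HashMap<>();
        void setProperty(String name, String value) { props.put(name, value); }
        String getProperty(String name) { return props.get(name); }
    }

    public static void main(String[] args) {
        FakeDb db = new FakeDb();
        db.setProperty("rocksdb.block-cache-usage", "1024");

        // The gauge holds no state of its own; it delegates to the DB,
        // so each read reflects the live value.
        Gauge<Long> blockCacheUsage =
            () -> Long.parseLong(db.getProperty("rocksdb.block-cache-usage"));

        System.out.println(blockCacheUsage.value()); // prints 1024

        db.setProperty("rocksdb.block-cache-usage", "2048");
        System.out.println(blockCacheUsage.value()); // prints 2048, no re-registration
    }
}
```

The point of the delegation is that nothing needs to sample or record on the write path; the metric is only computed when someone actually reads it.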

On Mon, Jul 15, 2019 at 1:05 PM Patrik Kleindl <pkleindl@gmail.com> wrote:
>
> Hi John
> I translated it wrong, I meant 'might' instead of 'may'.
> If I find the time I'll take a look at the code to see how they could be added as metrics.
> Thanks for your input.
> Regards
> Patrik
>
> > > On 15.07.2019, at 18:03, John Roesler <john@confluent.io> wrote:
> >
> > Hey Patrik,
> >
> > Since KIP-471 is already accepted, and since this idea is not a
> > trivial extension of the KIP, I think we'd need to do a new KIP.
> >
> > Some points to consider: these additions could not be made to the
> > KeyValueStore interface, since they'd be only applicable to
> > RocksDB-backed stores, but there is no RocksDBKeyValueStore<K,V>,
> > which could be used to query the property. Rather, the rocks interface
> > is wrapped inside multiple layers that comprise the KeyValueStore.
> > Since these layers themselves are not public interfaces, it would be a
> > challenge to think of how to get access to these methods from
> > user-space code.
> >
> > On the other hand, could we just add new metrics to access those
> > properties? Offhand, it's unclear to me why you said they "may not be
> > exposed".
> >
> > Thanks,
> > -John
> >
> >> On Mon, Jul 15, 2019 at 5:11 AM Patrik Kleindl <pkleindl@gmail.com> wrote:
> >>
> >> Hi
> >> Adding this here because I asked and may have found the answer:
> >> The memory consumption may not be exposed via RocksDB metrics, but the
> >> numbers should be available as properties of the RocksDB instance itself.
> >> RocksDBStore could easily make this available by allowing access
> >> to db.getProperty.
> >> Available data (examples):
> >> rocksdb.block-cache-usage
> >> rocksdb.estimate-table-readers-mem
> >> rocksdb.cur-size-all-mem-tables
> >>
> >> Reference: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB and
> >> found the names of the properties here:
> >> https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h
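For illustration, summing those three properties approximates a store's memory footprint, per the wiki page referenced above. This is a self-contained sketch: the `DbHandle` interface is a stand-in mimicking `RocksDB.getProperty(String)` (which returns the value as a String), and the property values are hypothetical.

```java
import java.util.List;
import java.util.Map;

public class RocksDbMemorySketch {
    // Stand-in for a RocksDB handle; the real org.rocksdb.RocksDB
    // exposes getProperty(String), returning the value as a String.
    interface DbHandle { String getProperty(String name); }

    // Per the RocksDB memory-usage wiki page, approximate memory usage
    // is the sum of block cache, table readers, and memtable sizes.
    static final List<String> MEMORY_PROPERTIES = List.of(
        "rocksdb.block-cache-usage",
        "rocksdb.estimate-table-readers-mem",
        "rocksdb.cur-size-all-mem-tables");

    static long approximateMemoryUsage(DbHandle db) {
        return MEMORY_PROPERTIES.stream()
            .mapToLong(p -> Long.parseLong(db.getProperty(p)))
            .sum();
    }

    public static void main(String[] args) {
        // Hypothetical values, as if read from a live instance.
        Map<String, String> props = Map.of(
            "rocksdb.block-cache-usage", "52428800",
            "rocksdb.estimate-table-readers-mem", "1048576",
            "rocksdb.cur-size-all-mem-tables", "16777216");
        System.out.println(approximateMemoryUsage(props::get)); // prints 70254592
    }
}
```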
> >>
> >> Would it make sense to allow this kind of access?
> >> Should this be part of a follow-up KIP?
> >>
> >> best regards
> >>
> >> Patrik
> >>
> >>> On Wed, 19 Jun 2019 at 21:55, Bill Bejeck <bbejeck@gmail.com> wrote:
> >>>
> >>> Hi Bruno,
> >>>
> >>> Just getting caught up on this KIP thread.  Looks good to me, and I don't
> >>> have any additional comments to what's already been presented.
> >>>
> >>> Thanks,
> >>> Bill
> >>>
> >>>> On Wed, Jun 19, 2019 at 1:42 PM Bruno Cadonna <bruno@confluent.io> wrote:
> >>>>
> >>>> John and Guozhang,
> >>>>
> >>>> thank you for your comments.
> >>>>
> >>>> @Guozhang could you please also vote on the voting thread so that we
> >>>> have all votes in one place.
> >>>>
> >>>> @John, the only situation I can think of where a non-uniform
> >>>> configuration of segments would make sense is to account for
> >>>> seasonality. But this would be a really advanced configuration IMO.
> >>>>
> >>>> Best,
> >>>> Bruno
> >>>>
> >>>>> On Wed, Jun 19, 2019 at 7:18 PM John Roesler <john@confluent.io> wrote:
> >>>>>
> >>>>> One last thought. I think it makes sense what you propose for merging
> >>>>> the metrics when a logical store is composed of multiple physical
> >>>>> stores.
> >>>>>
> >>>>> The basic standard for these metrics is that they should be relevant
> >>>>> to performance, and they should be controllable via configurations,
> >>>>> specifically via RocksDBConfigSetter. The config setter takes as
> >>>>> input the store name. For Key/Value stores, this name is visible
> >>>>> in the tags as "rocksdb-state-id", so the workflow is that I
> >>>>> notice (e.g.) the metric rocksdb-state-id=STORE_X is showing a low
> >>>>> block cache hit ratio, so I add a condition in my config setter to
> >>>>> increase the block cache size for stores named STORE_X.
> >>>>>
> >>>>> For this case where the logical store has multiple physical
> >>>>> stores, we're really talking about segmented stores, i.e., window
> >>>>> stores and session stores. In these stores, every segment has a
> >>>>> different name (logicalStoreName + "." + segmentId *
> >>>>> segmentInterval), so the config setter would need to use a prefix
> >>>>> match with respect to the metric tag (e.g.)
> >>>>> rocksdb-window-state-id. But at the same time, this is totally
> >>>>> doable.
> >>>>>
> >>>>> It's also perfectly reasonable: since segments get rotated out all
> >>>>> the time, it's implausible that you'd ever want a non-uniform
> >>>>> configuration over the segments in a store. For this reason,
> >>>>> specifically, it makes more sense just to summarize the metrics
> >>>>> than to present them individually. Might be worth documenting the
> >>>>> relationship, though.
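The prefix-match workflow John describes above could be sketched like this. The segment-name format is taken from the text (logicalStoreName + "." + segmentId * segmentInterval); a real RocksDBConfigSetter would apply the match inside its setConfig(storeName, options, configs) callback, which this self-contained sketch omits.

```java
public class SegmentNameMatchSketch {
    // Segment stores are named logicalStoreName + "." + (segmentId * segmentInterval),
    // so a config setter must prefix-match rather than compare names exactly.
    static String segmentName(String logicalStoreName, long segmentId,
                              long segmentInterval) {
        return logicalStoreName + "." + segmentId * segmentInterval;
    }

    // True when the physical (segment) store belongs to the logical store.
    static boolean belongsTo(String physicalName, String logicalStoreName) {
        return physicalName.startsWith(logicalStoreName + ".");
    }

    public static void main(String[] args) {
        String name = segmentName("STORE_X", 3, 60_000L);
        System.out.println(name);                              // STORE_X.180000
        System.out.println(belongsTo(name, "STORE_X"));        // true
        System.out.println(belongsTo("STORE_Y.0", "STORE_X")); // false
    }
}
```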
> >>>>>
> >>>>> Thanks again, I'll vote now,
> >>>>> -John
> >>>>>
> >>>>> On Wed, Jun 19, 2019 at 12:03 PM John Roesler <john@confluent.io> wrote:
> >>>>>>
> >>>>>> Just taking a look over the metrics again, I had one thought...
> >>>>>>
> >>>>>> Stuff that happens in a background thread (like compaction metrics)
> >>>>>> can't directly identify compactions as a bottleneck from Streams'
> >>>>>> perspective. I.e., a DB might do a lot of compactions, but if those
> >>>>>> compactions never delay a write or read, then they cannot be a
> >>>>>> bottleneck.
> >>>>>>
> >>>>>> Thus, the "stall" metric should be the starting point for bottleneck
> >>>>>> identification, and then the flush/compaction metrics can be used to
> >>>>>> secondarily identify what to do to relieve the bottleneck.
> >>>>>>
> >>>>>> This doesn't affect the metrics you proposed, but I'd suggest saying
> >>>>>> something to this effect in whatever documentation or descriptions
> >>>>>> we provide.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> -John
> >>>>>>
> >>>>>> On Wed, Jun 19, 2019 at 11:25 AM John Roesler <john@confluent.io> wrote:
> >>>>>>>
> >>>>>>> Thanks for the updates.
> >>>>>>>
> >>>>>>> Personally, I'd be in favor of not going out on a limb with
> >>>>>>> unsupported metrics APIs. We should take care to make sure that
> >>>>>>> what we add in KIP-471 is stable and well supported, even if it's
> >>>>>>> not the complete picture. We can always do follow-on work to
> >>>>>>> tackle complex metrics as an isolated design exercise.
> >>>>>>>
> >>>>>>> Just my two cents.
> >>>>>>> Thanks,
> >>>>>>> -John
> >>>>>>>
> >>>>>>> On Wed, Jun 19, 2019 at 6:02 AM Bruno Cadonna <bruno@confluent.io> wrote:
> >>>>>>>>
> >>>>>>>> Hi Guozhang,
> >>>>>>>>
> >>>>>>>> Regarding your comments about the wiki page:
> >>>>>>>>
> >>>>>>>> 1) Exactly, I rephrased the paragraph to make it more clear.
> >>>>>>>>
> >>>>>>>> 2) Yes, I used the wrong term. All hit-related metrics are
> >>>>>>>> ratios. I corrected the names of the affected metrics.
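To make the rate/ratio distinction being discussed concrete, a hit *ratio* is a fraction of lookups, while a hit *rate* would be hits per unit time. A trivial sketch (the counter values here are hypothetical):

```java
public class HitRatioSketch {
    // Hit ratio = hits / (hits + misses): a fraction in [0, 1],
    // unlike a hit rate, which would be hits per unit time.
    static double hitRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        System.out.println(hitRatio(75, 25)); // 0.75
        System.out.println(hitRatio(0, 0));   // 0.0 (no lookups yet)
    }
}
```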
> >>>>>>>>
> >>>>>>>> Regarding your meta comments:
> >>>>>>>>
> >>>>>>>> 1) The plan is to expose the hit ratio. I used the wrong term.
> >>>>>>>> The formulas compute ratios. Regarding your question about a
> >>>>>>>> metric to know from where a read is served when it is not in
> >>>>>>>> the memtable, there are metrics in RocksDB that give you the
> >>>>>>>> number of get() queries that are served from L0, L1, and
> >>>>>>>> L2_AND_UP. I could not find any metric that gives you
> >>>>>>>> information about whether a query was served from disk vs. OS
> >>>>>>>> cache. One metric that could be used to indirectly measure
> >>>>>>>> whether disk or OS cache is accessed seems to be
> >>>>>>>> READ_BLOCK_GET_MICROS, which gives you the time for an IO read
> >>>>>>>> of a block. If it is high, it was read from disk, otherwise
> >>>>>>>> from the OS cache. A similar strategy to monitor the
> >>>>>>>> performance is described in [1]. DISCLAIMER:
> >>>>>>>> READ_BLOCK_GET_MICROS is not documented. I had to look into the
> >>>>>>>> C++ code to understand its meaning. I could have missed
> >>>>>>>> something.
> >>>>>>>>
> >>>>>>>> 2) There are some additional compaction statistics that contain
> >>>>>>>> sizes of files on disk and numbers about write amplification
> >>>>>>>> that you can get programmatically in RocksDB, but they are for
> >>>>>>>> debugging purposes [2]. To get this data and publish it into a
> >>>>>>>> metric, one has to parse a string. Since this data is for
> >>>>>>>> debugging purposes, I do not know how stable the output format
> >>>>>>>> is. One thing we could do is to dump the string with the
> >>>>>>>> compaction statistics into our log files at DEBUG level. But
> >>>>>>>> that is outside of the scope of this KIP.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Bruno
> >>>>>>>>
> >>>>>>>> [1] https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context#block-cache-and-os-page-cache-efficiency
> >>>>>>>> [2] https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#rocksdb-statistics
> >>>>>>>>
> >>>>>>>> On Tue, Jun 18, 2019 at 8:24 PM Guozhang Wang <wangguoz@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hello Bruno,
> >>>>>>>>>
> >>>>>>>>> I've read through the aggregation section and I think they
> >>>>>>>>> look good to me. There are a few minor comments about the wiki
> >>>>>>>>> page itself:
> >>>>>>>>>
> >>>>>>>>> 1) "A state store might consist of multiple state stores" ->
> >>>>>>>>> You mean a `logical` state store can consist of multiple
> >>>>>>>>> `physical` store instances?
> >>>>>>>>>
> >>>>>>>>> 2) The "Hit Rates" calculation seems to be referring to the
> >>>>>>>>> `Hit Ratio` (which is a percentage) rather than `Hit Rate`?
> >>>>>>>>>
> >>>>>>>>> And a couple further meta comments:
> >>>>>>>>>
> >>>>>>>>> 1) For memtable / block cache, instead of the hit-rate, do you
> >>>>>>>>> think we should expose the hit-ratio? I felt it is more useful
> >>>>>>>>> for users to debug what's the root cause of unexpected slow
> >>>>>>>>> performance.
> >>>>>>>>>
> >>>>>>>>> And for block cache misses, is it easy to provide a metric as
> >>>>>>>>> of "target read" of where a read is served (from which level,
> >>>>>>>>> either in OS cache or in SST files), similar to Fig.11 in
> >>>>>>>>> http://cidrdb.org/cidr2017/papers/p82-dong-cidr17.pdf?
> >>>>>>>>>
> >>>>>>>>> 2) As @Patrik mentioned, is there a good way we can expose the
> >>>>>>>>> total amount of memory and disk usage for each state store as
> >>>>>>>>> well? I think it would also be very helpful for users to
> >>>>>>>>> understand their capacity needs and read / write
> >>>>>>>>> amplifications.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Guozhang
> >>>>>>>>>
> >>>>>>>>> On Fri, Jun 14, 2019 at 6:55 AM Bruno Cadonna <bruno@confluent.io> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I decided to go for the option in which metrics are exposed
> >>>>>>>>>> for each logical state store. I revisited the KIP
> >>>>>>>>>> correspondingly and added a section on how to aggregate
> >>>>>>>>>> metrics over multiple physical RocksDB instances within one
> >>>>>>>>>> logical state store. It would be great if you could take a
> >>>>>>>>>> look and give feedback. If nobody has complaints about the
> >>>>>>>>>> chosen option, I would proceed with voting on this KIP since
> >>>>>>>>>> this was the last open question.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Bruno
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jun 7, 2019 at 9:38 PM Patrik Kleindl <pkleindl@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Sophie
> >>>>>>>>>>> This will be a good change, I have been thinking about
> >>>>>>>>>>> proposing something similar or even passing the properties
> >>>>>>>>>>> per store.
> >>>>>>>>>>> RocksDB should probably know how much memory was reserved,
> >>>>>>>>>>> but maybe does not expose it.
> >>>>>>>>>>> We are limiting it already as you suggested, but this is a
> >>>>>>>>>>> rather crude tool.
> >>>>>>>>>>> Especially in a larger topology with mixed loads per topic,
> >>>>>>>>>>> it would be helpful to get more insights into which store
> >>>>>>>>>>> puts a lot of load on memory.
> >>>>>>>>>>> Regarding the limiting capability, I think I remember reading
> >>>>>>>>>>> that those limits only affect some parts of the memory and
> >>>>>>>>>>> others can still exceed this limit. I'll try to look up the
> >>>>>>>>>>> difference.
> >>>>>>>>>>> Best regards
> >>>>>>>>>>> Patrik
> >>>>>>>>>>>
> >>>>>>>>>>>> On 07.06.2019, at 21:03, Sophie Blee-Goldman <sophie@confluent.io> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hi Patrik,
> >>>>>>>>>>>>
> >>>>>>>>>>>> As of 2.3 you will be able to use the RocksDBConfigSetter
> >>>>>>>>>>>> to effectively bound the total memory used by RocksDB for a
> >>>>>>>>>>>> single app instance. You should already be able to limit
> >>>>>>>>>>>> the memory used per RocksDB store, though as you mention
> >>>>>>>>>>>> there can be a lot of them. I'm not sure you can monitor
> >>>>>>>>>>>> the memory usage if you are not limiting it, though.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Jun 7, 2019 at 2:06 AM Patrik Kleindl <pkleindl@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi
> >>>>>>>>>>>>> Thanks Bruno for the KIP, this is a very good idea.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have one question: are there metrics available for the
> >>>>>>>>>>>>> memory consumption of RocksDB?
> >>>>>>>>>>>>> As they are running outside the JVM, we have run into
> >>>>>>>>>>>>> issues because they were using all the other memory.
> >>>>>>>>>>>>> And with multiple Streams applications on the same
> >>>>>>>>>>>>> machine, each with several KTables and 10+ partitions per
> >>>>>>>>>>>>> topic, the number of stores can get out of hand pretty
> >>>>>>>>>>>>> easily.
> >>>>>>>>>>>>> Or did I miss something obvious about how those can be
> >>>>>>>>>>>>> monitored better?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> best regards
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Patrik
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, 17 May 2019 at 23:54, Bruno Cadonna <bruno@confluent.io> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> this KIP describes the extension of the Kafka Streams
> >>>>>>>>>>>>>> metrics to include RocksDB's internal statistics.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please have a look at it and let me know what you think.
> >>>>>>>>>>>>>> Since I am not a RocksDB expert, I am thankful for any
> >>>>>>>>>>>>>> additional pair of eyes that evaluates this KIP.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-471:+Expose+RocksDB+Metrics+in+Kafka+Streams
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best regards,
> >>>>>>>>>>>>>> Bruno
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> -- Guozhang
> >>>>
> >>>
