kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Tom Bentley <tbent...@redhat.com>
Subject Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect
Date Thu, 16 Jan 2020 09:15:54 GMT
Hi Konstantine,

b) I thought hard but briefly about this. Although there's potential for
> ambiguity if someone inspects the key on its own, I believe that if we
> combine this information with what we save in the value, the ambiguity is
> removed. With respect to whether a connector can override another
> connector's key, I don't think that's possible for any combination of
> unique topic name and connector name. I hope I'm not missing anything.
>

I think it is possible for connectors to have conflicting keys:

* topic-name="foo-connector" and connector-name="bar"
* topic-name="foo" and connector-name="connector-bar"

would both have the key "topic-foo-connector-connector-bar"


> c) That means that worker tasks will produce topic status records once they
> detect that the worker they are running on does not include a topic that is
> used by the tasks in the list of active topics for this connector. But once
> the worker adds this topic to the set of active topics for this connector,
> then the worker tasks stop producing those messages. They'll produce one
> again if the worker stops including that topic in the active topics set for
> some reason. I could improve the wording on the KIP, but I'd also like to
> know we are on the same page re: the design here.
>

OK, I understand now. I think it would be helpful to include the example
you gave in your previous reply.

Many thanks,

Tom


>
> Let me know if the above address your questions.
> Thanks,
> Konstantine
>
>
>
> On Wed, Jan 15, 2020 at 12:05 PM Randall Hauch <rhauch@gmail.com> wrote:
>
> > Almog,
> >
> > You raise some interesting questions. Comments inline below.
> >
> > On Wed, Jan 15, 2020 at 11:19 AM Almog Gavra <almog@confluent.io> wrote:
> >
> > > Hi Konstantine,
> > >
> > > Thanks for the KIP! This is going to make automatic integration with
> > > Connect much more powerful.
> > >
> > > My thoughts are mostly around freshness of the data and being able to
> > > expose that to users. Riffing on Randall's timestamp question - have we
> > > considered adding some interval at which point a connector will
> republish
> > > any topics that it encounters and update the timestamp? That way we
> have
> > > some refreshing mechanism that isn't as powerful as the complete reset
> > > (which may not be practical in many scenarios).
> > >
> >
> > My question about recording the timestamp at which each active topic
> record
> > were (infrequently) written was more about making a bit more information
> > available given the current design, and whether recording a bit more
> > information (for very little additional storage cost and no extra runtime
> > cost) may be worth it if in the future we figure out how to use this
> > information.
> >
> > I think it's more complicated to try to record the history of when topics
> > were most recently used, since that requires recording a lot more active
> > topic records than the current proposal. Besides, it's not unexpected
> that
> > source and sink connectors sometimes don't use topics for periods of
> time.
> > A sink connector only consumes from a topic when there are additional
> > records to consume, and a source connector only needs to write to a topic
> > when there is information in the upstream system targeted to that topic.
> An
> > example of the latter is that most database connectors will write to a
> > topic for a particular table only when rows in that table have changed.
> >
> >
> > > I also agree with Randall's other point (Would it be better to not
> > > automatically reset connector's active topics when a sink connector is
> > > restarted?). I think keeping the behavior as symmetrical between sink
> and
> > > source connectors is a good idea.
> > >
> > > Lastly, with regards to the API, I can imagine it is also pretty useful
> > to
> > > answer the inverse question: "which connectors write to topic X".
> Perhaps
> > > we can achieve this by letting the users compute it and just expose an
> > API
> > > that returns the entire mapping at once (instead of needing to call the
> > > /connectors/{name}/topics endpoint for each connector).
> >
> >
> > It may be worth considering a method such as /topics that would produce a
> > result that is an aggregate of the potentially many
> > /connectors/{name}/topic responses, especially if the result of the
> /topics
> > is an array of the individual responses from /connectors/{name}/topic.
> The
> > benefit of a single request is that the answer to "which connectors use
> > topic X" can be computed using simple tools such as 'jq'. And, if tooling
> > frequently fetches the active topic information for all connectors,
> > providing the aggregate method would reduce the load on the tooling and
> the
> > server. It may also be relatively easy to implement.
> >
> >
> > > Otherwise, looks good to me! Hits the requirements that I had in mind
> on
> > > the nose.
> > > - Almog
> > >
> > > On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley <tbentley@redhat.com>
> wrote:
> > >
> > > > Hi Konstantine,
> > > >
> > > > Thanks for the KIP, I can see how it could be useful.
> > > >
> > > > a) Did you consider using a metric for this? I don't think it would
> > > satisfy
> > > > all the use cases you have in mind, but you could mention it in the
> > > > rejected alternatives.
> > > >
> > > > b) If the topic name contains the string "-connector" then the key
> > format
> > > > is ambiguous. This isn't necessarily fatal because the value will
> > > > disambiguate, but it could be misleading. Any reason not to just use
> a
> > > JSON
> > > > key, and simplify the value?
> > > >
> > > > c) I didn't understand this part: "As soon as a worker detects the
> > > addition
> > > > of a topic to a connector's set of active topics, the worker will
> cease
> > > to
> > > > post update messages to the status.storage.topic for that connector.
> ".
> > > I'm
> > > > sure I've overlooking something but why is this necessary? Is this
> were
> > > the
> > > > task id in the value is used?
> > > >
> > > > Thanks again,
> > > >
> > > > Tom
> > > >
> > > > On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch <rhauch@gmail.com>
> > wrote:
> > > >
> > > > > Oh, one more thing:
> > > > >
> > > > > 9. There's no mention of how the status topic is partitioned, or
> how
> > > > > partitioning will be used by the new topic records. The KIP should
> > > > probably
> > > > > outline this for clarity and completeness.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Randall
> > > > >
> > > > > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch <rhauch@gmail.com>
> > > wrote:
> > > > >
> > > > > > Thanks, Konstantine. Overall, this KIP looks interesting and
> really
> > > > > > useful, and for the most part is spot on. I do have a number
of
> > > > > > questions/comments about specifics:
> > > > > >
> > > > > >    1. The topic records have a value that includes the connector
> > > name,
> > > > > >    task number that last reported the topic is used, and the
> topic
> > > > name.
> > > > > >    There's no mention of record timestamps, but I wonder if
it'd
> be
> > > > > useful to
> > > > > >    record this. One challenge might be that a connector does
not
> > > write
> > > > > to a
> > > > > >    topic for a while or the task remains running for long periods
> > of
> > > > > time and
> > > > > >    therefore the worker doesn't record that this topic has been
> > newly
> > > > > written
> > > > > >    to since it the task was restarted. IOW, the semantics of
the
> > > > > timestamp may
> > > > > >    be a bit murky. Have you thought about recording the
> timestamp,
> > > and
> > > > > if so
> > > > > >    what are the pros and cons?
> > > > > >    - The "Recording active topics" section says the following:
> > > > > >       "As soon as a worker detects the addition of a topic to
a
> > > > > >       connector's set of active topics, all the connector's
tasks
> > > that
> > > > > inspect
> > > > > >       source or sink records will cease to post update messages
> to
> > > the
> > > > > >       status.storage.topic."
> > > > > >       This probably means the timestamp won't be very useful.
> > > > > >    2. The KIP says "the Kafka record value stores the ID of
the
> > task
> > > > that
> > > > > >    succeeded to store a topic status record last." However,
this
> > is a
> > > > bit
> > > > > >    unclear: is it really storing the last task that successfully
> > > wrote
> > > > > to that
> > > > > >    topic (as this would require very frequent writes to this
> > topic),
> > > or
> > > > > is it
> > > > > >    more that this is the task that was last *recorded* as having
> > > > written
> > > > > >    to the topic? (Here, "recorded" could be a bit of a gray
area,
> > > since
> > > > > this
> > > > > >    would depend on the how the worker periodically records this
> > > > > information.)
> > > > > >    Any kind of clarity here might be helpful.
> > > > > >    3. In the "Recording active topics" section (and the
> surrounding
> > > > > >    sections), the "task" is used ambiguously. For example, "when
> > its
> > > > > tasks
> > > > > >    start processing their first records ... these tasks will
> start
> > > > > inspecting
> > > > > >    which is the Kafka topic of each of these records". IIUC,
the
> > > first
> > > > > "task"
> > > > > >    mentioned is the connector's task, and the second is the
> > worker's
> > > > > task. Do
> > > > > >    we need to distinguish this more clearly?
> > > > > >    4. Maybe I missed it, but does this KIP explicitly say that
> the
> > > > > >    Connector API is unchanged? It's probably worth pointing
out
> to
> > > help
> > > > > >    assuage any concerns that connector implementations have
to
> > change
> > > > to
> > > > > make
> > > > > >    use of this feature.
> > > > > >    5. In the "Resetting a connector's set of active topics"
> section
> > > the
> > > > > >    behavior is not exactly clear. Consider a user running
> connector
> > > > "A",
> > > > > the
> > > > > >    connector has been fully started and is processing records,
> and
> > > the
> > > > > worker
> > > > > >    has recorded topic usage records. Then the user resets the
> > active
> > > > > topics
> > > > > >    for connector A while the connector is still running? If
the
> > > > connector
> > > > > >    writes to no new topics, before the tasks are rebalanced
then
> is
> > > it
> > > > > correct
> > > > > >    that Connect would report no active topics? And after the
> tasks
> > > are
> > > > > >    rebalance, will the worker record any topics used by connector
> > A?
> > > > > >    6. In the "Restaring" (misspelled) section: "Reconfiguring
a
> > > source
> > > > > >    connector has also no altering effect for a source connector.
> > > > > However, when
> > > > > >    reconfiguring a sink connector if the new configuration no
> > longer
> > > > > includes
> > > > > >    any of the previously tracked topics, these topics will be
> > removed
> > > > > from the
> > > > > >    set of active topics for this sink connector by appending
> > > tombstone
> > > > > >    messages appropriately after the reconfiguration of the
> > > connector."
> > > > > Would
> > > > > >    it be better to not automatically reset connector's active
> > topics
> > > > > when a
> > > > > >    sink connector is restarted? Isn't that more consistent with
> the
> > > > > >    "Resetting" behavior and the goals at the top of the KIP:
> "it'd
> > be
> > > > > useful
> > > > > >    for users, operators and applications to know which are the
> > topics
> > > > > that a
> > > > > >    connector has used since it was first created"?
> > > > > >    7. The `PUT /connectors/{name}/topics/reset` endpoint "this
> > > request
> > > > > >    can be reapplied after the deletion of the connector". IOW,
> even
> > > > > though
> > > > > >    connector with that name doesn't exist, we can still make
this
> > > > > request? How
> > > > > >    does this compare with other methods such as "status"?
> > > > > >    8. What are the security implications of this proposal?
> > > > > >
> > > > > > As you can see, most of these can probably be addressed without
> > much
> > > > > work.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Randall
> > > > > >
> > > > > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis <
> > > > > > konstantine@confluent.io> wrote:
> > > > > >
> > > > > >> Hi all.
> > > > > >>
> > > > > >> I just posted KIP-558: Track the set of actively used topics
by
> > > > > connectors
> > > > > >> in Kafka Connect
> > > > > >>
> > > > > >> Wiki link here:
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect
> > > > > >>
> > > > > >> I think it's a nice extension to follow up on KIP-158 and
a
> useful
> > > > > feature
> > > > > >> to the ever increasing number of applications that are built
> > around
> > > > > Kafka
> > > > > >> Connect.
> > > > > >> Would love to hear what you think.
> > > > > >>
> > > > > >> Best,
> > > > > >> Konstantine
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message