From dev-return-110534-archive-asf-public=cust-asf.ponee.io@kafka.apache.org Thu Jan 16 00:14:46 2020 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id DAAAE18065E for ; Thu, 16 Jan 2020 01:14:45 +0100 (CET) Received: (qmail 135 invoked by uid 500); 16 Jan 2020 00:14:44 -0000 Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@kafka.apache.org Delivered-To: mailing list dev@kafka.apache.org Received: (qmail 121 invoked by uid 99); 16 Jan 2020 00:14:43 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 16 Jan 2020 00:14:43 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 5F16DC0557 for ; Thu, 16 Jan 2020 00:14:42 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 0.001 X-Spam-Level: X-Spam-Status: No, score=0.001 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.2, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=confluent.io Received: from mx1-he-de.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id iD8HTed7i0cP for ; Thu, 16 Jan 2020 00:14:39 +0000 (UTC) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=2607:f8b0:4864:20::b41; helo=mail-yb1-xb41.google.com; envelope-from=konstantine@confluent.io; receiver= Received: from mail-yb1-xb41.google.com (mail-yb1-xb41.google.com [IPv6:2607:f8b0:4864:20::b41]) by mx1-he-de.apache.org (ASF Mail Server at mx1-he-de.apache.org) with ESMTPS id C125C7DDE8 for ; Thu, 16 Jan 2020 00:14:38 +0000 (UTC) Received: by mail-yb1-xb41.google.com with SMTP id z15so3765316ybm.8 for ; Wed, 15 Jan 2020 16:14:38 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=confluent.io; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to; bh=pX3RBvA+WnEGbKSdZNsf7n/aL/z7h3TPOdyZwcQkmMw=; b=YOzCy4L9Irz0RRdlLK+8klSQXy4fRpYEPFDuh6WVCHI9g0VCXRYMh3VMUivEPF7iCT qnaUukCLFqZKj4W8jhB3DWHnFVER9BlVpfl2vL+bhmYRXCRVqx/qSc9KDbpAG+Al1iOI TzMgvUjWfCzVVAxhBKCI/zKUEH2By3cLWfLQ8iLaU+P41hxyzSTekxlNjtGI75k4dG6g WEc2ZRH2h+2TotJi+XjldvK64C+5AR+SYjrl+kUhipaMYyX9nUDRG//9WCoEqO7/FeKj Sw6cd0cajV1g6vk2UgYhCIQcRKF2X8ORQS4Q1+dvDC8hYANbygbHoUh+O6f12KLclwwV bI2w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to; bh=pX3RBvA+WnEGbKSdZNsf7n/aL/z7h3TPOdyZwcQkmMw=; b=GeW8GO2O7l+K6gSBahSkaKNwnHhZi9FYQoCdweN5AAobKMD8cj+w1RujuKKWc3UcPL fudyxpDRXLwiTEZy4Rr53EJtAO1fhy7ZE5Q6/RPfK/au2AFxFWeGOHVOOzWTGyRVBLEE W8U+0OfAjeCthjK5iaj1arnsEtS3Kd4MFWL+Fq9fkx7UXfH2llB2iHkNC96esH93I1t9 1ImP9FKF3uxNga4twECD4z/8Oj/h5qwbL9wT4hk34EA+QkLI6JNhqYWyGfLAGWisFv9u bsEpdOKP6IIIrtyVpHQV2M4HmjJIae/LgNRuQaZygA8FR35MiX9B5McPiB5Y2zKJXJaE +wdA== X-Gm-Message-State: APjAAAUtN6GGu2skl9TzB7SxXcu4QPVyzF3DjGncXTiJDiLUdTKdtidI SBOnaRTBNHBMyxc4fvxV8wpZiCZoMLXGSl/3+WEWO+/JZD0= X-Google-Smtp-Source: APXvYqzvzC3hl12w0eTCWpc8i8HUowC+mM2ykdkxX0ddCkPmRuvsqoxLq6t4oxE7WloIWb27qOUdc5b8gCb5D+kbZeQ= X-Received: by 2002:a25:c04f:: with SMTP id c76mr22785067ybf.355.1579133676997; Wed, 15 Jan 2020 16:14:36 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Konstantine Karantasis Date: Wed, 15 Jan 2020 16:14:25 -0800 Message-ID: Subject: Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect To: dev@kafka.apache.org Content-Type: multipart/alternative; boundary="000000000000def47f059c36b570" --000000000000def47f059c36b570 Content-Type: text/plain; charset="UTF-8" Hey Almog, thanks for the comments! Here's my take: 1) I think that an approximate grouping of topics to highly-active/active/inactive (because a precise one would be too expensive) seems like something we could leave out of this first version of topic tracking. Interestingly, as you point out, given a topic of continuous flow, you may get a similar view by resetting and subsequently querying the topics endpoint. The way I understand Randall's comment, is that persisting timestamps would have to do more with knowing approximately the first time this connector has been using its topics. But given the implications that topic retention might have, I suggest that we leave timestamp recording out at this point. 2) As I described in my previous answer above (#6 on my first reply to Randall's comments), this asymmetry is inherited from the sink connector's configuration. I'm ok not resetting the topics for both upon reconfiguration, since this will result in simpler code. But I'd like to know that we are ok with a sink connector showing topics in its active set that are not in its current configuration (unless a reset request is issued). 3) A similar idea crossed my mind while I was thinking how the "/connectors" endpoint evolved with KIP-465 to show a roll up of the status of the tasks of all the connectors. However, here what you describe would probably require an additional top-level "/topics" endpoint and a more complex filtering based on permissions. I'd suggest punting this feature, unless people think that is a really nice to have. In the meantime, as you mention, it is something that can be constructed with consecutive queries after an applications gets the list of connectors running in the Connect cluster. Cheers, Konstantine On Wed, Jan 15, 2020 at 2:41 PM Randall Hauch wrote: > On Wed, Jan 15, 2020 at 4:36 PM Randall Hauch wrote: > > > On Wed, Jan 15, 2020 at 2:05 PM Konstantine Karantasis < > > konstantine@confluent.io> wrote: > > > >> > >> 9. I assumed that partitioning is implied by default, because there's no > >> requirement for complete ordering of topic status records. But I'll add > >> this fact as a separate bullet. The status.storage.topic is already a > >> partitioned topic. > >> > > > > Agreed. I think it'd be sufficient to simply mention that partition will > > be chosen based upon the active topic records' keys, ensuring that all > > active topic records for the same connector will be written to the same > > partition and will be totally ordered. > > > > Well, my previous statement is not quite right. All topic records for the > same *connector and topic* will be written to the same partition and will > be totally ordered. But as you pointed out, it doesn't really matter, other > than that this feature will work with any # of partitions. The new bullet > you described would be sufficient. :-D > > > > > >> > >> I'm following up with the rest of the comments, shortly. > >> Thanks, > >> Konstantine > >> > >> > >> On Wed, Jan 15, 2020 at 9:19 AM Almog Gavra wrote: > >> > >> > Hi Konstantine, > >> > > >> > Thanks for the KIP! This is going to make automatic integration with > >> > Connect much more powerful. > >> > > >> > My thoughts are mostly around freshness of the data and being able to > >> > expose that to users. Riffing on Randall's timestamp question - have > we > >> > considered adding some interval at which point a connector will > >> republish > >> > any topics that it encounters and update the timestamp? That way we > have > >> > some refreshing mechanism that isn't as powerful as the complete reset > >> > (which may not be practical in many scenarios). > >> > > >> > I also agree with Randall's other point (Would it be better to not > >> > automatically reset connector's active topics when a sink connector is > >> > restarted?). I think keeping the behavior as symmetrical between sink > >> and > >> > source connectors is a good idea. > >> > > >> > Lastly, with regards to the API, I can imagine it is also pretty > useful > >> to > >> > answer the inverse question: "which connectors write to topic X". > >> Perhaps > >> > we can achieve this by letting the users compute it and just expose an > >> API > >> > that returns the entire mapping at once (instead of needing to call > the > >> > /connectors/{name}/topics endpoint for each connector). > >> > > >> > Otherwise, looks good to me! Hits the requirements that I had in mind > on > >> > the nose. > >> > - Almog > >> > > >> > On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley > >> wrote: > >> > > >> > > Hi Konstantine, > >> > > > >> > > Thanks for the KIP, I can see how it could be useful. > >> > > > >> > > a) Did you consider using a metric for this? I don't think it would > >> > satisfy > >> > > all the use cases you have in mind, but you could mention it in the > >> > > rejected alternatives. > >> > > > >> > > b) If the topic name contains the string "-connector" then the key > >> format > >> > > is ambiguous. This isn't necessarily fatal because the value will > >> > > disambiguate, but it could be misleading. Any reason not to just > use a > >> > JSON > >> > > key, and simplify the value? > >> > > > >> > > c) I didn't understand this part: "As soon as a worker detects the > >> > addition > >> > > of a topic to a connector's set of active topics, the worker will > >> cease > >> > to > >> > > post update messages to the status.storage.topic for that connector. > >> ". > >> > I'm > >> > > sure I've overlooking something but why is this necessary? Is this > >> were > >> > the > >> > > task id in the value is used? > >> > > > >> > > Thanks again, > >> > > > >> > > Tom > >> > > > >> > > On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch > >> wrote: > >> > > > >> > > > Oh, one more thing: > >> > > > > >> > > > 9. There's no mention of how the status topic is partitioned, or > how > >> > > > partitioning will be used by the new topic records. The KIP should > >> > > probably > >> > > > outline this for clarity and completeness. > >> > > > > >> > > > Best regards, > >> > > > > >> > > > Randall > >> > > > > >> > > > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch > >> > wrote: > >> > > > > >> > > > > Thanks, Konstantine. Overall, this KIP looks interesting and > >> really > >> > > > > useful, and for the most part is spot on. I do have a number of > >> > > > > questions/comments about specifics: > >> > > > > > >> > > > > 1. The topic records have a value that includes the connector > >> > name, > >> > > > > task number that last reported the topic is used, and the > topic > >> > > name. > >> > > > > There's no mention of record timestamps, but I wonder if it'd > >> be > >> > > > useful to > >> > > > > record this. One challenge might be that a connector does not > >> > write > >> > > > to a > >> > > > > topic for a while or the task remains running for long > periods > >> of > >> > > > time and > >> > > > > therefore the worker doesn't record that this topic has been > >> newly > >> > > > written > >> > > > > to since it the task was restarted. IOW, the semantics of the > >> > > > timestamp may > >> > > > > be a bit murky. Have you thought about recording the > timestamp, > >> > and > >> > > > if so > >> > > > > what are the pros and cons? > >> > > > > - The "Recording active topics" section says the following: > >> > > > > "As soon as a worker detects the addition of a topic to a > >> > > > > connector's set of active topics, all the connector's > tasks > >> > that > >> > > > inspect > >> > > > > source or sink records will cease to post update messages > to > >> > the > >> > > > > status.storage.topic." > >> > > > > This probably means the timestamp won't be very useful. > >> > > > > 2. The KIP says "the Kafka record value stores the ID of the > >> task > >> > > that > >> > > > > succeeded to store a topic status record last." However, this > >> is a > >> > > bit > >> > > > > unclear: is it really storing the last task that successfully > >> > wrote > >> > > > to that > >> > > > > topic (as this would require very frequent writes to this > >> topic), > >> > or > >> > > > is it > >> > > > > more that this is the task that was last *recorded* as having > >> > > written > >> > > > > to the topic? (Here, "recorded" could be a bit of a gray > area, > >> > since > >> > > > this > >> > > > > would depend on the how the worker periodically records this > >> > > > information.) > >> > > > > Any kind of clarity here might be helpful. > >> > > > > 3. In the "Recording active topics" section (and the > >> surrounding > >> > > > > sections), the "task" is used ambiguously. For example, "when > >> its > >> > > > tasks > >> > > > > start processing their first records ... these tasks will > start > >> > > > inspecting > >> > > > > which is the Kafka topic of each of these records". IIUC, the > >> > first > >> > > > "task" > >> > > > > mentioned is the connector's task, and the second is the > >> worker's > >> > > > task. Do > >> > > > > we need to distinguish this more clearly? > >> > > > > 4. Maybe I missed it, but does this KIP explicitly say that > the > >> > > > > Connector API is unchanged? It's probably worth pointing out > to > >> > help > >> > > > > assuage any concerns that connector implementations have to > >> change > >> > > to > >> > > > make > >> > > > > use of this feature. > >> > > > > 5. In the "Resetting a connector's set of active topics" > >> section > >> > the > >> > > > > behavior is not exactly clear. Consider a user running > >> connector > >> > > "A", > >> > > > the > >> > > > > connector has been fully started and is processing records, > and > >> > the > >> > > > worker > >> > > > > has recorded topic usage records. Then the user resets the > >> active > >> > > > topics > >> > > > > for connector A while the connector is still running? If the > >> > > connector > >> > > > > writes to no new topics, before the tasks are rebalanced then > >> is > >> > it > >> > > > correct > >> > > > > that Connect would report no active topics? And after the > tasks > >> > are > >> > > > > rebalance, will the worker record any topics used by > connector > >> A? > >> > > > > 6. In the "Restaring" (misspelled) section: "Reconfiguring a > >> > source > >> > > > > connector has also no altering effect for a source connector. > >> > > > However, when > >> > > > > reconfiguring a sink connector if the new configuration no > >> longer > >> > > > includes > >> > > > > any of the previously tracked topics, these topics will be > >> removed > >> > > > from the > >> > > > > set of active topics for this sink connector by appending > >> > tombstone > >> > > > > messages appropriately after the reconfiguration of the > >> > connector." > >> > > > Would > >> > > > > it be better to not automatically reset connector's active > >> topics > >> > > > when a > >> > > > > sink connector is restarted? Isn't that more consistent with > >> the > >> > > > > "Resetting" behavior and the goals at the top of the KIP: > >> "it'd be > >> > > > useful > >> > > > > for users, operators and applications to know which are the > >> topics > >> > > > that a > >> > > > > connector has used since it was first created"? > >> > > > > 7. The `PUT /connectors/{name}/topics/reset` endpoint "this > >> > request > >> > > > > can be reapplied after the deletion of the connector". IOW, > >> even > >> > > > though > >> > > > > connector with that name doesn't exist, we can still make > this > >> > > > request? How > >> > > > > does this compare with other methods such as "status"? > >> > > > > 8. What are the security implications of this proposal? > >> > > > > > >> > > > > As you can see, most of these can probably be addressed without > >> much > >> > > > work. > >> > > > > > >> > > > > Best regards, > >> > > > > > >> > > > > Randall > >> > > > > > >> > > > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis < > >> > > > > konstantine@confluent.io> wrote: > >> > > > > > >> > > > >> Hi all. > >> > > > >> > >> > > > >> I just posted KIP-558: Track the set of actively used topics by > >> > > > connectors > >> > > > >> in Kafka Connect > >> > > > >> > >> > > > >> Wiki link here: > >> > > > >> > >> > > > >> > >> > > > > >> > > > >> > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect > >> > > > >> > >> > > > >> I think it's a nice extension to follow up on KIP-158 and a > >> useful > >> > > > feature > >> > > > >> to the ever increasing number of applications that are built > >> around > >> > > > Kafka > >> > > > >> Connect. > >> > > > >> Would love to hear what you think. > >> > > > >> > >> > > > >> Best, > >> > > > >> Konstantine > >> > > > >> > >> > > > > > >> > > > > >> > > > >> > > >> > > > --000000000000def47f059c36b570--