From: Brett Rann
Date: Thu, 16 Aug 2018 10:33:38 +1000
Subject: Re: [DISCUSS] KIP-354 Time-based log compaction policy
To: dev@kafka.apache.org

We've been looking into this too.

Mailing list:
https://lists.apache.org/thread.html/ed7f6a6589f94e8c2a705553f364ef599cb6915e4c3ba9b561e610e4@%3Cdev.kafka.apache.org%3E
jira wish: https://issues.apache.org/jira/browse/KAFKA-7137
confluent slack discussion:
https://confluentcommunity.slack.com/archives/C49R61XMM/p1530760121000039

A person on my team has started on code so you might want to coordinate:
https://github.com/dongxiaohe/kafka/tree/dongxiaohe/log-cleaner-compaction-max-lifetime-2.0

He's been working with Jason Gustafson and James Chen around the changes.
You can ping him on confluent slack as Xiaohe Dong.

It's great to know others are thinking on it as well.

You've added the requirement to force a segment roll which we hadn't gotten
to yet, which is great. I was content with it not including the active
segment.

> Adding topic level configuration "max.compaction.lag.ms", and
> corresponding broker configuration "log.cleaner.max.compaction.lag.ms",
> which is set to 0 (disabled) by default.

Glancing at some other settings, the convention seems to me to be -1 for
disabled (or infinite, which is more meaningful here). 0 to me implies
instant, a little quicker than 1.

We've been trying to think about a way to trigger compaction as well
through an API call, which would need to be flagged somewhere (ZK admin/
space?) but we're struggling to think how that would be coordinated across
brokers and partitions. Have you given any thought to that?

On Thu, Aug 16, 2018 at 8:44 AM xiongqi wu wrote:

> Eno, Dong,
>
> I have updated the KIP. We decided not to address the issue that we might
> have for both compaction and time retention enabled topics (see the
> rejected alternative item 2). This KIP will only ensure log can be
> compacted after a specified time-interval.
>
> As suggested by Dong, we will also enforce "max.compaction.lag.ms" is not
> less than "min.compaction.lag.ms".
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-354: Time-based log
> compaction policy
>
> On Tue, Aug 14, 2018 at 5:01 PM, xiongqi wu wrote:
>
> > Per discussion with Dong, he made a very good point that if compaction
> > and time based retention are both enabled on a topic, the compaction
> > might prevent records from being deleted on time. The reason is when
> > compacting multiple segments into one single segment, the newly created
> > segment will have the same lastModified timestamp as the latest original
> > segment. We lose the timestamp of all original segments except the last
> > one. As a result, records might not be deleted as they should be through
> > time based retention.
> >
> > With the current KIP proposal, if we want to ensure timely deletion, we
> > have the following configurations:
> > 1) enable time based log compaction only: deletion is done through
> > overriding the same key
> > 2) enable time based log retention only: deletion is done through
> > time-based retention
> > 3) enable both log compaction and time based retention: deletion is not
> > guaranteed.
> >
> > Not sure if we have use case 3 and also want deletion to happen on time.
> > There are several options to address the deletion issue when enabling
> > both compaction and retention:
> > A) During log compaction, look into the record timestamp to delete
> > expired records. This can be done in the compaction logic itself or use
> > AdminClient.deleteRecords(). But this assumes we have record timestamps.
> > B) Retain the lastModified time of the original segments during log
> > compaction. This requires extra metadata to record the information, or
> > not grouping multiple segments into one during compaction.
> >
> > If we have use case 3 in general, I would prefer solution A and rely on
> > record timestamp.
> >
> >
> > Two questions:
> > Do we have use case 3? Is it nice to have or must have?
> > If we have use case 3 and want to go with solution A, should we
> > introduce a new configuration to enforce deletion by timestamp?
> >
> >
> > On Tue, Aug 14, 2018 at 1:52 PM, xiongqi wu wrote:
> >
> >> Dong,
> >>
> >> Thanks for the comment.
> >>
> >> There are two retention policies: log compaction and time based
> >> retention.
> >>
> >> Log compaction:
> >>
> >> We have use cases to keep infinite retention of a topic (only
> >> compaction). GDPR cares about deletion of PII (personally identifiable
> >> information) data.
> >> Since Kafka doesn't know what records contain PII, it relies on the
> >> upper layer to delete those records.
> >> For those infinite retention use cases, Kafka needs to provide a way to
> >> enforce compaction on time. This is what we try to address in this KIP.
> >>
> >> Time based retention:
> >>
> >> There are also use cases where users of Kafka might want to expire all
> >> their data.
> >> In those cases, they can use time based retention of their topics.
> >>
> >>
> >> Regarding your first question, if a user wants to delete a key in the
> >> log compaction topic, the user has to send a deletion using the same
> >> key.
> >> Kafka only makes sure the deletion will happen within a certain time
> >> period (like 2 days/7 days).
> >>
> >> Regarding your second question: in most cases, we might want to delete
> >> all duplicated keys at the same time.
> >> Compaction might be more efficient since we need to scan the log and
> >> find all duplicates. However, the expected use case is to set the time
> >> based compaction interval on the order of days, and larger than "min
> >> compaction lag". We don't want log compaction to happen frequently
> >> since it is expensive. The purpose is to help low production rate
> >> topics get compacted on time. For topics with a "normal" incoming
> >> message rate, the "min dirty ratio" might have triggered the compaction
> >> before this time based compaction policy takes effect.
> >>
> >>
> >> Eno,
> >>
> >> For your question, like I mentioned we have long time retention use
> >> cases for log compacted topics, but we want to provide the ability to
> >> delete certain PII records on time.
> >> Kafka itself doesn't know whether a record contains sensitive
> >> information and relies on the user for deletion.
> >>
> >>
> >> On Mon, Aug 13, 2018 at 6:58 PM, Dong Lin wrote:
> >>
> >>> Hey Xiongqi,
> >>>
> >>> Thanks for the KIP. I have two questions regarding the use-case for
> >>> meeting GDPR requirements.
> >>>
> >>> 1) If I recall correctly, one of the GDPR requirements is that we
> >>> cannot keep messages longer than e.g. 30 days in storage (e.g. Kafka).
> >>> Say there exists a partition p0 which contains message1 with key1 and
> >>> message2 with key2. And then the user keeps producing messages with
> >>> key=key2 to this partition. Since message1 with key1 is never
> >>> overridden, sooner or later we will want to delete message1 and keep
> >>> the latest message with key=key2. But currently it looks like the log
> >>> compaction logic in Kafka will always put these messages in the same
> >>> segment. Will this be an issue?
> >>>
> >>> 2) The current KIP intends to provide the capability to delete a given
> >>> message in a log compacted topic. Does such a use-case also require
> >>> Kafka to keep the messages produced before the given message? If yes,
> >>> then we can probably just use AdminClient.deleteRecords() or time-based
> >>> log retention to meet the use-case requirement. If no, do you know what
> >>> the GDPR's requirement is on time-to-deletion after the user explicitly
> >>> requests the deletion (e.g. 1 hour, 1 day, 7 days)?
> >>>
> >>> Thanks,
> >>> Dong
> >>>
> >>>
> >>> On Mon, Aug 13, 2018 at 3:44 PM, xiongqi wu wrote:
> >>>
> >>> > Hi Eno,
> >>> >
> >>> > The GDPR request we are getting here at LinkedIn is: if we get a
> >>> > request to delete a record through a null key on a log compacted
> >>> > topic, we want to delete the record via compaction in a given time
> >>> > period like 2 days (whatever is required by the policy).
> >>> >
> >>> > There might be other issues (such as orphan log segments under
> >>> > certain conditions) that lead to GDPR problems, but they are more
> >>> > like something we need to fix anyway regardless of GDPR.
> >>> >
> >>> >
> >>> > -- Xiongqi (Wesley) Wu
> >>> >
> >>> > On Mon, Aug 13, 2018 at 2:56 PM, Eno Thereska <eno.thereska@gmail.com>
> >>> > wrote:
> >>> >
> >>> > > Hello,
> >>> > >
> >>> > > Thanks for the KIP. I'd like to see a more precise definition of
> >>> > > what part of GDPR you are targeting as well as some sort of
> >>> > > verification that this KIP actually addresses the problem. Right
> >>> > > now I find this a bit vague:
> >>> > >
> >>> > > "Ability to delete a log message through compaction in a timely
> >>> > > manner has become an important requirement in some use cases
> >>> > > (e.g., GDPR)"
> >>> > >
> >>> > >
> >>> > > Is there any guarantee that after this KIP the GDPR problem is
> >>> > > solved or do we need to do something else as well, e.g., more KIPs?
> >>> > >
> >>> > >
> >>> > > Thanks
> >>> > >
> >>> > > Eno
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Thu, Aug 9, 2018 at 4:18 PM, xiongqi wu wrote:
> >>> > >
> >>> > > > Hi Kafka,
> >>> > > >
> >>> > > > This KIP tries to address GDPR concerns to fulfill deletion
> >>> > > > requests on time through time-based log compaction on a
> >>> > > > compaction enabled topic:
> >>> > > >
> >>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-354%3A+Time-based+log+compaction+policy
> >>> > > >
> >>> > > > Any feedback will be appreciated.
> >>> > > >
> >>> > > >
> >>> > > > Xiongqi (Wesley) Wu
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >> --
> >> Xiongqi (Wesley) Wu
> >>
> >
> >
> > --
> > Xiongqi (Wesley) Wu
> >
>
>
> --
> Xiongqi (Wesley) Wu
>

--
Brett Rann
Senior DevOps Engineer
Zendesk International Ltd
395 Collins Street, Melbourne VIC 3000 Australia
Mobile: +61 (0) 418 826 017
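
To make the two configuration rules discussed in this thread concrete, here is a minimal Java sketch. It is not Kafka code: the class `CompactionLagSketch`, its methods `validate` and `isCompactionDue`, and the `DISABLED` constant are hypothetical names chosen for illustration. It assumes -1 as the "disabled" sentinel (the value suggested above; the KIP draft quoted above uses 0) and assumes the time-based trigger compares the age of the oldest uncompacted record against "max.compaction.lag.ms".

```java
/**
 * Hypothetical helper, not part of Kafka: illustrates the rules discussed in
 * this thread for "min.compaction.lag.ms" and "max.compaction.lag.ms".
 */
final class CompactionLagSketch {
    /** Assumed sentinel for "disabled"; the KIP draft uses 0, this thread suggests -1. */
    static final long DISABLED = -1L;

    /** Rejects a max lag that is enabled but smaller than the min lag. */
    static void validate(long minLagMs, long maxLagMs) {
        if (minLagMs < 0) {
            throw new IllegalArgumentException("min.compaction.lag.ms must be >= 0");
        }
        if (maxLagMs != DISABLED && maxLagMs < minLagMs) {
            throw new IllegalArgumentException(
                "max.compaction.lag.ms (" + maxLagMs + ") must not be less than "
                    + "min.compaction.lag.ms (" + minLagMs + ")");
        }
    }

    /** True once the oldest uncompacted record has waited at least maxLagMs. */
    static boolean isCompactionDue(long earliestDirtyTimestampMs, long nowMs, long maxLagMs) {
        if (maxLagMs == DISABLED) {
            return false; // time-based trigger is off; other triggers are out of scope here
        }
        return nowMs - earliestDirtyTimestampMs >= maxLagMs;
    }

    public static void main(String[] args) {
        long twoDaysMs = 2L * 24 * 60 * 60 * 1000;
        validate(0L, twoDaysMs); // accepted: max lag is not less than min lag
        System.out.println(isCompactionDue(0L, twoDaysMs, twoDaysMs));
        System.out.println(isCompactionDue(0L, twoDaysMs, DISABLED));
    }
}
```

With a dedicated sentinel, 0 stays available as a legitimate "compact as soon as possible" value, which is the distinction raised above about 0 reading as "instant" rather than "disabled".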