From: Luke Forehand <luke.forehand@networkedinsights.com>
To: dev@kafka.apache.org
Subject: RE: replicas have different earliest offset
Date: Wed, 28 Aug 2013 23:59:44 +0000

Jay, great information, thank you. I am in a testing phase, so I have been
continually resetting the commit offsets of my consumers before re-running
consumer performance tests. I realize now that my retention policy was set to
7 days, and that I had added 3 new brokers at day 5 and reassigned partitions
to these new brokers.
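For reference, the broker-side settings in play here look roughly like this
(a sketch; the property names are as I understand them from the 0.8 docs, and
only the 7-day value is the real one from my setup):

    # server.properties (sketch)
    log.retention.hours=168       # keep log segments for ~7 days
    log.segment.bytes=1073741824  # roll a new segment file every ~1GB
    # retention removes whole segment files only, never part of one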
So it seems the partitions owned by the original broker 0 have rolled, but
reassigning partitions to brokers 1, 2, and 3 has reset the retention clock
for those partitions. For the sake of consistency, maybe the current state of
the retention policy could be sent to the new broker during partition
reassignment; that way, partitions on brokers 1, 2, and 3 would roll at
roughly the same time as the partitions on broker 0. Although, like you said,
it's a lower bound and perhaps not that important (just slightly confusing
when a noob is trying to spot-check the validity of a replica). In the
meantime I will disable the retention policy and start consuming at an offset
that is in the range of all replicas. Thank you again!
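If I've understood the lower-bound behavior you describe below, the
per-partition cleanup decision amounts to roughly the following (my own
sketch of the model, not the broker's actual code):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class RetentionModelSketch {
        // Segments are ordered oldest-first. A segment is only eligible for
        // deletion once all of it has aged past the retention window, so the
        // window is a lower bound on what is kept. A reassigned replica
        // writes fresh segment files with new modification times, which is
        // what restarts the clock on brokers 1, 2, and 3.
        static List<File> expiredSegments(List<File> oldestFirst,
                                          long retentionMs, long nowMs) {
            List<File> toDelete = new ArrayList<File>();
            for (File segment : oldestFirst) {
                if (nowMs - segment.lastModified() <= retentionMs) {
                    break; // this segment and all newer ones are kept
                }
                toDelete.add(segment);
            }
            return toDelete;
        }
    }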
Luke Forehand | NetworkedInsights.com | Software Engineer

________________________________________
From: Jay Kreps
Sent: Wednesday, August 28, 2013 5:29 PM
To: dev@kafka.apache.org
Subject: Re: replicas have different earliest offset

On a single server, our retention window is always approximate and a lower
bound on what is retained, since we only discard full log segments at a time.
That is, if you say you want to retain 100GB and have a 1GB segment size, we
will discard the oldest segment only when doing so would not bring the
retained data below 100GB (and similarly with time).

Between servers, no attempt is made to synchronize the discard of data. That
is, it is likely that all replicas will discard at roughly the same time, but
this is purely a local computation for each of them. Since it is approximate
and a lower bound, it does not seem useful to try to synchronize it further.

If your consumers are bumping up against the retention window so closely that
they may actually be falling off, that is a problem. Indeed, even in the
absence of a leader change, it is likely that if you are lagging this much
you will eventually fall off the end of the retention window on the leader.
So this is either a problem of the retention being too small (double it) or
of the consumer being fundamentally unable to keep up (in which case no
amount of retention will help).

-Jay


On Wed, Aug 28, 2013 at 2:51 PM, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

> I'm running into strange behavior when testing failure scenarios. I have
> 4 brokers and 8 partitions for a topic called "feed". I wrote a piece of
> code that prints out the partitionId, leaderId, and earliest offset for
> each partition.
>
> Here is the printed information about partition leader earliest offsets:
>
> partition:0 leader:0 offset: 1676913
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:0 offset: 1676760
> partition:4 leader:0 offset: 1676635
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:0 offset: 1676101
>
> I then kill broker 0 (using kill <pid>) and re-run my program:
>
> partition:0 leader:1 offset: 0
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:3 offset: 0
> partition:4 leader:1 offset: 0
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:1 offset: 0
>
> As you can see, the leaders have changed wherever the leader was broker 0.
> However, the earliest offset has also changed. I was under the impression
> that replicas must have the same offset range; otherwise it would confuse
> the consumer of the partition. For example, I ran into an issue during a
> failover test where my consumer tried to request an offset from a
> partition on the new leader, but the offset didn't exist (it was earlier
> than the earliest offset in that partition). Can anybody explain what is
> happening?
>
> Here is my code that prints the leader partition offset information:
> https://gist.github.com/lukeforehand/c37e22aea7192e00fff5
>
> Thanks,
> Luke
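For reference, that earliest-offset lookup looks roughly like this against
the 0.8 SimpleConsumer API (a sketch, not the gist's exact code; the broker
host and port are placeholders):

    import java.util.HashMap;
    import java.util.Map;

    import kafka.api.PartitionOffsetRequestInfo;
    import kafka.common.TopicAndPartition;
    import kafka.javaapi.OffsetResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class EarliestOffsetSketch {
        public static void main(String[] args) {
            // connect to the broker that currently leads the partition
            SimpleConsumer consumer = new SimpleConsumer(
                "broker-host", 9092, 100000, 64 * 1024, "earliest-offset-check");
            try {
                TopicAndPartition tp = new TopicAndPartition("feed", 0);
                Map<TopicAndPartition, PartitionOffsetRequestInfo> info =
                    new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
                // EarliestTime() asks for the first offset still on this broker's disk
                info.put(tp, new PartitionOffsetRequestInfo(
                    kafka.api.OffsetRequest.EarliestTime(), 1));
                OffsetResponse response = consumer.getOffsetsBefore(
                    new kafka.javaapi.OffsetRequest(
                        info, kafka.api.OffsetRequest.CurrentVersion(),
                        "earliest-offset-check"));
                long[] offsets = response.offsets("feed", 0);
                System.out.println("partition:0 earliest offset: " + offsets[0]);
            } finally {
                consumer.close();
            }
        }
    }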