From: Luke Forehand <luke.forehand@networkedinsights.com>
To: dev@kafka.apache.org
Subject: RE: replicas have different earliest offset
Date: Wed, 28 Aug 2013 23:59:44 +0000

Jay, great information, thank you. I am in a testing phase, so I have been
continually resetting the commit offsets of my consumers before re-running
consumer performance tests. I realize now that my retention policy was set to
7 days, and that I had added 3 new brokers at day 5 and reassigned partitions
to these new brokers.
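For reference, the broker-side settings in play here look roughly like this
(a sketch; the property names are as I understand them from the 0.8 docs, and
only the 7-day value is the real one from my setup):

    # server.properties (sketch)
    log.retention.hours=168       # keep log segments for ~7 days
    log.segment.bytes=1073741824  # roll a new segment file every ~1GB
    # retention removes whole segment files only, never part of one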
So it seems the partitions owned by the original broker 0 have rolled, but
reassigning partitions to brokers 1, 2, and 3 has reset the retention clock
for those partitions. For the sake of consistency, maybe the current state of
the retention policy could be sent to the new broker during partition
reassignment; that way, partitions on brokers 1, 2, and 3 would roll at
roughly the same time as the partitions on broker 0. Although, like you said,
it's a lower bound and perhaps not that important (just slightly confusing
when a noob is trying to spot-check the validity of a replica). In the
meantime I will disable the retention policy and start consuming at an offset
that is in the range of all replicas. Thank you again!
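If I've understood the lower-bound behavior you describe below, the
per-partition cleanup decision amounts to roughly the following (my own
sketch of the model, not the broker's actual code):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    public class RetentionModelSketch {
        // Segments are ordered oldest-first. A segment is only eligible for
        // deletion once all of it has aged past the retention window, so the
        // window is a lower bound on what is kept. A reassigned replica
        // writes fresh segment files with new modification times, which is
        // what restarts the clock on brokers 1, 2, and 3.
        static List<File> expiredSegments(List<File> oldestFirst,
                                          long retentionMs, long nowMs) {
            List<File> toDelete = new ArrayList<File>();
            for (File segment : oldestFirst) {
                if (nowMs - segment.lastModified() <= retentionMs) {
                    break; // this segment and all newer ones are kept
                }
                toDelete.add(segment);
            }
            return toDelete;
        }
    }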
Luke Forehand | NetworkedInsights.com | Software Engineer

________________________________________
From: Jay Kreps
Sent: Wednesday, August 28, 2013 5:29 PM
To: dev@kafka.apache.org
Subject: Re: replicas have different earliest offset

On a single server, our retention window is always approximate and a lower
bound on what is retained, since we only discard full log segments at a time.
That is, if you say you want to retain 100GB and have a 1GB segment size, we
will discard the oldest segment only when doing so would not bring the
retained data below 100GB (and similarly with time).

Between servers, no attempt is made to synchronize the discard of data. That
is, it is likely that all replicas will discard at roughly the same time, but
this is purely a local computation for each of them. Since it is approximate
and a lower bound, it does not seem useful to try to synchronize it further.

If your consumers are bumping up against the retention window so closely that
they may actually be falling off, that is a problem. Indeed, even in the
absence of a leader change, it is likely that if you are lagging this much
you will eventually fall off the end of the retention window on the leader.
So this is either a problem of the retention being too small (double it) or
of the consumer being fundamentally unable to keep up (in which case no
amount of retention will help).

-Jay


On Wed, Aug 28, 2013 at 2:51 PM, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

> I'm running into strange behavior when testing failure scenarios. I have
> 4 brokers and 8 partitions for a topic called "feed". I wrote a piece of
> code that prints out the partitionId, leaderId, and earliest offset for
> each partition.
>
> Here is the printed information about partition leader earliest offsets:
>
> partition:0 leader:0 offset: 1676913
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:0 offset: 1676760
> partition:4 leader:0 offset: 1676635
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:0 offset: 1676101
>
> I then kill broker 0 (using kill <pid>) and re-run my program:
>
> partition:0 leader:1 offset: 0
> partition:1 leader:1 offset: 0
> partition:2 leader:2 offset: 0
> partition:3 leader:3 offset: 0
> partition:4 leader:1 offset: 0
> partition:5 leader:1 offset: 0
> partition:6 leader:2 offset: 0
> partition:7 leader:1 offset: 0
>
> As you can see, the leaders have changed wherever the leader was broker 0.
> However, the earliest offset has also changed. I was under the impression
> that replicas must have the same offset range; otherwise it would confuse
> the consumer of the partition. For example, I ran into an issue during a
> failover test where my consumer tried to request an offset from a
> partition on the new leader, but the offset didn't exist (it was earlier
> than the earliest offset in that partition). Can anybody explain what is
> happening?
>
> Here is my code that prints the leader partition offset information:
> https://gist.github.com/lukeforehand/c37e22aea7192e00fff5
>
> Thanks,
> Luke
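For reference, that earliest-offset lookup looks roughly like this against
the 0.8 SimpleConsumer API (a sketch, not the gist's exact code; the broker
host and port are placeholders):

    import java.util.HashMap;
    import java.util.Map;

    import kafka.api.PartitionOffsetRequestInfo;
    import kafka.common.TopicAndPartition;
    import kafka.javaapi.OffsetResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class EarliestOffsetSketch {
        public static void main(String[] args) {
            // connect to the broker that currently leads the partition
            SimpleConsumer consumer = new SimpleConsumer(
                "broker-host", 9092, 100000, 64 * 1024, "earliest-offset-check");
            try {
                TopicAndPartition tp = new TopicAndPartition("feed", 0);
                Map<TopicAndPartition, PartitionOffsetRequestInfo> info =
                    new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
                // EarliestTime() asks for the first offset still on this broker's disk
                info.put(tp, new PartitionOffsetRequestInfo(
                    kafka.api.OffsetRequest.EarliestTime(), 1));
                OffsetResponse response = consumer.getOffsetsBefore(
                    new kafka.javaapi.OffsetRequest(
                        info, kafka.api.OffsetRequest.CurrentVersion(),
                        "earliest-offset-check"));
                long[] offsets = response.offsets("feed", 0);
                System.out.println("partition:0 earliest offset: " + offsets[0]);
            } finally {
                consumer.close();
            }
        }
    }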