Subject: Re: Telling if a job has caught up with Kafka
From: Florian König <florian.koenig@micardo.com>
Date: Fri, 17 Mar 2017 11:07:19 +0100
To: user@flink.apache.org
archived-at: Fri, 17 Mar 2017 10:07:39 -0000

Hi,

Thank you, Gyula, for posting that question. I'd also be interested in how this could be done.

You mentioned the dependency on the commit frequency. I'm using https://github.com/quantifind/KafkaOffsetMonitor to watch the offsets. With the 0.8 Kafka consumer, a job's offsets, as shown in the diagrams, were updated far more often than the checkpointing interval. With the 0.10 consumer, a commit is only made after a successful checkpoint (or so it seems).

Why is that so? The checkpoint contains the Kafka offset, so a restored job can resume reading wherever it left off, regardless of any offset stored in Kafka or ZooKeeper. Why is the offset not committed regularly, independently of the checkpointing? Or did I misconfigure something?
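
To make concrete what I mean by "caught up": as far as I understand it, the lag a tool like KafkaOffsetMonitor displays is, per partition, the newest offset in the log minus the last committed offset of the group. Below is a minimal Python sketch of that arithmetic. The function names, the `max_lag` threshold and all the numbers are made up for illustration; fetching the real offsets from Kafka or ZooKeeper is left out. Note that if offsets are only committed on checkpoints, the committed values can trail the job's real position by up to one checkpoint interval, so the lag computed this way is an upper bound.

```python
from typing import Dict


def partition_lag(reference: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition: reference offset minus committed offset, floored at 0."""
    lag = {}
    for partition, ref_offset in reference.items():
        # Assumption: a partition without any commit yet counts as offset 0,
        # i.e. as fully behind.
        done = committed.get(partition, 0)
        lag[partition] = max(ref_offset - done, 0)
    return lag


def is_caught_up(reference: Dict[int, int], committed: Dict[int, int], max_lag: int = 0) -> bool:
    """True if no partition lags by more than max_lag records."""
    return all(l <= max_lag for l in partition_lag(reference, committed).values())


if __name__ == "__main__":
    # Hypothetical offsets for a three-partition topic.
    log_end = {0: 1200, 1: 980, 2: 1500}
    committed = {0: 1200, 1: 950, 2: 1490}
    print(partition_lag(log_end, committed))
    print(is_caught_up(log_end, committed, max_lag=50))
```

The same check would cover Gyula's second question: pass job1's committed offsets as `reference` and job2's as `committed`, provided the two jobs commit under different group ids.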

Thanks
Florian

> On 17 Mar 2017, at 10:26, Gyula Fóra <gyfora@apache.org> wrote:
>
> Hi All,
>
> I am wondering if anyone has some nice suggestions on what would be the simplest/best way of telling if a job is caught up with the Kafka input.
> An alternative question would be how to tell if a job is caught up to another job reading from the same topic.
>
> The first thing that comes to my mind is looking at the offsets Flink commits to Kafka. However, this will only work if every job uses a different group id, and even then it is not very reliable, depending on the commit frequency.
>
> The use case I am trying to solve is a fault-tolerant update of a job: taking a savepoint for job1, starting job2 from the savepoint, waiting until it catches up, and then killing job1.
>
> Thanks for your input!
> Gyula