Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 5BCD2200CC8 for ; Fri, 14 Jul 2017 09:56:21 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 5A44B16D688; Fri, 14 Jul 2017 07:56:21 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 4F77116D686 for ; Fri, 14 Jul 2017 09:56:20 +0200 (CEST) Received: (qmail 78940 invoked by uid 500); 14 Jul 2017 07:56:19 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@flink.apache.org Received: (qmail 78929 invoked by uid 99); 14 Jul 2017 07:56:19 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Jul 2017 07:56:18 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 860C41A0742 for ; Fri, 14 Jul 2017 07:56:18 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.379 X-Spam-Level: ** X-Spam-Status: No, score=2.379 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, RCVD_IN_SORBS_SPAM=0.5, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd2-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id zOAqMtabPWpi for ; Fri, 14 Jul 2017 07:56:16 +0000 (UTC) Received: from mail-ua0-f180.google.com (mail-ua0-f180.google.com [209.85.217.180]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTPS id 72ABC5FB61 for ; Fri, 14 Jul 2017 07:56:16 +0000 (UTC) Received: by mail-ua0-f180.google.com with SMTP id 35so605105uax.3 for ; Fri, 14 Jul 2017 00:56:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=ZJf28v1X4ptl+2YNO+ixMBcTNP/4l7Z9MOSSZL0cpLw=; b=NgUv9DzkrSulHMfuM9cvTfDhM+4/bVCd79O3AyGoy7jNTLtWYx/i7coa6JM8dHVF6m I/AaDs9uZpCf5sugLuFkLR02KuSJvx7k1pQbjQmFGDg2CdY1mO930FdSVVJIMZnmYDNm Slzkb/VArvV2a1rwAD4P5goPsL7whfpyMhoGszN7q5zo0satIbJh6+lHpkKdDY6HI7Zi zJahu6rDy9sTJXbVeZWaMP8tqn9+u7M3hzF1kLXKyC0tVIIk3EufpBIVIXxQIx8u5/I6 VmGZkChU9rlCvUo4y/RYmAuBQd9lt0B806PXUziUEmdzESFM+tDenz/6275tgp5Ud7Jj UzLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ZJf28v1X4ptl+2YNO+ixMBcTNP/4l7Z9MOSSZL0cpLw=; b=VsOJ1YUuZ7kQU7kZpaaDXXbcRx7EPQ8L0uPhgoaxprAFLAdttZymzIDzqH3tDkVTxV qSDec4KUKtuGSDy7nM8mYJov/uc1Mc+FzALWYAMV9Uf3jr8m69YumudXVO3qeY1tkrvp nuy+yWkmP3AmA424A8a0gg8tmXbzbFW8AaTTjg8CvXW5n9eH/zGm3zjJp3ApuUWCvnFK J0rC/FqD/QL/WI9Gz4t8tNsXs01YXshX7ORMQqSBtSF8F8aU5WkhV6XzicBtQe/PH2Tg bP4Ms14FExwwt3PFrk5ygcwKAg5xjddLG7eNJ96RdUYCL7O5wmBAQ8zJgRd6XTjfKwA3 ur6A== X-Gm-Message-State: AIVw113ulGs6nNwD2+lxqRTWc9QU91BJGvSpdV9n71UZqNwAJSkWyIDG Jtw5smZUP/kcAwwfgZyRr91llYqexw== X-Received: by 10.159.53.45 with SMTP id o42mr5082544uao.84.1500018976104; Fri, 14 Jul 2017 00:56:16 -0700 (PDT) MIME-Version: 1.0 References: <4C7E8CF7-18B3-4A4A-834B-AB9AFBD442DB@data-artisans.com> In-Reply-To: From: =?UTF-8?Q?Gyula_F=C3=B3ra?= Date: Fri, 14 Jul 2017 07:56:05 +0000 Message-ID: Subject: Re: Why would a kafka source checkpoint take so long? To: Stephan Ewen Cc: user Content-Type: multipart/alternative; boundary="94eb2c03cd5e3a2877055442627a" archived-at: Fri, 14 Jul 2017 07:56:21 -0000 --94eb2c03cd5e3a2877055442627a Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi, I have seen this again yesterday, now with some logging it looks like acquiring the lock took all the time. In this case it was pretty clear that the job started falling behind a few minutes before starting the checkpoint so backpressure seems to be the culprit. Thanks, Gyula Stephan Ewen ezt =C3=ADrta (id=C5=91pont: 2017. j=C3=BAl= . 12., Sze, 15:27): > Can it be that the checkpoint thread is waiting to grab the lock, which i= s > held by the chain under backpressure? > > On Wed, Jul 12, 2017 at 12:23 PM, Gyula F=C3=B3ra = wrote: > >> Yes thats definitely what I am about to do next but just thought maybe >> someone has seen this before. >> >> Will post info next time it happens. (Not guaranteed to happen soon as i= t >> didn't happen for a long time before) >> >> Gyula >> >> On Wed, Jul 12, 2017, 12:13 Stefan Richter >> wrote: >> >>> Hi, >>> >>> could you introduce some logging to figure out from which method call >>> the delay is introduced? >>> >>> Best, >>> Stefan >>> >>> Am 12.07.2017 um 11:37 schrieb Gyula F=C3=B3ra : >>> >>> Hi, >>> >>> We are using the latest 1.3.1 >>> >>> Gyula >>> >>> Urs Schoenenberger ezt =C3=ADrta (id= =C5=91pont: >>> 2017. j=C3=BAl. 12., Sze, 10:44): >>> >>>> Hi Gyula, >>>> >>>> I don't know the cause unfortunately, but we observed a similiar issue >>>> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1. >>>> Which version are you running on? >>>> >>>> Urs >>>> >>>> On 12.07.2017 09:48, Gyula F=C3=B3ra wrote: >>>> > Hi, >>>> > >>>> > I have noticed a strange behavior in one of our jobs: every once in = a >>>> while >>>> > the Kafka source checkpointing time becomes extremely large compared >>>> to >>>> > what it usually is. (To be very specific it is a kafka source chaine= d >>>> with >>>> > a stateless map operator) >>>> > >>>> > To be more specific checkpointing the offsets usually takes around >>>> 10ms >>>> > which sounds reasonable but in some checkpoints this goes into the 3= -5 >>>> > minutes range practically blocking the job for that period of time. >>>> > Yesterday I have observed even 10 minute delays. First I thought tha= t >>>> some >>>> > sources might trigger checkpoints later than others, but adding some >>>> > logging and comparing it it seems that the triggerCheckpoint was >>>> received >>>> > at the same time. >>>> > >>>> > Interestingly only one of the 3 kafka sources in the job seems to be >>>> > affected (last time I checked at least). We are still using the 0.8 >>>> > consumer with commit on checkpoints. Also I dont see this happen in >>>> other >>>> > jobs. >>>> > >>>> > Any clue on what might cause this? >>>> > >>>> > Thanks :) >>>> > Gyula >>>> > >>>> > >>>> > >>>> > Hi, >>>> > >>>> > I have noticed a strange behavior in one of our jobs: every once in = a >>>> > while the Kafka source checkpointing time becomes extremely large >>>> > compared to what it usually is. (To be very specific it is a kafka >>>> > source chained with a stateless map operator) >>>> > >>>> > To be more specific checkpointing the offsets usually takes around >>>> 10ms >>>> > which sounds reasonable but in some checkpoints this goes into the 3= -5 >>>> > minutes range practically blocking the job for that period of time. >>>> > Yesterday I have observed even 10 minute delays. First I thought tha= t >>>> > some sources might trigger checkpoints later than others, but adding >>>> > some logging and comparing it it seems that the triggerCheckpoint wa= s >>>> > received at the same time. >>>> > >>>> > Interestingly only one of the 3 kafka sources in the job seems to be >>>> > affected (last time I checked at least). We are still using the 0.8 >>>> > consumer with commit on checkpoints. Also I dont see this happen in >>>> > other jobs. >>>> > >>>> > Any clue on what might cause this? >>>> > >>>> > Thanks :) >>>> > Gyula >>>> >>>> -- >>>> Urs Sch=C3=B6nenberger - urs.schoenenberger@tngtech.com >>>> >>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterf=C3=B6hring >>>> Gesch=C3=A4ftsf=C3=BChrer: Henrik Klagges, Christoph Stock, Dr. Robert= Dahlke >>>> Sitz: Unterf=C3=B6hring * Amtsgericht M=C3=BCnchen * HRB 135082 >>>> >>> >>> > --94eb2c03cd5e3a2877055442627a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi,

I have seen this again yesterday, n= ow with some logging it looks like acquiring the lock took all the time. In= this case it was pretty clear that the job started falling behind a few mi= nutes before starting the checkpoint so backpressure seems to be the culpri= t.=C2=A0

Thanks,
Gyula

Stephan Ewen <sewen@apache.org> ezt =C3=ADrta (id=C5=91pont: 2017. j= =C3=BAl. 12., Sze, 15:27):
Can it be that the checkpoint thread is waiting to grab the lock, = which is held by the chain under backpressure?

On Wed, Jul 12, 2017 at= 12:23 PM, Gyula F=C3=B3ra <gyula.fora@gmail.com> wrote:<= br>

Yes thats definitely what = I am about to do next but just thought maybe someone has seen this before.<= /p>

Will post info next time it happens. (Not guaranteed to happ= en soon as it didn't happen for a long time before)

Gyula


On Wed, Jul 12, 2017, 12:13= Stefan Richter <s.richter@data-artisans.com> wrote:
Hi,

<= div>could you introduce some logging to figure out from which method call t= he delay is introduced?

Best,
Stefan

Am 12.07.2017 um 11:37 schrieb Gyula F=C3=B3ra <gyula.fora@gmail.com= >:

Hi,

<= /div>
We are using the latest 1.3.1

Gyula

Urs Schoenenberger &= lt;urs.= schoenenberger@tngtech.com> ezt =C3=ADrta (id=C5=91pont: 2017. j=C3= =BAl. 12., Sze, 10:44):
Hi Gyula,
I don't know the cause unfortunately, but we observed a similiar issue<= br> on Flink 1.1.3. The problem seems to be gone after upgrading to 1.2.1.
Which version are you running on?

Urs

On 12.07.2017 09:48, Gyula F=C3=B3ra wrote:
> Hi,
>
> I have noticed a strange behavior in one of our jobs: every once in a = while
> the Kafka source checkpointing time becomes extremely large compared t= o
> what it usually is. (To be very specific it is a kafka source chained = with
> a stateless map operator)
>
> To be more specific checkpointing the offsets usually takes around 10m= s
> which sounds reasonable but in some checkpoints this goes into the 3-5=
> minutes range practically blocking the job for that period of time. > Yesterday I have observed even 10 minute delays. First I thought that = some
> sources might trigger checkpoints later than others, but adding some > logging and comparing it it seems that the triggerCheckpoint was recei= ved
> at the same time.
>
> Interestingly only one of the 3 kafka sources in the job seems to be > affected (last time I checked at least). We are still using the 0.8 > consumer with commit on checkpoints. Also I dont see this happen in ot= her
> jobs.
>
> Any clue on what might cause this?
>
> Thanks :)
> Gyula
>
>
>
> Hi,
>
> I have noticed a strange behavior in one of our jobs: every once in a<= br> > while the Kafka source checkpointing time becomes extremely large
> compared to what it usually is. (To be very specific it is a kafka
> source chained with a stateless map operator)
>
> To be more specific checkpointing the offsets usually takes around 10m= s
> which sounds reasonable but in some checkpoints this goes into the 3-5=
> minutes range practically blocking the job for that period of time. > Yesterday I have observed even 10 minute delays. First I thought that<= br> > some sources might trigger checkpoints later than others, but adding > some logging and comparing it it seems that the triggerCheckpoint was<= br> > received at the same time.
>
> Interestingly only one of the 3 kafka sources in the job seems to be > affected (last time I checked at least). We are still using the 0.8 > consumer with commit on checkpoints. Also I dont see this happen in > other jobs.
>
> Any clue on what might cause this?
>
> Thanks :)
> Gyula

--
Urs Sch=C3=B6nenberger - urs.schoenenberger@tngtech.com

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterf=C3=B6hring
Gesch=C3=A4ftsf=C3=BChrer: Henrik Klagges, Christoph Stock, Dr. Robert Dahl= ke
Sitz: Unterf=C3=B6hring * Amtsgericht M=C3=BCnchen * HRB 135082


--94eb2c03cd5e3a2877055442627a--