Date: Thu, 5 Apr 2018 06:10:58 -0700 (MST)
From: Edward
To:
user@flink.apache.org
Subject: Re: Checkpoints very slow with high backpressure

I read through this thread and didn't see any resolution to the slow checkpoint issue (only that someone resolved their backpressure issue). We are experiencing the same problem:

- When there is no backpressure, checkpoints take less than 100 ms.
- When there is high backpressure, checkpoints take anywhere from 5 minutes to 25 minutes.

This is preventing us from using the checkpointing feature at all, since periodic backpressure is unavoidable.

We are seeing this on Flink 1.4.0. We retain only a single checkpoint, and the size of the retained checkpoint is less than 250 KB, so there is not a lot of state. Our settings:

state.backend: jobmanager
state.backend.async: true
state.backend.fs.checkpointdir: hdfs://checkpoints
state.checkpoints.num-retained: 1
max concurrent checkpoints: 1
checkpointing mode: AT_LEAST_ONCE

One other data point: if I rewrite the job so that all steps can be chained (i.e. the same parallelism on all steps, so they fit in one task slot), checkpoints are still slow under backpressure, but they are an order of magnitude faster: about 60 seconds rather than 15 minutes.

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
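One way to reason about the chained versus unchained timings above is a rough back-of-the-envelope model. It assumes that checkpoint barriers travel in-band with the records, so at each network exchange a barrier has to wait behind whatever is already buffered, and that chained operators have no exchange between them. The function name, the queue depths, and the throughput figure below are all hypothetical; none of them were measured on the job described here.

```python
def barrier_delay_seconds(queued_records_per_hop, records_per_second):
    """Estimate how long a barrier needs to pass every network hop.

    queued_records_per_hop: records buffered ahead of the barrier at each
        exchange between non-chained tasks (hypothetical values).
    records_per_second: rate at which the slow downstream task drains
        its input while backpressured (hypothetical value).
    """
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    # The barrier waits for each hop's backlog in turn, so delays add up.
    return sum(q / records_per_second for q in queued_records_per_hop)


# Illustrative numbers only: five exchanges vs. a single exchange,
# each with the same backlog, drained at the same rate.
unchained = barrier_delay_seconds([30_000] * 5, 200.0)
chained = barrier_delay_seconds([30_000], 200.0)
print(f"unchained: {unchained:.0f} s, chained: {chained:.0f} s")
```

Under these assumptions the delay scales with the number of exchanges and with the backlog at each one. That is at least consistent in direction with chaining making checkpoints faster without making them fast.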