From: Gary Yao
Subject: Sporadic exceptions when checkpointing to S3
Date: Wed, 27 Jul 2016 10:56:09 +0200
To: user@flink.apache.org

Hi all,

I am using the filesystem state backend with checkpointing to S3. From the JobManager logs, I can see that it works most of the time, e.g.,

    2016-07-26 17:49:07,311 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 3 @ 1469555347310
    2016-07-26 17:49:11,128 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 3 (in 3335 ms)

However, taking the checkpoint fails with the following exception from time to time:

    2016-07-26 17:50:07,310 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 4 @ 1469555407310
    2016-07-26 17:50:12,225 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - TriggerWindow(SlidingEventTimeWindows(3600000, 1000), ListStateDescriptor{name=window-contents, defaultValue=null, serializer=org.apache.flink.api.java.typeutils.runtime.PojoSerializer@103b8046}, EventTimeTrigger(), WindowedStream.apply(WindowedStream.java:226)) -> Sink: Unnamed (1/1) (0ec242b46c49039f673dc902fd983f49) switched from RUNNING to FAILED
    2016-07-26 17:50:12,227 INFO  org.apache.flink.runtime.jobmanager.JobManager - Status of job bd2930a4d6e7cf8d04d3bbafe22e386b ([...]) changed to FAILING.
    java.lang.RuntimeException: Error triggering a checkpoint as the result of receiving checkpoint barrier
        at org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:701)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:691)
        at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
        at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
        at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.RuntimeException: Failed to fetch state handle size
        at org.apache.flink.runtime.taskmanager.RuntimeEnvironment.acknowledgeCheckpoint(RuntimeEnvironment.java:234)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:528)
        at org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:695)
        ... 8 more
    Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: [...], AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: [...]
        at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
        at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.getFileStatus(HadoopFileSystem.java:351)
        at org.apache.flink.runtime.state.filesystem.AbstractFileStateHandle.getFileSize(AbstractFileStateHandle.java:93)
        at org.apache.flink.runtime.state.filesystem.FileStreamStateHandle.getStateSize(FileStreamStateHandle.java:58)
        at org.apache.flink.runtime.state.AbstractStateBackend$DataInputViewHandle.getStateSize(AbstractStateBackend.java:428)
        at org.apache.flink.streaming.runtime.tasks.StreamTaskStateList.getStateSize(StreamTaskStateList.java:77)
        at org.apache.flink.runtime.taskmanager.RuntimeEnvironment.acknowledgeCheckpoint(RuntimeEnvironment.java:231)
        ... 10 more

All logs are from the machine running the JobManager. The 403 status code suggests a permissions problem. However, I can see that checkpoints are stored correctly in the configured S3 bucket. Also, old checkpoints are sometimes not removed.

Is anybody else experiencing the same problem? Could S3 be flaky?
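To make the chain in the trace concrete: before acknowledging checkpoint 4, the task computes its total state size, each file-backed state handle asks the Hadoop file system for its file's metadata (S3AFileSystem.getFileStatus, which ends in AmazonS3Client.getObjectMetadata), and whatever that lookup throws is wrapped in "Failed to fetch state handle size". The Java sketch below models only that control flow. All names in it (StateSizeSketch, FileLengthLookup, stateSizeForAck, the my-bucket paths) are hypothetical stand-ins, not Flink's real classes; it illustrates why a single failed metadata request on one state file is enough to fail the whole acknowledgement, regardless of whether the data itself was written.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Hypothetical, simplified model of the size lookup in the stack trace.
 * Names are illustrative only; this is not Flink's actual API.
 */
public class StateSizeSketch {

    /** Hypothetical stand-in for FileSystem.getFileStatus(path).getLen(). */
    public interface FileLengthLookup {
        long fileLength(String path) throws Exception;
    }

    /**
     * Sums the sizes of all state files for the checkpoint acknowledgement.
     * Any single failed lookup aborts the whole sum and is wrapped, mirroring
     * "Failed to fetch state handle size" in the trace.
     */
    public static long stateSizeForAck(List<String> statePaths, FileLengthLookup fs) {
        long total = 0;
        try {
            for (String path : statePaths) {
                total += fs.fileLength(path);
            }
        } catch (Exception e) {
            throw new RuntimeException("Failed to fetch state handle size", e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint file paths; "my-bucket" is a placeholder.
        List<String> paths = Arrays.asList("s3://my-bucket/chk-4/a", "s3://my-bucket/chk-4/b");

        // Every metadata lookup succeeds, so the sizes are summed.
        System.out.println(stateSizeForAck(paths, path -> 1024L)); // prints 2048

        // One lookup fails (simulating the 403), so the whole acknowledgement fails.
        try {
            stateSizeForAck(paths, path -> {
                if (path.endsWith("/b")) {
                    throw new IllegalStateException("Status Code: 403, AWS Error Message: Forbidden");
                }
                return 1024L;
            });
        } catch (RuntimeException e) {
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```

The point of the model is the all-or-nothing shape: the size is only a statistic reported to the JobManager, yet an exception while computing it fails the task.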
Find below my configuration:

Flink 1.0.3

libs/
    aws-java-sdk-1.7.4.jar
    hadoop-aws-2.7.2.jar
    httpclient-4.2.5.jar
    httpcore-4.2.5.jar

contents of core-site.xml:

    <configuration>
        <property>
            <name>fs.s3.impl</name>
            <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/data/tmp/hadoop</value>
        </property>
        <property>
            <name>fs.s3a.attempts.maximum</name>
            <value>10</value>
        </property>
        <property>
            <name>fs.s3a.endpoint</name>
            <value>s3-eu-west-1.amazonaws.com</value>
        </property>
    </configuration>

Best,
Gary
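With the core-site.xml above, the fs.s3.impl entry maps s3:// URIs to S3AFileSystem, which is consistent with the S3AFileSystem.getFileStatus frames in the trace: Hadoop looks up the implementation class for a URI scheme under the key fs.<scheme>.impl. The Java sketch below reads the <property> entries of such a file with the JDK's DOM parser and performs that lookup, as a quick way to check which class a checkpoint URI would resolve to. It is a simplified illustration, not Hadoop's own resolution logic (which also consults core-default.xml and service-loaded file systems). The class and method names and the s3://my-bucket/... checkpoint URI are hypothetical, since the post does not show the actual checkpoint path.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads Hadoop-style properties and resolves fs.<scheme>.impl. */
public class CoreSiteSchemeCheck {

    /** Parses <property><name>..</name><value>..</value></property> pairs into an ordered map. */
    public static Map<String, String> readProperties(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                props.put(name, value);
            }
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse core-site.xml content", e);
        }
    }

    /** Returns the class this file configures for the URI's scheme, or null if it sets none. */
    public static String implFor(URI uri, Map<String, String> props) {
        return props.get("fs." + uri.getScheme() + ".impl");
    }

    public static void main(String[] args) {
        String coreSite = "<configuration>"
                + "<property><name>fs.s3.impl</name><value>org.apache.hadoop.fs.s3a.S3AFileSystem</value></property>"
                + "<property><name>fs.s3a.attempts.maximum</name><value>10</value></property>"
                + "<property><name>fs.s3a.endpoint</name><value>s3-eu-west-1.amazonaws.com</value></property>"
                + "</configuration>";
        Map<String, String> props = readProperties(coreSite);

        // Hypothetical checkpoint directory; "my-bucket" is a placeholder.
        URI checkpointDir = URI.create("s3://my-bucket/flink/checkpoints");
        System.out.println(implFor(checkpointDir, props)); // prints org.apache.hadoop.fs.s3a.S3AFileSystem
        System.out.println(props.get("fs.s3a.attempts.maximum")); // prints 10
    }
}
```

A null result for a scheme (for example s3a:// with this file alone) only means this file does not set the key; Hadoop may still find a default elsewhere.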