From: Aljoscha Krettek <aljoscha@apache.org>
Date: Mon, 17 Oct 2016 13:44:47 +0000
Subject: Re: job failure with checkpointing enabled
To: user@flink.apache.org

Ok, thanks for the update!

Let me know if you run into any more problems.

On Mon, 17 Oct 2016 at 14:40 <robert.lancaster@hyatt.com> wrote:

> Hi Aljoscha,
>
> Thanks for the response.
>
> To answer your question, the base path did not exist. But, I think I
> found the issue. I believe I had some rogue task managers running. As a
> troubleshooting step, I attempted to restart my cluster. However, after
> shutting down the cluster I noticed that there were still task managers
> running on most of my nodes (and on the master). Interestingly, on a
> second attempt to shut down the cluster, I received the message “No
> taskmanager daemon to stop on host…” for each of my nodes, even though I
> could see the flink processes running on these machines. After manually
> killing these processes and restarting the cluster, the problem went away.
>
> So, my assumption is that on a previous attempt to bounce the cluster,
> these processes did not shut down cleanly. Starting the cluster after that
> *may* have resulted in second instances of the task manager running on
> most nodes. I’m not certain, however, and I haven’t yet been able to
> reproduce the issue.
>
> From: Aljoscha Krettek <aljoscha@apache.org>
> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
> Date: Friday, October 14, 2016 at 6:57 PM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: job failure with checkpointing enabled
>
> Hi,
>
> the file that Flink is trying to create there is not meant to be in the
> checkpointing location. It is a local file that is used for buffering
> elements until a checkpoint barrier arrives (for certain cases). Can you
> check whether the base path where it is trying to create that file exists?
> For the exception that you posted that would be:
> /tmp/flink-io-202fdf67-3f8c-47dd-8ebc-2265430644ed
>
> Cheers,
> Aljoscha
>
> On Fri, 14 Oct 2016 at 17:37 <robert.lancaster@hyatt.com> wrote:
>
> > I recently tried enabling checkpointing in a job (that previously worked
> > w/o checkpointing) and received the following failure on job execution:
> >
> > java.io.FileNotFoundException: /tmp/flink-io-202fdf67-3f8c-47dd-8ebc-2265430644ed/a426eb27761575b3b79e464719bba96e16a1869d85bae292a2ef7eb72fa8a14c.0.buffer (No such file or directory)
> >         at java.io.RandomAccessFile.open0(Native Method)
> >         at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> >         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> >         at org.apache.flink.streaming.runtime.io.BufferSpiller.createSpillingChannel(BufferSpiller.java:247)
> >         at org.apache.flink.streaming.runtime.io.BufferSpiller.<init>(BufferSpiller.java:117)
> >         at org.apache.flink.streaming.runtime.io.BarrierBuffer.<init>(BarrierBuffer.java:94)
> >         at org.apache.flink.streaming.runtime.io.StreamInputProcessor.<init>(StreamInputProcessor.java:96)
> >         at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:49)
> >         at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:239)
> >         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> >         at java.lang.Thread.run(Thread.java:745)
> >
> > The job then restarts and fails again in an endless cycle.
> >
> > This feels like a configuration issue. My guess is that Flink is looking
> > for the file above on local storage, though we’ve configured checkpointing
> > to use hdfs (see below).
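
A note on the guess above: the .buffer file in the trace is a local spill file, so the HDFS checkpoint directory is not involved in this error. A minimal sketch for checking the local side, assuming plain Java run on the affected TaskManager host. SpillDirCheck and describe are hypothetical names used for illustration; they are not part of Flink.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SpillDirCheck {

    /** Describes the state of a local directory that a process needs to write into. */
    public static String describe(Path dir) {
        if (!Files.exists(dir)) {
            return "missing";
        }
        if (!Files.isDirectory(dir)) {
            return "not a directory";
        }
        return Files.isWritable(dir) ? "ok" : "not writable";
    }

    public static void main(String[] args) {
        // Default is the directory from the exception in this thread;
        // pass a different path as the first argument to check another one.
        String path = args.length > 0
                ? args[0]
                : "/tmp/flink-io-202fdf67-3f8c-47dd-8ebc-2265430644ed";
        Path dir = Paths.get(path);
        System.out.println(dir + ": " + describe(dir));

        // Also report on the parent, since the spill directory lives inside it.
        Path parent = dir.getParent();
        if (parent != null) {
            System.out.println(parent + ": " + describe(parent));
        }
    }
}
```

Run on each worker, a "missing" result for the flink-io directory would match the FileNotFoundException in the trace.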
> > To enable checkpointing, this is what I did:
> >
> > env.enableCheckpointing(3000l);
> >
> > Relevant configurations in flink-conf.yaml:
> >
> > state.backend: filesystem
> > state.backend.fs.checkpointdir: hdfs://myhadoopnamenode:8020/apps/flink/checkpoints
> >
> > Note, the directory we’ve configured is not the same as the path indicated
> > in the error.
> >
> > Interestingly, there are plenty of subdirs in my checkpoints directory;
> > these appear to correspond to job start times, even though these jobs don’t
> > have checkpointing enabled:
> >
> > drwxr-xr-x   - rtap hdfs          0 2016-10-13 07:48 /apps/flink/checkpoints/b4870565f148cff10478dca8bff27bf7
> > drwxr-xr-x   - rtap hdfs          0 2016-10-13 08:27 /apps/flink/checkpoints/044b21a0f252b6142e7ddfee7bfbd7d5
> > drwxr-xr-x   - rtap hdfs          0 2016-10-13 08:36 /apps/flink/checkpoints/a658b23c2d2adf982a2cf317bfb3d3de
> > drwxr-xr-x   - rtap hdfs          0 2016-10-14 07:38 /apps/flink/checkpoints/1156bd1796105ad95a8625cb28a0b816
> > drwxr-xr-x   - rtap hdfs          0 2016-10-14 07:41 /apps/flink/checkpoints/58fdd94b7836a3b3ed9abc5c8f3a1dd5
> > drwxr-xr-x   - rtap hdfs          0 2016-10-14 07:43 /apps/flink/checkpoints/47a849a8ed6538b9e7d3826a628d38b9
> > drwxr-xr-x   - rtap hdfs          0 2016-10-14 07:49 /apps/flink/checkpoints/e6a9e2300ea5c36341fa160adab789f0
> >
> > Thanks!
> >
> > ------------------------------
> > The information contained in this communication is confidential and
> > intended only for the use of the recipient named above, and may be legally
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified that
> > any dissemination, distribution or copying of this communication is
> > strictly prohibited. If you have received this communication in error,
> > please resend it to the sender and delete the original message and copy of
> > it from your computer system. Opinions, conclusions and other information
> > in this message that do not relate to our official business should be
> > understood as neither given nor endorsed by the company.
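
On the leftover task manager processes described at the top of this thread: a small sketch for spotting JVMs that survived a cluster shutdown, assuming Java 9 or later for the ProcessHandle API. LingeringJvmFinder and the marker string "taskmanager" are assumptions for illustration; check what the command lines on your hosts actually look like before relying on the filter.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class LingeringJvmFinder {

    /** Keeps only the entries that contain the marker, ignoring case. */
    public static List<String> matching(List<String> commandLines, String marker) {
        String needle = marker.toLowerCase(Locale.ROOT);
        return commandLines.stream()
                .filter(cmd -> cmd.toLowerCase(Locale.ROOT).contains(needle))
                .collect(Collectors.toList());
    }

    /**
     * Returns "pid commandLine" for every process visible to this JVM.
     * The command line may be empty when the OS does not expose it,
     * for example for processes owned by another user.
     */
    public static List<String> visibleCommandLines() {
        return ProcessHandle.allProcesses()
                .map(p -> p.pid() + " " + p.info().commandLine().orElse(""))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "taskmanager" is an assumed marker, not a documented Flink constant.
        for (String line : matching(visibleCommandLines(), "taskmanager")) {
            System.out.println(line);
        }
    }
}
```

Running this on each node after a shutdown should print nothing; any output points at a process that may need to be stopped by hand before the cluster is started again.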