From: ANDREA SPINA <74598@studenti.unimore.it>
Date: Tue, 28 Jun 2016 11:57:27 +0200
Subject: Flink java.io.FileNotFoundException Exception with Peel Framework
To: user@flink.apache.org

Hi everyone,

I am running some Flink experiments with the Peel benchmarking framework (http://peel-framework.org/) and I am struggling with exceptions. The environment is a
25-node cluster with 16 cores per node. The dataset is ~80 GiB and is stored on HDFS 2.7.1.

At the beginning I tried a degree of parallelism of 400 with the following configuration:

    jobmanager.rpc.address = ${runtime.hostname}
    akka.log.lifecycle.events = ON
    akka.ask.timeout = 300s
    jobmanager.rpc.port = 6002
    jobmanager.heap.mb = 1024
    jobmanager.web.port = 6004
    taskmanager.heap.mb = 28672
    taskmanager.memory.fraction = 0.7
    taskmanager.network.numberOfBuffers = 32768
    taskmanager.network.bufferSizeInBytes = 16384
    taskmanager.tmp.dirs = "/data/1/peel/flink/tmp:/data/2/peel/flink/tmp:/data/3/peel/flink/tmp:/data/4/peel/flink/tmp"
    taskmanager.debug.memory.startLogThread = true

With this setup the following exception is raised:

    Caused by: java.io.IOException: Insufficient number of network buffers: required 350, but only 317 available. The total number of network buffers is currently set to 32768. You can increase this number by setting the configuration key 'taskmanager.network.numberOfBuffers'.
        at org.apache.flink.runtime.io.network.buffer.NetworkBufferPool.createBufferPool(NetworkBufferPool.java:196)
        at org.apache.flink.runtime.io.network.NetworkEnvironment.registerTask(NetworkEnvironment.java:327)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:469)
        at java.lang.Thread.run(Thread.java:745)

So I tried different solutions, both increasing numberOfBuffers (maximum value tried: 98304) and decreasing the degree of parallelism (minimum value tried: 300); testing those configurations against a dummy dataset seems to solve the network buffer issue. But in each case, with the 80 GiB dataset I now run into a new exception. The following occurs with a degree of parallelism of 300 and numberOfBuffers = 32768:

    org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
        at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
        at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
        at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
        at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:60)
        at org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:652)
        at dima.tu.berlin.benchmark.flink.mlr.FlinkSLRTrainCommon$.main(FlinkSLRTrainCommon.scala:110)
        at dima.tu.berlin.benchmark.flink.mlr.FlinkSLRTrainCommon.main(FlinkSLRTrainCommon.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
        at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
        at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
        at org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
    Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    Caused by: java.lang.RuntimeException: Emitting the record caused an I/O exception: Channel to path '/data/3/peel/flink/tmp/flink-io-3a97d62c-ada4-44f1-ab72-dd018386f9aa/79305169d69f7e8b361a175af87353fa.channel' could not be opened.
        at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:69)
        at org.apache.flink.runtime.operators.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:78)
        at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:97)
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Channel to path '/data/3/peel/flink/tmp/flink-io-3a97d62c-ada4-44f1-ab72-dd018386f9aa/79305169d69f7e8b361a175af87353fa.channel' could not be opened.
        at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.<init>(AbstractFileIOChannel.java:61)
        at org.apache.flink.runtime.io.disk.iomanager.AsynchronousFileIOChannel.<init>(AsynchronousFileIOChannel.java:86)
        at org.apache.flink.runtime.io.disk.iomanager.AsynchronousBufferFileWriter.<init>(AsynchronousBufferFileWriter.java:31)
        at org.apache.flink.runtime.io.disk.iomanager.IOManagerAsync.createBufferFileWriter(IOManagerAsync.java:257)
        at org.apache.flink.runtime.io.network.partition.SpillableSubpartition.releaseMemory(SpillableSubpartition.java:151)
        at org.apache.flink.runtime.io.network.partition.ResultPartition.releaseMemory(ResultPartition.java:366)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:159)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:133)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:92)
        at org.apache.flink.runtime.operators.shipping.OutputCollector.collect(OutputCollector.java:65)
        ... 6 more
    Caused by: java.io.FileNotFoundException: /data/3/peel/flink/tmp/flink-io-3a97d62c-ada4-44f1-ab72-dd018386f9aa/79305169d69f7e8b361a175af87353fa.channel (No such file or directory)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:124)
        at org.apache.flink.runtime.io.disk.iomanager.AbstractFileIOChannel.<init>(AbstractFileIOChannel.java:57)
        ... 15 more

Here is the related full jobmanager log. I can't figure out a solution.

Thank you and have a nice day.

--
Andrea Spina
Guest student at DIMA, TU Berlin
N. Tessera: 74598
MAT: 89369
Ingegneria Informatica [LM] (D.M. 270)
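P.S. For reference, the heuristic in the Flink configuration docs suggests sizing the network buffer pool at roughly #slots-per-TaskManager^2 * #TaskManagers * 4. A back-of-the-envelope sketch (the one-TaskManager-per-node, one-slot-per-core assumption is mine):

```python
# Heuristic from the Flink configuration docs for the network buffer pool:
#   numberOfBuffers ~= slots-per-TM^2 * #TMs * 4
# Assumption (mine): one TaskManager per node, one slot per core.
slots_per_tm = 16   # 16 cores per node
num_tms = 25        # 25-node cluster

suggested_buffers = slots_per_tm ** 2 * num_tms * 4
print(suggested_buffers)  # 25600
```

That gives 25600, so my configured 32768 already exceeds the heuristic, yet the pool still ran out for some tasks at parallelism 400.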
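P.S. One thing I am now checking on each worker is that every entry of taskmanager.tmp.dirs actually exists and is writable, since a missing or cleaned-up directory would produce exactly this kind of FileNotFoundException when a spill file is created. A minimal sketch (the helper name is mine, the paths are from the configuration above):

```python
import os

def unusable_tmp_dirs(tmp_dirs: str) -> list:
    """Return the entries of a colon-separated tmp-dir list that are
    missing or not writable on this node."""
    return [d for d in tmp_dirs.split(":")
            if not (os.path.isdir(d) and os.access(d, os.W_OK))]

# taskmanager.tmp.dirs value from my Flink configuration:
dirs = ("/data/1/peel/flink/tmp:/data/2/peel/flink/tmp:"
        "/data/3/peel/flink/tmp:/data/4/peel/flink/tmp")
print(unusable_tmp_dirs(dirs))  # anything listed here needs to be created/fixed
```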