From: Maximilian Michels
Date: Tue, 26 May 2015 19:30:30 +0200
Subject: Re: Recursive directory reading error
To: "user@flink.apache.org"
Reply-To: user@flink.apache.org
Mailing-List: contact user-help@flink.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a11c3728ee6d7750516ff7c77

--001a11c3728ee6d7750516ff7c77
Content-Type: text/plain; charset=UTF-8

Pushed a fix to the master and will open a PR to programmatically fix this.

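The one-line change discussed in the quoted thread can be illustrated in isolation. What follows is a minimal sketch under stated assumptions, not Flink's actual code: it uses java.nio.file instead of Flink's FileSystem, the names NestedLengthSketch, addNestedFilesBuggy and addNestedFilesFixed are hypothetical, and the `length` and `logExcludedFiles` parameters of the real call are dropped so that each call simply returns the bytes found beneath its directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the recursive enumeration discussed in the thread.
public class NestedLengthSketch {

    // Buggy variant: nested files are collected, but their length is discarded.
    static long addNestedFilesBuggy(Path dir, List<Path> files) throws IOException {
        long length = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    addNestedFilesBuggy(entry, files); // return value is dropped
                } else {
                    files.add(entry);
                    length += Files.size(entry);
                }
            }
        }
        return length;
    }

    // Fixed variant: the length found in each subdirectory is accumulated.
    static long addNestedFilesFixed(Path dir, List<Path> files) throws IOException {
        long length = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry)) {
                    length += addNestedFilesFixed(entry, files);
                } else {
                    files.add(entry);
                    length += Files.size(entry);
                }
            }
        }
        return length;
    }

    public static void main(String[] args) throws IOException {
        // Recreate the layout from the report: files only under <root>/A/B.
        Path root = Files.createTempDirectory("myDir");
        Path nested = Files.createDirectories(root.resolve("A").resolve("B"));
        Files.write(nested.resolve("chunk-1.txt"), "hello".getBytes(StandardCharsets.UTF_8));
        Files.write(nested.resolve("chunk-2.txt"), "world!".getBytes(StandardCharsets.UTF_8));

        List<Path> buggyFiles = new ArrayList<>();
        List<Path> fixedFiles = new ArrayList<>();
        System.out.println("buggy total: " + addNestedFilesBuggy(root, buggyFiles)
                + " for " + buggyFiles.size() + " files");
        System.out.println("fixed total: " + addNestedFilesFixed(root, fixedFiles)
                + " for " + fixedFiles.size() + " files");
    }
}
```

With files only in a nested directory, the buggy variant reports a total length of 0 even though both files are collected, which is the situation the thread identifies as the trigger for the runaway split creation. The fixed variant reports the combined size of 11 bytes.
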
On Tue, May 26, 2015 at 4:22 PM, Flavio Pompermaier wrote:

> Yeah, that definitely solves the problem! Could you make a PR to fix
> that?
>
> Thank you in advance,
> Flavio
>
> On Tue, May 26, 2015 at 3:20 PM, Maximilian Michels wrote:
>
>> Yes, there is a loop to recursively search for files in a directory, but
>> that should be ok. The code fails when adding a new InputSplit to an
>> ArrayList. This is a standard operation.
>>
>> Oh, I think I found a bug in `addNestedFiles`. It does not pick up the
>> length of the recursively found files in line 546. That can result in a
>> returned size of 0, which causes infinitely many InputSplits to be
>> created and added to the aforementioned ArrayList. Can you change
>>
>> addNestedFiles(dir.getPath(), files, length, logExcludedFiles);
>>
>> to
>>
>> length += addNestedFiles(dir.getPath(), files, length, logExcludedFiles);
>>
>> ?
>>
>> On Tue, May 26, 2015 at 2:21 PM, Flavio Pompermaier wrote:
>>
>>> I have 10 files. I debugged the code and it seems that there's a loop in
>>> the FileInputFormat when files are nested far away from the root directory
>>> of the scan.
>>>
>>> On Tue, May 26, 2015 at 2:14 PM, Robert Metzger wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>> how many files are in the directory?
>>>> You can count with "find /tmp/myDir | wc -l"
>>>>
>>>> Flink running out of memory while creating input splits indicates to me
>>>> that there are a lot of files in there.
>>>>
>>>> On Tue, May 26, 2015 at 2:10 PM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>>
>>>>> I'm trying to recursively read a directory but it seems that the
>>>>> totalLength value in FileInputFormat.createInputSplits() is not
>>>>> computed correctly.
>>>>>
>>>>> I have files organized as:
>>>>>
>>>>> /tmp/myDir/A/B/cunk-1.txt
>>>>> /tmp/myDir/A/B/cunk-2.txt
>>>>> ..
>>>>>
>>>>> If I try to do the following:
>>>>>
>>>>> Configuration parameters = new Configuration();
>>>>> parameters.setBoolean("recursive.file.enumeration", true);
>>>>>
>>>>> env.readTextFile("file:////tmp/myDir").withParameters(parameters).print();
>>>>>
>>>>> I get:
>>>>>
>>>>> Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Java heap space
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:162)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:471)
>>>>>     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:515)
>>>>>     ... 19 more
>>>>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>>>>     at java.util.Arrays.copyOf(Arrays.java:2219)
>>>>>     at java.util.ArrayList.grow(ArrayList.java:242)
>>>>>     at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
>>>>>     at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
>>>>>     at java.util.ArrayList.add(ArrayList.java:440)
>>>>>     at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:503)
>>>>>     at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:51)
>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:146)
>>>>>
>>>>> Am I doing something wrong or is it a bug?
>>>>>
>>>>> Best,
>>>>> Flavio

--001a11c3728ee6d7750516ff7c77
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a11c3728ee6d7750516ff7c77--