Subject: Re: How to make Flink read all files in HDFS folder and do transformations on the data
From: Flavio Pompermaier <pompermaier@okkam.it>
To: user@flink.apache.org
Date: Sat, 7 May 2016 15:58:28 +0200

Sorry Palle,
I wrongly understood that you were trying to read a single JSON object per file... the solution suggested by Fabian is definitely the right one for your specific use case!

Best,
Flavio

On 7 May 2016 12:52, "Fabian Hueske" <fhueske@gmail.com> wrote:
> Hi Palle,
>
> you can recursively read all files in a folder as explained in the
> "Recursive Traversal of the Input Path Directory" section of the Data
> Source documentation [1].
>
> The easiest way to read line-wise JSON objects is to use
> ExecutionEnvironment.readTextFile(), which reads text files line by line
> as strings, followed by a mapper that uses a JSON parser (e.g., Jackson)
> to parse the JSON strings. You should use a RichMapFunction and create
> the parser in the open() method to avoid instantiating a new parser for
> each incoming line. After parsing, the RichMapFunction can emit POJOs.
>
> Cheers, Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#data-sources
>
> 2016-05-07 12:25 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>
>> I had the same issue :)
>> I resolved it by reading all file paths into a collection and then
>> using this code:
>>
>> env.fromCollection(filePaths).rebalance().map(file2pojo)
>>
>> You can have your dataset of POJOs!
>>
>> The rebalance() is necessary to exploit parallelism, otherwise the
>> pipeline will be executed with parallelism 1.
>>
>> Best,
>> Flavio
>>
>> On 7 May 2016 12:13, "Palle" <palle@sport.dk> wrote:
>>
>> Hi there.
>>
>> I've got an HDFS folder containing a lot of files. All files contain a
>> lot of JSON objects, one per line. I will have several TB in the HDFS
>> folder.
>>
>> My plan is to make Flink read all files and all JSON objects and then
>> do some analysis on the data, actually very similar to the
>> flatMap/groupBy/reduceGroup transformations that are done in the
>> WordCount example.
>>
>> But I am a bit stuck, because I cannot seem to find out how to make
>> Flink read all files in an HDFS dir and then perform the transformations
>> on the data. I have googled quite a bit and also looked in the Flink API
>> and mailing list history.
>>
>> Can anyone point me to an example where Flink is used to read all files
>> in an HDFS folder and then do transformations on the data?
>>
>> - and a second question: Is there an elegant way to make Flink handle
>> the JSON objects? Can they be converted to POJOs by something similar to
>> the pojoType() method?
>>
>> /Palle
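Fabian's recipe above (recursive traversal of the input directory, readTextFile(), and a RichMapFunction that builds its Jackson parser once in open()) could be sketched roughly like this. This is a minimal sketch, not code from the thread: the `Event` POJO, its fields, and the HDFS path are made-up placeholders, and it assumes the Flink DataSet API and Jackson are on the classpath:

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

import com.fasterxml.jackson.databind.ObjectMapper;

public class ReadJsonLines {

    // Hypothetical POJO for one JSON object per line; adjust to your schema.
    public static class Event {
        public String id;
        public long timestamp;
    }

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // "Recursive Traversal of the Input Path Directory" from the Data
        // Source docs [1]: also enumerate files in nested subdirectories.
        Configuration parameters = new Configuration();
        parameters.setBoolean("recursive.file.enumeration", true);

        DataSet<String> lines = env
                .readTextFile("hdfs:///data/events")   // placeholder path
                .withParameters(parameters);

        // One Jackson ObjectMapper per parallel task, created once in open(),
        // instead of one per incoming line.
        DataSet<Event> events = lines.map(new RichMapFunction<String, Event>() {
            private transient ObjectMapper mapper;

            @Override
            public void open(Configuration config) {
                mapper = new ObjectMapper();
            }

            @Override
            public Event map(String line) throws Exception {
                return mapper.readValue(line, Event.class);
            }
        });

        // From here, flatMap/groupBy/reduceGroup as in the WordCount example.
        events.print();
    }
}
```

From here the `events` dataset behaves like any other POJO dataset, so the WordCount-style transformations Palle mentions apply directly.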
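Flavio's fromCollection(filePaths).rebalance().map(file2pojo) trick could look something like the following sketch. The details are assumptions, not code from the thread: the paths are listed with java.nio from a placeholder local directory (on a real cluster you would list HDFS paths with the HDFS client and read them via Flink's FileSystem API from workers), `MyPojo` is a hypothetical target type, and the "file2pojo" step is written out as an anonymous MapFunction:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadFilesAsPojos {

    // Hypothetical POJO target; here it just carries the raw file content.
    public static class MyPojo {
        public String content;
    }

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 1) Collect all file paths up front into a plain collection.
        List<String> filePaths;
        try (Stream<Path> stream = Files.list(Paths.get("/data/json-files"))) { // placeholder dir
            filePaths = stream.map(Path::toString).collect(Collectors.toList());
        }

        // 2) fromCollection + rebalance, so the per-file work is spread across
        //    all task slots instead of running with parallelism 1.
        DataSet<MyPojo> pojos = env
                .fromCollection(filePaths)
                .rebalance()
                .map(new MapFunction<String, MyPojo>() {   // the "file2pojo" step
                    @Override
                    public MyPojo map(String path) throws Exception {
                        MyPojo pojo = new MyPojo();
                        // Placeholder "parsing": real code would hand these
                        // bytes to a JSON parser such as Jackson.
                        pojo.content = new String(
                                Files.readAllBytes(Paths.get(path)), "UTF-8");
                        return pojo;
                    }
                });

        pojos.print();
    }
}
```

As Flavio notes, this pattern fits the one-JSON-object-per-file case; for line-wise JSON objects, Fabian's readTextFile() approach is the better fit.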