flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Ingestion of data into HDFS
Date Fri, 22 May 2015 16:44:00 GMT
Hi Stephan and Fabian,
I'll try to make it clearer :)
In my use case I have a webapp that receives a series of row sets, one after
the other (there are also a start event and an end event that mark when the
process starts and ends).
At every request, the batch of rows is translated into another batch of rows
(i.e., what in Flink would be a flatMap) that I want to store in HDFS or a
local FS (this part is outside Flink, but I need to decide which format to
use; I imagine Tuples could be fine, so I could write the lines of each batch
to a file via TypeSerializerInputFormat/TypeSerializerOutputFormat).
Those files will contain duplicated tuples, so once this first process
finishes, I need to read all those files and save them into a Parquet directory.
What I'd like to know is, for example, how I can generate a new file for each
batch (I was thinking of using UUIDs), and whether there's anything in Flink
to manage such an append-like mechanism and the compaction that follows.
Also, the ability to run distinct() on tuples containing null values would be
a very nice improvement to Flink.
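The UUID-per-batch naming described above could be sketched outside Flink with plain java.nio file IO; the class and directory names here are illustrative, not part of any Flink API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.UUID;
import java.util.stream.Stream;

public class BatchWriter {
    // Write one batch of serialized rows to a uniquely named file in the
    // staging directory, so concurrent batches never collide on file names.
    static Path writeBatch(Path stagingDir, List<String> rows) throws IOException {
        Files.createDirectories(stagingDir);
        Path batchFile = stagingDir.resolve("batch-" + UUID.randomUUID() + ".dat");
        Files.write(batchFile, rows);
        return batchFile;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staging");
        Path f1 = writeBatch(dir, List.of("a,1", "b,2"));
        Path f2 = writeBatch(dir, List.of("b,2", "c,3"));
        System.out.println(!f1.equals(f2)); // prints "true": names never collide
        try (Stream<Path> s = Files.list(dir)) {
            System.out.println(s.count());  // prints "2": one file per batch
        }
    }
}
```

Random UUIDs avoid any coordination between concurrent requests; the later compaction pass simply reads every file in the staging directory.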

Best,
Flavio

On Fri, May 22, 2015 at 6:29 PM, Stephan Ewen <sewen@apache.org> wrote:

> If you want to trigger two Flink jobs one after the other, you can simply
> do this in one program.
>
> Since the "env.execute()" call blocks, the second program starts after the
> first.
>
> -----
>
> ExecutionEnvironment env1 = ExecutionEnvironment.getExecutionEnvironment();
>
> // program 1
>
> env1.execute();
>
>
> ExecutionEnvironment env2 = ExecutionEnvironment.getExecutionEnvironment();
>
> // program 2
>
> env2.execute();
>
>
>
>
> On Fri, May 22, 2015 at 6:21 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> I'm not sure if I got your question right.
>>
>> Do you want to know if it is possible to implement a Flink program that
>> reads several files and writes their data into a Parquet format?
>> Or are you asking how such a job could be scheduled for execution based
>> on some external event (such as a file appearing)?
>>
>> Both should be possible.
>>
>> The job would be a simple pipeline, with or without transformations
>> depending on the required logic, ending in a Parquet data sink.
>> The job execution can be triggered from outside of Flink, for example by
>> a monitoring process or a cron job that calls the CLI client with the
>> right parameters.
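As one possible shape for the cron approach Fabian mentions, a crontab entry could invoke the Flink CLI client on a schedule; the jar path, entry class, and arguments below are hypothetical placeholders, not from this thread:

```shell
# Run the compaction job every night at 02:00 via the Flink CLI client.
# /path/to/compaction-job.jar, com.example.CompactionJob, and the HDFS
# paths are illustrative only.
0 2 * * *  /opt/flink/bin/flink run -c com.example.CompactionJob \
             /path/to/compaction-job.jar --input hdfs:///staging --output hdfs:///parquet
```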
>>
>> Best, Fabian
>>
>>
>>
>> 2015-05-22 14:55 GMT+02:00 Flavio Pompermaier <pompermaier@okkam.it>:
>>
>>> Hi to all,
>>>
>>> in my use case I have bursts of data to store into HDFS and, once they
>>> are finished, to compact into a single directory (as Parquet). From what I
>>> know, the usual approach is to use Flume, which automatically ingests data
>>> and compacts it based on some configurable policy.
>>> However, I'd like to avoid adding Flume to my architecture because these
>>> bursts are not long-lived processes. I just want to write each batch of
>>> rows as a single file in some directory and, once the process finishes,
>>> read all of the files and compact them into a single output directory as
>>> Parquet.
>>> It's similar to a streaming process, but (for the moment) I'd like to
>>> avoid having a long-lived Flink process listening for incoming data.
>>>
>>> Do you have any suggestions for such a process, or is there an example in
>>> the Flink code?
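The compaction step described above (read every batch file, drop duplicates, write one output) can be sketched without Flink in plain Java; a text file stands in for the Parquet sink here, and all names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Stream;

public class Compactor {
    // Read every batch file in the staging directory, drop duplicate rows,
    // and write the distinct rows to a single output file (a stand-in for
    // a Parquet sink in this sketch).
    static long compact(Path stagingDir, Path output) throws IOException {
        TreeSet<String> distinct = new TreeSet<>();
        try (Stream<Path> files = Files.list(stagingDir)) {
            for (Path f : (Iterable<Path>) files::iterator) {
                distinct.addAll(Files.readAllLines(f));
            }
        }
        Files.write(output, distinct);
        return distinct.size();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staging");
        Files.write(dir.resolve("batch-1.dat"), List.of("a,1", "b,2"));
        Files.write(dir.resolve("batch-2.dat"), List.of("b,2", "c,3"));
        Path out = dir.resolve("compacted.txt");
        System.out.println(compact(dir, out)); // prints "3": "b,2" deduplicated
    }
}
```

A Flink batch job would express the same thing as a file source, distinct(), and a Parquet sink; this plain-Java version only illustrates the ingest-then-compact shape of the problem.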
>>>
>>>
>>> Best,
>>> Flavio
>>>
>>
>>
>
