flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Fabian Hueske <fhue...@gmail.com>
Subject Re: Runtime generated (source) datasets
Date Wed, 21 Jan 2015 13:41:52 GMT
The program is compiled when the ExecutionEnvironment.execute() method is
called. At that moment, theEexecutionEnvironment collects all data sources
that were previously created and traverses them towards connected data
sinks. All sinks that are found this way are remembered and treated as
execution targets. The sinks and all connected operators and data sources
are given to the optimizer which analyses the plan, compiles an execution
plan, and submits the plan to the execution system which the
ExecutionEnvironment refers to (local, remote, ...).

Therefore, your code can build arbitrary data flows with as many source as
you like. Once you call ExecutionEnvironment.execute() all data sources and
operators which are required to compute the result of all data sinks are

2015-01-21 14:26 GMT+01:00 Flavio Pompermaier <pompermaier@okkam.it>:

> Great! Could you explain me a little bit the internals of how and when
> Flink will generate the plan and how the execution environment is involved
> in this phase?
> Just to better understand this step!
> Thanks again,
> Flavio
> On Wed, Jan 21, 2015 at 2:14 PM, Till Rohrmann <trohrmann@apache.org>
> wrote:
>> Yes this will also work. You only have to make sure that the list of data
>> sets is processed properly later on in your code.
>> On Wed, Jan 21, 2015 at 2:09 PM, Flavio Pompermaier <pompermaier@okkam.it
>> > wrote:
>>> Hi Till,
>>> thanks for the reply. However my problem is that I'll have something
>>> like:
>>> List<Dataset<<ElementType>>  getInput(String[] args,
>>> ExecutionEnvironment env) {....}
>>> So I don't know in advance how many of them I'll have at runtime. Does
>>> it still work?
>>> On Wed, Jan 21, 2015 at 1:55 PM, Till Rohrmann <trohrmann@apache.org>
>>> wrote:
>>>> Hi Flavio,
>>>> if your question was whether you can write a Flink job which can read
>>>> input from different sources, depending on the user input, then the answer
>>>> is yes. The Flink job plans are actually generated at runtime so that you
>>>> can easily write a method which generates a user dependent input/data set.
>>>> You could do something like this:
>>>> DataSet<ElementType> getInput(String[] args, ExecutionEnvironment env)
>>>>   if(args[0] == csv) {
>>>>     return env.readCsvFile(...);
>>>>   } else {
>>>>     return env.createInput(new AvroInputFormat<ElementType>(...));
>>>>   }
>>>> }
>>>> as long as the element type of the data set are all equal for all
>>>> possible data sources. I hope that I understood your problem correctly.
>>>> Greets,
>>>> Till
>>>> On Wed, Jan 21, 2015 at 11:45 AM, Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>> Hi guys,
>>>>> I have a big question for you about how Fling handles job's plan
>>>>> generation:
>>>>> let's suppose that I want to write a job that takes as input a
>>>>> description of a set of datasets that I want to work on (for example
a csv
>>>>> file and its path, 2 hbase tables, 1 parquet directory and its path,
>>>>> From what I know Flink generates the job's plan at compile time, so I
>>>>> was wondering whether this is possible right now or not..
>>>>> Thanks in advance,
>>>>> Flavio

View raw message