flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Read once input data?
Date Tue, 16 Feb 2016 11:28:59 GMT
I also have a couple of use cases where the pin data sets in memory feature
would help a lot ;)

On Mon, Feb 15, 2016 at 10:18 PM, Saliya Ekanayake <esaliya@gmail.com>
wrote:

> Thanks, I'll check this.
>
> Saliya
>
> On Mon, Feb 15, 2016 at 4:08 PM, Fabian Hueske <fhueske@gmail.com> wrote:
>
>> I would have a look at the example programs in our code base:
>>
>>
>> https://github.com/apache/flink/tree/master/flink-examples/flink-examples-batch/src/main/java/org/apache/flink/examples/java
>>
>> Best, Fabian
>>
>> 2016-02-15 22:03 GMT+01:00 Saliya Ekanayake <esaliya@gmail.com>:
>>
>>> Thank you, Fabian.
>>>
>>> Any chance you might have an example on how to define a data flow with
>>> Flink?
>>>
>>>
>>>
>>> On Mon, Feb 15, 2016 at 3:58 PM, Fabian Hueske <fhueske@gmail.com>
>>> wrote:
>>>
>>>> It is not possible to "pin" data sets in memory, yet.
>>>> However, you can stream the same data set through two different mappers
>>>> at the same time.
>>>>
>>>> For instance you can have a job like:
>>>>
>>>>                  /---> Map 1 --> SInk1
>>>> Source --<
>>>>                  \---> Map 2 --> SInk2
>>>>
>>>> and execute it at once.
>>>> For that you define you data flow and call execute once after all sinks
>>>> have been created.
>>>>
>>>> Best, Fabian
>>>>
>>>> 2016-02-15 21:32 GMT+01:00 Saliya Ekanayake <esaliya@gmail.com>:
>>>>
>>>>> Fabian,
>>>>>
>>>>> count() was just an example. What I would like to do is say run two
>>>>> map operations on the dataset (ds). Each map will have it's own reduction,
>>>>> so is there a way to avoid creating two jobs for such scenario?
>>>>>
>>>>> The reason is, reading these binary matrices are expensive. In our
>>>>> current MPI implementation, I am using memory maps for faster loading
and
>>>>> reuse.
>>>>>
>>>>> Thank you,
>>>>> Saliya
>>>>>
>>>>> On Mon, Feb 15, 2016 at 3:15 PM, Fabian Hueske <fhueske@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> it looks like you are executing two distinct Flink jobs.
>>>>>> DataSet.count() triggers the execution of a new job. If you have
an
>>>>>> execute() call in your program, this will lead to two Flink jobs
being
>>>>>> executed.
>>>>>> It is not possible to share state among these jobs.
>>>>>>
>>>>>> Maybe you should add a custom count implementation (using a
>>>>>> ReduceFunction) which is executed in the same program as the other
>>>>>> ReduceFunction.
>>>>>>
>>>>>> Best, Fabian
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2016-02-15 21:05 GMT+01:00 Saliya Ekanayake <esaliya@gmail.com>:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I see that an InputFormat's open() and nextRecord() methods get
>>>>>>> called for each terminal operation on a given dataset using that
particular
>>>>>>> InputFormat. Is it possible to avoid this - possibly using some
caching
>>>>>>> technique in Flink?
>>>>>>>
>>>>>>> For example, I've some code like below and I see for both the
last
>>>>>>> two statements (reduce() and count()) the above methods in the
input format
>>>>>>> get called. Btw. this is a custom input format I wrote to represent
a
>>>>>>> binary matrix stored as Short values.
>>>>>>>
>>>>>>> ShortMatrixInputFormat smif = new ShortMatrixInputFormat();
>>>>>>>
>>>>>>> DataSet<Short[]> ds = env.createInput(smif, BasicArrayTypeInfo.SHORT_ARRAY_TYPE_INFO);
>>>>>>>
>>>>>>> MapOperator<Short[], DoubleStatistics> op = ds.map(...)
>>>>>>>
>>>>>>> *op.reduce(...)*
>>>>>>>
>>>>>>> *op.count(...)*
>>>>>>>
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Saliya
>>>>>>> --
>>>>>>> Saliya Ekanayake
>>>>>>> Ph.D. Candidate | Research Assistant
>>>>>>> School of Informatics and Computing | Digital Science Center
>>>>>>> Indiana University, Bloomington
>>>>>>> Cell 812-391-4914
>>>>>>> http://saliya.org
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>> Cell 812-391-4914
>>>>> http://saliya.org
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>> Cell 812-391-4914
>>> http://saliya.org
>>>
>>
>>
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>

Mime
View raw message