flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Theodore Vasiloudis <theodoros.vasilou...@gmail.com>
Subject Re: Having a single copy of an object read in a RichMapFunction
Date Fri, 05 Aug 2016 16:36:29 GMT
Yes this is a streaming use case, so broadcast is not an option.

If I get it correctly with connected streams I would emulate side input by
"streaming" the matrix with a key that all incoming vector records match on?

Wouldn't that create multiple copies of the matrix in memory?

On Thu, Aug 4, 2016 at 6:56 PM, Sameer W <sameer@axiomine.com> wrote:

> Theodore,
> Broadcast variables do that when using the DataSet API -
> http://data-artisans.com/how-to-factorize-a-700-gb-
> matrix-with-apache-flink/
> See the following lines in the article-
> To support the above presented algorithm efficiently we had to improve
> Flinkā€™s broadcasting mechanism since it easily becomes the bottleneck of
> the implementation. The enhanced Flink version can share broadcast
> variables among multiple tasks running on the same machine. *Sharing
> avoids having to keep for each task an individual copy of the broadcasted
> variable on the heap. This increases the memory efficiency significantly,
> especially if the broadcasted variables can grow up to several GBs of size.*
> If you are using in the DataStream API then side-inputs (not yet
> implemented) would achieve the same as broadcast variables.  (
> https://docs.google.com/document/d/1hIgxi2Zchww_5fWUHLoYiXwSBXjv-M5eOv-
> MKQYN3m4/edit#) . I use keyed Connected Streams in situation where I need
> them for one of my use-cases (propagating rule changes to the data) where I
> could have used side-inputs.
> Sameer
> On Thu, Aug 4, 2016 at 8:56 PM, Theodore Vasiloudis <
> theodoros.vasiloudis@gmail.com> wrote:
>> Hello all,
>> for a prototype we are looking into we would like to read a big matrix
>> from HDFS, and for every element that comes in a stream of vectors do on
>> multiplication with the matrix. The matrix should fit in the memory of one
>> machine.
>> We can read in the matrix using a RichMapFunction, but that would mean
>> that a copy of the matrix is made for each Task Slot AFAIK, if the
>> RichMapFunction is instantiated once per Task Slot.
>> So I'm wondering how should we try address this problem, is it possible
>> to have just one copy of the object in memory per TM?
>> As a follow-up if we have more than one TM per node, is it possible to
>> share memory between them? My guess is that we have to look at some
>> external store for that.
>> Cheers,
>> Theo

View raw message