flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Aggarwal, Ajay" <Ajay.Aggar...@netapp.com>
Subject Re: stream of large objects
Date Mon, 11 Feb 2019 17:39:42 GMT
I looked a little into broadcast state and while its interesting I don’t think it will help
me. Since broadcast state is kept all in-memory, I am worried about memory requirement if
I make all these LargeMessages part of broadcast state. Furthermore these LargeMessages need
to be processed in a Keyed context, so sharing all of these across all downstream tasks does
not seem efficient.

From: Chesnay Schepler <chesnay@apache.org>
Date: Sunday, February 10, 2019 at 4:57 AM
To: "Aggarwal, Ajay" <Ajay.Aggarwal@netapp.com>, "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: stream of large objects

NetApp Security WARNING: This is an external email. Do not click links or open attachments
unless you recognize the sender and know the content is safe.

The Broadcast State<https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/state/broadcast_state.html#the-broadcast-state-pattern>
may be interesting to you.

On 08.02.2019 15:57, Aggarwal, Ajay wrote:
Yes, another KeyBy will be used. The “small size” messages will be strings of length 500
to 1000.

Is there a concept of “global” state in flink? Is it possible to keep these lists in global
state and only pass the list reference (by name?) in the LargeMessage?

From: Chesnay Schepler <chesnay@apache.org><mailto:chesnay@apache.org>
Date: Friday, February 8, 2019 at 8:45 AM
To: "Aggarwal, Ajay" <Ajay.Aggarwal@netapp.com><mailto:Ajay.Aggarwal@netapp.com>,
"user@flink.apache.org"<mailto:user@flink.apache.org> <user@flink.apache.org><mailto:user@flink.apache.org>
Subject: Re: stream of large objects

Whether a LargeMessage is serialized depends on how the job is structured.
For example, if you were to only apply map/filter functions after the aggregation it is likely
they wouldn't be serialized.
If you were to apply another keyBy they will be serialized again.

When you say "small size" messages, what are we talking about here?

On 07.02.2019 20:37, Aggarwal, Ajay wrote:
In my use case my source stream contain small size messages, but as part of flink processing
I will be aggregating them into large messages and further processing will happen on these
large messages. The structure of this large message will be something like this:

   Class LargeMessage {
        String key
       List <String> messages; // this is where the aggregation of smaller messages

In some cases this list field of LargeMessage can get very large (1000’s of messages). Is
it ok to create an intermediate stream of these LargeMessages? What should I be concerned
about while designing the flink job? Specifically with parallelism in mind. As these LargeMessages
flow from one flink subtask to another, do they get serialized/deserialized ?


View raw message