flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Kostas Kloudas <k.klou...@data-artisans.com>
Subject Re: stream keyBy without repartition
Date Tue, 24 May 2016 15:45:00 GMT
Hi Bart,

From what I understand, you want to do a partial (per node) aggregation before shipping the
for the final one at the end. In addition, the keys do not seem to change between aggregations,

If this is the case, this is the functionality of the Combiner in batch. 
In Batch (DataSet API) this is supported, but in Streaming it is not.

If your main concern is optimizing your already up-and-running job, it would be worth sharing
your code
(or an example with the same characteristics / communication patterns if the real code is
not possible)
so that we can have a look and potentially find other parts of the pipeline that can be optimized.

For example, given that you are concerned with the serialization overhead, it may be worth

seeing if there are better alternatives to use.


> On May 24, 2016, at 4:22 PM, Bart Wyatt <bart.wyatt@dsvolition.com> wrote:
> (migrated from IRC)
> Hello All,
> My situation is this: 
> I have a large amount of data partitioned in kafka by "session" (natural partitioning).
 After I read the data, I would like to do as much as possible before incurring re-serialization
or network traffic due to the size of the data.  I am on 1.0.3 in the java API.
> What I'd like to do is:
> while maintaining the natural partitioning (so that a single thread can perform this)
read data from kafka, perform a window'd fold over the incoming data keyed by a _different_
field("key") then take the product of that window'd fold and allow re-partitioning to colocate
data with equivalent keys in a new partitioning scheme where they can be reduced into a final
product.  The hope is that the products of such a windowed fold are orders of magnitude smaller
than the data that would be serialized/sent if we re-partitioned before the window'd fold.
> Is there a way to .keyBy(...) such that it will act within the physical partitioning
of the data and not force a  re-partitioning of the data by that key?  
> thanks
> -Bart
CONFIDENTIAL COMMUNICATION intended solely for the recipient and, therefore, may not be retransmitted
to any party outside of the recipient's organization without the prior written consent of
the sender. If you have received this e-mail in error please notify the sender immediately
by telephone or reply e-mail and destroy the original message without making a copy. Deep
Silver, Inc. accepts no liability for any losses or damages resulting from infected e-mail
transmissions and viruses in e-mail attachments.

View raw message