flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Newport, Billy" <Billy.Newp...@gs.com>
Subject RE: Collector.collect
Date Mon, 01 May 2017 12:52:21 GMT
We’ve done that but it’s very expensive from a serialization point of view when writing
the same record multiple times, each in a different tuple.

For example, we started with this:

.collect(new Tuple<Short, GenericRecord)).

The record would be written with short = 0 and again with short = 1. This results in the GenericRecord
being serialized twice. You also prolly need filters on the output dataset which is expensive

We switched instead to a bitmask. Now, we write the record once and set bits in the short
for each file the record needs to be written to. Our next step is to write records to a file
based on the short. We wrote a new outputrecordformat which checks the bits in the short and
writes the GenericRecord to each file for the corresponding bit. This means no filter to split
the records for each file and this is much faster.

We’re finding a need to do this kind of optimization pretty frequently with flink.

From: Gaurav Khandelwal [mailto:gaurav671989@gmail.com]
Sent: Saturday, April 29, 2017 4:32 AM
To: user@flink.apache.org
Subject: Collector.collect


I am working on RichProcessFunction and I want to emit multiple records at a time. To achieve
this, I am currently doing :

   Collector.collect(new Tuple<>...);

I was wondering, is this the correct way or there is any other alternative.

View raw message