flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Matthias J. Sax" <mj...@apache.org>
Subject Re: How to create a stream of data batches
Date Fri, 04 Sep 2015 11:12:33 GMT
Hi Andres,

you could do this by using your own data type, for example

> public class MyBatch {
>   private ArrayList<MyTupleType> data = new ArrayList<MyTupleType>
> }

In the DataSource, you need to specify your own InputFormat that reads
multiple tuples into a batch and emits the whole batch at once.

However, be aware, that this POJO type hides the batch nature from
Flink, ie, Flink does not know anything about the tuples in the batch.
To Flink a batch is a single tuple. If you want to perform key-based
operations, this might become a problem.


On 09/04/2015 01:00 PM, Andres R. Masegosa  wrote:
> Hi,
> I'm trying to code some machine learning algorithms on top of flink such
> as a variational Bayes learning algorithms. Instead of working at a data
> element level (i.e. using map transformations), it would be far more
> efficient to work at a "batch of elements" levels (i.e. I get a batch of
> elements and I produce some output).
> I could code that using "mapPartition" function. But I can not control
> the size of the partition, isn't?
> Is there any way to transform a stream (or DataSet) of elements in a
> stream (or DataSet) of data batches with the same size?
> Thanks for your support,
> Andres

View raw message