flink-user mailing list archives

From Kostas Kloudas <k.klou...@data-artisans.com>
Subject Re: Dynamic partitioning for stream output
Date Tue, 24 May 2016 12:50:32 GMT
Hi Juho,

If I understand correctly, you want a custom RollingSink that keeps one bucket
per topic/date key and flushes a bucket to disk whenever its buffered data
exceeds a limit, right?

If this is the case, then you are right that this is not currently supported
out of the box, but it would be interesting to update the RollingSink
to support such scenarios.
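To make the idea concrete, the per-key buffering could look roughly like this. This is a plain-Java sketch of the buffering logic only; `PartitionedBuffer` and its methods are made-up names for illustration, not the RollingSink API or any existing Flink interface:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: records are grouped by a partition key (e.g. "topic2/20160522")
// and a key's buffer is flushed only when it reaches maxBufferedRecords,
// so an interleaved mix of topics does not force a flush on every record.
public class PartitionedBuffer {
    private final int maxBufferedRecords;
    private final Map<String, List<String>> buffers = new HashMap<>();
    private final List<String> flushedKeys = new ArrayList<>();

    public PartitionedBuffer(int maxBufferedRecords) {
        this.maxBufferedRecords = maxBufferedRecords;
    }

    // In a real sink this would be called from invoke(); the key would
    // come from a user-supplied partitioning function over the record.
    public void add(String key, String record) {
        List<String> buf = buffers.computeIfAbsent(key, k -> new ArrayList<>());
        buf.add(record);
        if (buf.size() >= maxBufferedRecords) {
            flush(key, buf);
        }
    }

    private void flush(String key, List<String> buf) {
        // A real implementation would write buf under s3://bucket/path/<key>/
        // here; this sketch just records that the key was flushed.
        flushedKeys.add(key);
        buf.clear();
    }

    public List<String> flushedKeys() {
        return flushedKeys;
    }

    public int buffered(String key) {
        List<String> buf = buffers.get(key);
        return buf == null ? 0 : buf.size();
    }
}
```

The point of the per-key map is that buckets fill independently, so one topic hitting its limit does not flush the others.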

One clarification: when you say that you want to partition by date,
you mean the date of the event, right? Not the processing time.
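For what it's worth, a per-record partitioning function of the kind proposed in the quoted mail below could look roughly like this. All names here are illustrative assumptions, not an existing Flink API; it maps a record's Kafka topic and event timestamp to the bucket path layout from the example:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: derive the bucket path "topic=<topic>/date=<yyyyMMdd>" from
// the record's topic and its event-time timestamp (not processing time).
public class EventTimePartitioner {
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    public static String bucketPath(String topic, long eventTimeMillis) {
        String date = DATE.format(Instant.ofEpochMilli(eventTimeMillis));
        return "topic=" + topic + "/date=" + date;
    }
}
```

A sink could call such a function on every record to choose which bucket to append to.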

Kostas

> On May 24, 2016, at 1:22 PM, Juho Autio <juho.autio@rovio.com> wrote:
> 
> Could you suggest how to dynamically partition data with Flink streaming?
> 
> We've looked at RollingSink, that takes care of writing batches to S3, but
> it doesn't allow defining the partition dynamically based on the tuple
> fields.
> 
> Our data is coming from Kafka and essentially has the kafka topic and a
> date, among other fields.
> 
> We'd like to consume all topics (also automatically subscribe to new ones)
> and write to S3 partitioned by topic and date, for example:
> 
> s3://bucket/path/topic=topic2/date=20160522/
> s3://bucket/path/topic=topic2/date=20160523/
> s3://bucket/path/topic=topic1/date=20160522/
> s3://bucket/path/topic=topic1/date=20160523/
> 
> There are two problems with RollingSink as it is now:
> - Only allows partitioning by date
> - Flushes the batch every time the path changes. In our case the stream can
> contain a random mix of different topics, which means RollingSink can't
> respect the max flush size and ends up flushing the files on pretty much
> every tuple.
> 
> We've thought that we could implement a sink that internally creates and
> handles multiple RollingSink instances as needed for partitions. But it
> would be great to first hear any suggestions that you might have.
> 
> If we have to extend RollingSink, it would be nice to make it take a
> partitioning function as a parameter. The function would be called for each
> tuple to create the output path.
> 
> 
> 
> --
> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Dynamic-partitioning-for-stream-output-tp7122.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at Nabble.com.

