Subject: Re: Dynamic partitioning for stream output
From: Juho Autio <juho.autio@rovio.com>
To: user@flink.apache.org
Date: Wed, 25 May 2016 09:40:18 +0300

Related issue: https://issues.apache.org/jira/browse/FLINK-2672
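To make the plan quoted below more concrete, here is a minimal Java sketch
of the kind of wrapper sink we have in mind. To be clear, this is only a
sketch under assumptions: every name in it is hypothetical, it uses no Flink
or S3 APIs, and it ignores fault tolerance and checkpointing entirely.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: one in-memory buffer is kept per dynamic bucket
 * path; a bucket is flushed when it holds too many elements or has been
 * open for too long. Not Flink API, not fault tolerant.
 */
public class MultiBucketSink<T> {

    /** Called per element to build its bucket path, e.g. "topic=topic1/date=20160522". */
    public interface BucketPathFormatter<T> {
        String format(T element);
    }

    private static final class Bucket<T> {
        final List<T> elements = new ArrayList<>();
        final long openedAtMillis = System.currentTimeMillis();
    }

    private final BucketPathFormatter<T> formatter;
    private final int maxElementsPerBucket;
    private final long maxBucketAgeMillis;
    private final Map<String, Bucket<T>> openBuckets = new HashMap<>();

    public MultiBucketSink(BucketPathFormatter<T> formatter,
                           int maxElementsPerBucket,
                           long maxBucketAgeMillis) {
        this.formatter = formatter;
        this.maxElementsPerBucket = maxElementsPerBucket;
        this.maxBucketAgeMillis = maxBucketAgeMillis;
    }

    /** Would be called from the sink's invoke() for every incoming element. */
    public void write(T element) {
        String path = formatter.format(element);
        Bucket<T> bucket = openBuckets.computeIfAbsent(path, p -> new Bucket<T>());
        bucket.elements.add(element);
        boolean tooBig = bucket.elements.size() >= maxElementsPerBucket;
        boolean tooOld = System.currentTimeMillis() - bucket.openedAtMillis >= maxBucketAgeMillis;
        if (tooBig || tooOld) {
            flush(path, bucket);
        }
    }

    private void flush(String path, Bucket<T> bucket) {
        // A real implementation would hand the batch to a rolling writer for
        // s3://bucket/path/<bucketPath>/ here; the sketch just drops the bucket.
        System.out.printf("flushing %d elements to %s%n", bucket.elements.size(), path);
        openBuckets.remove(path);
    }
}

One known gap in the sketch: bucket age is only checked when a new element
arrives, so an idle bucket would never flush. A real sink would also need a
timer for that, and it would have to integrate with Flink's checkpointing
the way RollingSink does.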
On Wed, May 25, 2016 at 9:21 AM, Juho Autio <juho.autio@rovio.com> wrote:

> Thanks, indeed the desired behavior is to flush a bucket if its size
> exceeds a limit, but also if it has been open long enough. Contrary to
> the current RollingSink, we don't want to flush every time the bucket
> changes, but rather keep multiple buckets "open" as needed.
>
> In our case the date to use for partitioning comes from an event field,
> but needs to be formatted, too. The partitioning feature should be
> generic, allowing the caller to pass a function that formats the bucket
> path for each tuple.
>
> Does it seem like a valid plan to create a sink that internally caches
> multiple rolling sinks?
>
> On Tue, May 24, 2016 at 3:50 PM, Kostas Kloudas
> <k.kloudas@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> If I understand correctly, you want a custom RollingSink that caches
>> some buckets, one for each topic/date key, and whenever the volume of
>> buffered data exceeds a limit, it flushes to disk, right?
>>
>> If this is the case, then you are right that this is not currently
>> supported out-of-the-box, but it would be interesting to update the
>> RollingSink to support such scenarios.
>>
>> One clarification: when you say that you want to partition by date,
>> you mean the date of the event, right? Not the processing time.
>>
>> Kostas
>>
>>> On May 24, 2016, at 1:22 PM, Juho Autio <juho.autio@rovio.com> wrote:
>>>
>>> Could you suggest how to dynamically partition data with Flink
>>> streaming?
>>>
>>> We've looked at RollingSink, which takes care of writing batches to
>>> S3, but it doesn't allow defining the partition dynamically based on
>>> the tuple fields.
>>>
>>> Our data comes from Kafka and essentially has the Kafka topic and a
>>> date, among other fields.
>>>
>>> We'd like to consume all topics (and automatically subscribe to new
>>> ones) and write to S3 partitioned by topic and date, for example:
>>>
>>> s3://bucket/path/topic=topic2/date=20160522/
>>> s3://bucket/path/topic=topic2/date=20160523/
>>> s3://bucket/path/topic=topic1/date=20160522/
>>> s3://bucket/path/topic=topic1/date=20160523/
>>>
>>> There are two problems with RollingSink as it is now:
>>> - It only allows partitioning by date.
>>> - It flushes the batch every time the path changes. In our case the
>>> stream can contain, for example, a random mix of different topics,
>>> which means that RollingSink isn't able to respect the max flush size
>>> and keeps flushing the files on pretty much every tuple.
>>>
>>> We've thought that we could implement a sink that internally creates
>>> and handles multiple RollingSink instances as needed for the
>>> partitions. But it would be great to first hear any suggestions that
>>> you might have.
>>>
>>> If we have to extend RollingSink, it would be nice to make it take a
>>> partitioning function as a parameter. The function would be called
>>> for each tuple to create the output path.
>>>
>>> --
>>> View this message in context:
>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Dynamic-partitioning-for-stream-output-tp7122.html
>>> Sent from the Apache Flink User Mailing List archive at Nabble.com.
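For illustration, here is hypothetical usage of the sketch above, with the
partitioning function building paths like the S3 examples in the quoted
message. Again, this is only an assumption-laden sketch: Event is an
assumed minimal POJO, not an existing class, and the flush limits are
arbitrary.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Assumed minimal event type, for illustration only. */
class Event {
    final String topic;
    final long eventTimeMillis;

    Event(String topic, long eventTimeMillis) {
        this.topic = topic;
        this.eventTimeMillis = eventTimeMillis;
    }
}

public class TopicDatePartitioningExample {

    public static void main(String[] args) {
        SimpleDateFormat dateFormat = new SimpleDateFormat("yyyyMMdd");
        dateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));

        // The partitioning function: called per tuple to build the bucket
        // path, e.g. "topic=topic1/date=20160522" under s3://bucket/path/.
        // The date comes from an event field, not from processing time.
        MultiBucketSink.BucketPathFormatter<Event> formatter = event ->
                "topic=" + event.topic
                        + "/date=" + dateFormat.format(new Date(event.eventTimeMillis));

        // Flush a bucket at 10,000 buffered elements or after 15 minutes.
        MultiBucketSink<Event> sink =
                new MultiBucketSink<>(formatter, 10_000, 15 * 60 * 1000L);

        sink.write(new Event("topic1", 1463875200000L)); // 2016-05-22 UTC
        sink.write(new Event("topic2", 1463961600000L)); // 2016-05-23 UTC
    }
}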