flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Derek VerLee <derekver...@gmail.com>
Subject Re: Streaming : a way to "key by partition id" without redispatching data
Date Fri, 10 Nov 2017 17:09:37 GMT
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>I was about to ask this question myself.  I find myself re-keying
      by the same keys repeatedly.  I think in principle you could
      always just roll more work into one window operation with a more
      complex series of maps/folds/windowfunctions or processfunction. 
      However this doesn't always feel the most clean or convenient, or
      composible.  It would be great if there was a way to just express
      that you want to keep the same partitions as the last window, or
      that the new key is 1-to-1 with the previous one.  Even more
      generally, if the new key is "based" off the old key in a way that
      is one to one or one to many, in either case it may not be
      necessary to send data over the wire, although in the later case,
      there is a risk of hot-spotting , I suppose.<br>
    </p>
    <div class="moz-cite-prefix">On 11/10/17 12:01 PM, Gwenhael
      Pasquiers wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:VI1PR07MB3357F72FC713FEFAED9A0B72F7540@VI1PR07MB3357.eurprd07.prod.outlook.com">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
	{font-family:Wingdings;
	panose-1:5 0 0 0 0 0 0 0 0 0;}
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:#0563C1;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:#954F72;
	text-decoration:underline;}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
	{mso-style-priority:34;
	margin-top:0cm;
	margin-right:0cm;
	margin-bottom:0cm;
	margin-left:36.0pt;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
p.msonormal0, li.msonormal0, div.msonormal0
	{mso-style-name:msonormal;
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Times New Roman",serif;}
span.EmailStyle19
	{mso-style-type:personal;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
span.EmailStyle20
	{mso-style-type:personal;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
span.EmailStyle21
	{mso-style-type:personal;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
span.EmailStyle22
	{mso-style-type:personal-reply;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-size:10.0pt;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
	{page:WordSection1;}
/* List Definitions */
@list l0
	{mso-list-id:1148136473;
	mso-list-type:hybrid;
	mso-list-template-ids:83814226 1155811664 67895299 67895301 67895297 67895299 67895301 67895297
67895299 67895301;}
@list l0:level1
	{mso-level-start-at:2;
	mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:20.25pt;
	text-indent:-18.0pt;
	font-family:Symbol;
	mso-fareast-font-family:Calibri;
	mso-bidi-font-family:"Times New Roman";}
@list l0:level2
	{mso-level-number-format:bullet;
	mso-level-text:o;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:56.25pt;
	text-indent:-18.0pt;
	font-family:"Courier New";}
@list l0:level3
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:92.25pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
@list l0:level4
	{mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:128.25pt;
	text-indent:-18.0pt;
	font-family:Symbol;}
@list l0:level5
	{mso-level-number-format:bullet;
	mso-level-text:o;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:164.25pt;
	text-indent:-18.0pt;
	font-family:"Courier New";}
@list l0:level6
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:200.25pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
@list l0:level7
	{mso-level-number-format:bullet;
	mso-level-text:\F0B7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:236.25pt;
	text-indent:-18.0pt;
	font-family:Symbol;}
@list l0:level8
	{mso-level-number-format:bullet;
	mso-level-text:o;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:272.25pt;
	text-indent:-18.0pt;
	font-family:"Courier New";}
@list l0:level9
	{mso-level-number-format:bullet;
	mso-level-text:\F0A7;
	mso-level-tab-stop:none;
	mso-level-number-position:left;
	margin-left:308.25pt;
	text-indent:-18.0pt;
	font-family:Wingdings;}
ol
	{margin-bottom:0cm;}
ul
	{margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
      <div class="WordSection1">
        <p class="MsoNormal"><span lang="EN-US">I think I finally found
            a way to “simulate” a Timer thanks to the the
            processWatermark function of the AbstractStreamOperator.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Sorry for the monologue.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0cm 0cm 0cm">
            <p class="MsoNormal"><b><span
                  style="mso-fareast-language:FR" lang="EN-US">From:</span></b><span
                style="mso-fareast-language:FR" lang="EN-US"> Gwenhael
                Pasquiers [<a class="moz-txt-link-freetext" href="mailto:gwenhael.pasquiers@ericsson.com">mailto:gwenhael.pasquiers@ericsson.com</a>]
                <br>
                <b>Sent:</b> vendredi 10 novembre 2017 16:02<br>
                <b>To:</b> '<a class="moz-txt-link-abbreviated" href="mailto:user@flink.apache.org">user@flink.apache.org</a>'
                <a class="moz-txt-link-rfc2396E" href="mailto:user@flink.apache.org">&lt;user@flink.apache.org&gt;</a><br>
                <b>Subject:</b> RE: Streaming : a way to "key by
                partition id" without redispatching data<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Hello,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><span lang="EN-US">Finally, even after
            creating my operator, I still get the error : “Timers can
            only be used on keyed operators”.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Isn’t there any way
            around this ? A way to “key” my stream without shuffling the
            data ?<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0cm 0cm 0cm">
            <p class="MsoNormal"><b><span
                  style="mso-fareast-language:FR" lang="EN-US">From:</span></b><span
                style="mso-fareast-language:FR" lang="EN-US"> Gwenhael
                Pasquiers
                <br>
                <b>Sent:</b> vendredi 10 novembre 2017 11:42<br>
                <b>To:</b> Gwenhael Pasquiers &lt;<a
                  href="mailto:gwenhael.pasquiers@ericsson.com"
                  moz-do-not-send="true">gwenhael.pasquiers@ericsson.com</a>&gt;;
                '<a class="moz-txt-link-abbreviated" href="mailto:user@flink.apache.org">user@flink.apache.org</a>'
&lt;<a
                  href="mailto:user@flink.apache.org"
                  moz-do-not-send="true">user@flink.apache.org</a>&gt;<br>
                <b>Subject:</b> RE: Streaming : a way to "key by
                partition id" without redispatching data<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Maybe you don’t need to
            bother with that question.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">I’m currently
            discovering AbstractStreamOperator, OneInputStreamOperator
            and Triggerable.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">That should do it :-)<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0cm 0cm 0cm">
            <p class="MsoNormal"><b><span
                  style="mso-fareast-language:FR" lang="EN-US">From:</span></b><span
                style="mso-fareast-language:FR" lang="EN-US"> Gwenhael
                Pasquiers [<a
                  href="mailto:gwenhael.pasquiers@ericsson.com"
                  moz-do-not-send="true">mailto:gwenhael.pasquiers@ericsson.com</a>]
                <br>
                <b>Sent:</b> jeudi 9 novembre 2017 18:00<br>
                <b>To:</b> '<a class="moz-txt-link-abbreviated" href="mailto:user@flink.apache.org">user@flink.apache.org</a>'
&lt;<a
                  href="mailto:user@flink.apache.org"
                  moz-do-not-send="true">user@flink.apache.org</a>&gt;<br>
                <b>Subject:</b> Streaming : a way to "key by partition
                id" without redispatching data<o:p></o:p></span></p>
          </div>
        </div>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal">Hello,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">(Flink 1.2.1)<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><span lang="EN-US">For performances reasons
            I’m trying to reduce the volume of data of my stream as soon
            as possible by windowing/folding it for 15 minutes before
            continuing to the rest of the chain that contains keyBys and
            windows that will transfer data everywhere.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Because of the huge
            volume of data, I want to avoid “moving” the data between
            partitions as much as possible (not like a naïve KeyBy
            does). I wanted to create a custom ProcessFunction (using
            timer and state to fold data for X minutes) in order to fold
            my data over itself before keying the stream but even
            ProcessFunction needs a keyed stream…<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Is there a specific
            “key” value that would ensure me that my data won’t be moved
            to another taskmanager (that it’s hashcode will match the
            partition it is already in) ? I thought about the subtask id
            but I doubt I’d be that lucky :-) <o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Suggestions<o:p></o:p></span></p>
        <p class="MsoListParagraph"
          style="margin-left:20.25pt;text-indent:-18.0pt;mso-list:l0
          level1 lfo2">
          <!--[if !supportLists]--><span style="font-family:Symbol"
            lang="EN-US"><span style="mso-list:Ignore">·<span
                style="font:7.0pt &quot;Times New Roman&quot;">        
              </span></span></span><!--[endif]--><span lang="EN-US">Wouldn’t
            it be useful to be able to do a “partitionnedKeyBy” that
            would not move data between nodes, for windowing operations
            that can be parallelized.<o:p></o:p></span></p>
        <p class="MsoListParagraph"
          style="margin-left:56.25pt;text-indent:-18.0pt;mso-list:l0
          level2 lfo2">
          <!--[if !supportLists]--><span
            style="font-family:&quot;Courier New&quot;" lang="EN-US"><span
              style="mso-list:Ignore">o<span style="font:7.0pt
                &quot;Times New Roman&quot;">  
              </span></span></span><!--[endif]--><span lang="EN-US">Something
            like kafka =&gt; partitionnedKeyBy(0) =&gt; first folding
            =&gt; keyBy(0) =&gt; second folding =&gt; ….<o:p></o:p></span></p>
        <p class="MsoListParagraph"
          style="margin-left:20.25pt;text-indent:-18.0pt;mso-list:l0
          level1 lfo2">
          <!--[if !supportLists]--><span style="font-family:Symbol"
            lang="EN-US"><span style="mso-list:Ignore">·<span
                style="font:7.0pt &quot;Times New Roman&quot;">        
              </span></span></span><!--[endif]--><span lang="EN-US">Finally,
            aren’t all streams keyed ? Even if they’re keyed by a
            totally arbitrary partition id until the user chooses its
            own key, shouldn’t we be able to do a window (not windowAll)
            or process over any normal Stream’s partition ?<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">B.R.<o:p></o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
        <p class="MsoNormal"><span lang="EN-US">Gwenhaël PASQUIERS<o:p></o:p></span></p>
      </div>
    </blockquote>
    <br>
  </body>
</html>

Mime
View raw message