From: Dmitry Golubets
Date: Fri, 17 Feb 2017 20:17:16 +0000
Subject: Re: Performance tuning
To: user@flink.apache.org
Hi Daniel,

I've implemented a macro that generates MessagePack serializers in our codebase.
The resulting code is basically a series of writes/reads, like in hand-written structured serialization.

E.g. given
case class Data1(str: String, subdata: Data2)
case class Data2(num: Int)

the serialization code for Data1 will look like:

packer.packString(str)
packer.packInt(subdata.num)
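
To give a concrete picture, a hand-written equivalent of what such generated code might expand to, using the msgpack-core MessagePacker/MessageUnpacker API (a sketch; the helper names and the flattened field order are illustrative, not the actual macro output):

import org.msgpack.core.{MessagePacker, MessageUnpacker}

// Flatten Data1 into a stream of primitive writes; nested fields are inlined.
def packData1(packer: MessagePacker, d: Data1): Unit = {
  packer.packString(d.str)      // Data1.str
  packer.packInt(d.subdata.num) // nested Data2.num
}

// Read the fields back in the same order they were written.
def unpackData1(unpacker: MessageUnpacker): Data1 =
  Data1(unpacker.unpackString(), Data2(unpacker.unpackInt()))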

The data structures in our project are quite big (2-4 KB in JSON) and contain nested classes with many fields.
So custom serialization helps us avoid reflection and reduces the amount of data sent over the network.

However, it's worth mentioning that on small case classes Flink's default serialization works faster.


Best regards,
Dmitry

On Fri, Feb 17, 2017 at 6:01 PM, Daniel Santos <dsantos@cryptolab.net> wrote:

Hello Dmitry,

Could you please elaborate on your tuning of environment.addDefaultKryoSerializer(..)?

I'm interested in knowing what you have done there for a boost of about 50%.

A small or simple example would be very nice.

Thank you very much in advance.

Kind Regards,

Daniel Santos


On 02/17/2017 12:43 PM, Dmitry Golubets wrote:
Hi,

My streaming job cannot benefit much from parallelization, unfortunately.
So I'm looking for things I can tune in Flink to make it process a sequential stream faster.

So far, in our current engine based on Akka Streams (non-distributed, of course) we get 20k msg/sec.
Ported to Flink, I'm getting 14k so far.

My observations are the following:
  • if I chain operations together, they all execute in sequence, so I basically sum up the time required to process one data item across all my stream operators, which is not good
  • if I split chains, they execute asynchronously to each other, but there is serialization and network overhead

The second approach gives me better results, considering that I have a server with more than enough memory and cores to do all the side work for serialization. But I want to reduce this serialization/data transfer overhead to a minimum.
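
For reference, this splitting can be controlled per operator in the DataStream API; a minimal sketch with a made-up pipeline (the pipeline itself is illustrative):

import org.apache.flink.streaming.api.scala._

object ChainingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // env.disableOperatorChaining() // global switch: never chain operators

    env.fromElements("a", "bb", "ccc")
      .map(_.length)   // chained with the source by default
      .startNewChain() // this map starts a new chain, cut off from upstream
      .filter(_ > 1)   // chained together with the map above
      .print()

    env.execute("chaining sketch")
  }
}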

So what I have now:

environment.getConfig.enableObjectReuse() // since it's Scala, we don't need unnecessary copies
environment.getConfig.disableAutoTypeRegistration() // it works faster with it, I'm not sure why
environment.addDefaultKryoSerializer(..) // custom MessagePack serialization for all message types, gives about a 50% boost
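
A sketch of what that last registration can look like end to end, reusing the Data1/Data2 classes from above (the serializer class name and the length-prefix framing are illustrative assumptions, not the actual macro output):

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.msgpack.core.MessagePack

// Hypothetical Kryo serializer that delegates to msgpack-core for Data1.
class Data1MsgPackSerializer extends Serializer[Data1] with Serializable {
  override def write(kryo: Kryo, output: Output, d: Data1): Unit = {
    val packer = MessagePack.newDefaultBufferPacker()
    packer.packString(d.str)
    packer.packInt(d.subdata.num)
    val bytes = packer.toByteArray // flushes and returns the encoded bytes
    packer.close()
    output.writeInt(bytes.length)  // length prefix so read() knows how much to consume
    output.writeBytes(bytes)
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[Data1]): Data1 = {
    val bytes = input.readBytes(input.readInt())
    val unpacker = MessagePack.newDefaultUnpacker(bytes)
    try Data1(unpacker.unpackString(), Data2(unpacker.unpackInt()))
    finally unpacker.close()
  }
}

environment.addDefaultKryoSerializer(classOf[Data1], classOf[Data1MsgPackSerializer])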

But that's it, I don't know what else to do.
I didn't find any interesting network/buffer settings in the docs.
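
For what it's worth, one buffer-related knob that does exist is the output buffer flush timeout (a sketch; the 5 ms value is arbitrary):

// How long to wait before shipping a non-full network buffer downstream;
// lower values favor latency, higher values favor throughput.
environment.setBufferTimeout(5)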

Best regards,
Dmitry

