flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: Flink Serialization as stable (kafka) output format?
Date Sun, 19 Apr 2020 06:23:42 GMT
Hey Theo,

we recently published a blog post that answers your request for a
comparison between Kryo and Avro in Flink:
https://flink.apache.org/news/2020/04/15/flink-serialization-tuning-vol-1.html

On Tue, Mar 10, 2020 at 9:27 AM Arvid Heise <arvid@ververica.com> wrote:

> Hi Theo,
>
> I strongly discourage the use of Flink serialization for persistent
> storage of data. It was never intended to work in this way and does not
> offer Avro's benefits of lazy schema evolution and maturity.
>
> Unless you can explicitly measure that Avro is a bottleneck in your setup,
> stick with it. It's the preferred way to store data in Kafka for a reason.
> It's mature, supports plenty of languages, and the schema evolution feature
> will save you so many headaches in the future.
>
> If it turns out to be a bottleneck, the most logical alternative is
> protobuf. Kryo is even worse than the Flink serializer for Kafka. In
> general, realistically speaking, it's so much more cost-effective to just
> add another node to your Flink cluster and use Avro than to come up with
> any clever solution (just assume that you need at least one man month to
> implement it and do the math).
>
> And btw, you should always use generated Java/Scala classes for Avro if
> possible. It's faster and offers a much nicer development experience.
>
> On Mon, Mar 9, 2020 at 3:57 PM Robert Metzger <rmetzger@apache.org> wrote:
>
>> Hi Theo,
>>
>>> However, in most benchmarks, avro turns out to be rather slow in terms of
>>> CPU cycles (e.g. [1] <https://github.com/eishay/jvm-serializers/wiki>)
>>
>>
>> Avro is slower compared to what?
>> You should not only benchmark the CPU cycles for serializing the data. If
>> you are sending JSON strings, you'll probably have a lot more bytes to
>> send across the network, making everything slower (the network is usually
>> slower than the CPU).
>>
>> One of the reasons why people use Avro is that it supports schema evolution.
>>
>> Regarding your questions:
>> 1. For this use case, you can use the Flink data format as an internal
>> message format (between the star architecture jobs)
>> 2. Generally speaking no
>> 3. You will at least have a dependency on flink-core. And this is a
>> somewhat custom setup, so you might be facing breaking API changes.
>> 4. I'm not aware of any benchmarks. The Flink serializers are mostly for
>> internal use (between our operators), Kryo is our fallback (to not suffer
>> too much from the not-invented-here syndrome), while Avro is meant for
>> cross-system serialization.
>>
>> I have the feeling that you can move ahead with using Flink's Pojo
>> serializer everywhere :)
>>
>> Best,
>> Robert
>>
>>
>>
>>
>> On Wed, Mar 4, 2020 at 1:04 PM Theo Diefenthal <
>> theo.diefenthal@scoop-software.de> wrote:
>>
>>> Hi,
>>>
>>> Without knowing too much about Flink serialization, I know that Flink
>>> states that it serializes POJO types much faster than even the fast Kryo
>>> for Java. I further know that it supports schema evolution in the same
>>> way as Avro.
>>>
>>> In our project, we have a star architecture, where one flink job
>>> produces results into a kafka topic and where we have multiple downstream
>>> consumers from that kafka topic (Mostly other flink jobs).
>>> For fast development cycles, we currently use JSON as output format for
>>> the kafka topic due to easy debugging capabilities and best migration
>>> possibilities. However, when scaling up, we need to switch to a more
>>> efficient format. Most often, Avro is mentioned in combination with a
>>> schema registry, as it's much more efficient than JSON where essentially,
>>> each message contains the schema as well. However, in most benchmarks, Avro
>>> turns out to be rather slow in terms of CPU cycles (e.g. [1]
>>> <https://github.com/eishay/jvm-serializers/wiki>)
>>>
>>> My question(s) now:
>>> 1. Is it reasonable to use flink serializers as message format in Kafka?
>>> 2. Are there any downsides in using Flink's serialization result as
>>> output format to kafka?
>>> 3. Can downstream consumers, written in Java, but not flink components,
>>> also easily deserialize flink serialized POJOs? Or do they have a
>>> dependency on at least the full flink-core?
>>> 4. Do you have benchmarks comparing flink (de-)serialization performance
>>> to e.g. kryo and avro?
>>>
>>> The only reason I can come up with for not using Flink serialization is
>>> that we wouldn't have a schema registry, but in our case, we share all our
>>> POJOs in a jar which is used by all components, so that is kind of a schema
>>> registry already, and if we only make Avro-compatible changes, which are
>>> also well treated by Flink, that shouldn't be any limitation compared to,
>>> say, Avro + registry?
>>>
>>> Best regards
>>> Theo
>>>
>>> [1] https://github.com/eishay/jvm-serializers/wiki
>>>
>>
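
The point made in the thread, that JSON carries its field names in every message while a schema-based binary format does not, can be illustrated with a rough sketch. Everything here is an assumption for illustration: the record, its field names, and the fixed `struct` layout are invented, and the layout is only a stand-in for "schema kept outside the message". It is not Avro's or Flink's actual wire format.

```python
import json
import struct

# Hypothetical record, invented for illustration only.
record = {"sensor_id": 1234, "timestamp_ms": 1583327040000, "temperature": 21.5}

# JSON: the field names travel with every single message.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed binary layout that producer and consumer agree on out of band
# (a stand-in for a shared schema): int32, int64, float64, big-endian,
# no padding.
LAYOUT = struct.Struct(">iqd")
binary_bytes = LAYOUT.pack(
    record["sensor_id"], record["timestamp_ms"], record["temperature"]
)


def decode(payload: bytes) -> dict:
    """Rebuild the record from the binary payload using the shared layout."""
    sensor_id, timestamp_ms, temperature = LAYOUT.unpack(payload)
    return {
        "sensor_id": sensor_id,
        "timestamp_ms": timestamp_ms,
        "temperature": temperature,
    }


print(f"JSON payload:   {len(json_bytes)} bytes")
print(f"binary payload: {len(binary_bytes)} bytes")
```

The binary payload is smaller because only the values are sent. The price is that both sides must agree on the layout, which is exactly the job a schema registry (or a shared jar of POJOs) takes on, and a fixed layout like this one has no schema evolution at all.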
