beam-commits mailing list archives

From "peay (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (BEAM-1573) KafkaIO does not allow using Kafka serializers and deserializers
Date Thu, 02 Mar 2017 18:38:45 GMT

    [ https://issues.apache.org/jira/browse/BEAM-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892746#comment-15892746 ]

peay edited comment on BEAM-1573 at 3/2/17 6:38 PM:
----------------------------------------------------

My concern applies to both the source and the sink.

I'd like to be able to use custom {{org.apache.kafka.common.serialization.Serializer}} and {{Deserializer}} implementations.
One example is http://docs.confluent.io/2.0.0/schema-registry/docs/serializer-formatter.html#serializer,
for setups where Kafka topics contain Avro records serialized through a schema registry.
That one uses a {{Serializer/Deserializer<Object>}}, but I also have similar Kafka serializers
for arbitrary types.

The encoding/decoding methods of {{org.apache.kafka.common.serialization.Serializer}} and {{Deserializer}}
are:
- {{byte[] serialize(String topic, T data)}}
- {{T deserialize(String topic, byte[] data)}}
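
For concreteness, here is a minimal sketch of such a custom deserializer. It is illustrative only: the class name matches the placeholder used in the snippet below, and a real registry-backed implementation would do schema lookups instead of decoding UTF-8.
{code}
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;

// Illustrative stand-in for a registry-backed deserializer. A real
// implementation would use configure() to pick up the registry URL and
// deserialize() to resolve the writer schema for the topic.
public class SomeCustomKafkaDeserializer implements Deserializer<String> {
    private String registryUrl;

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // Kafka passes the full consumer config here, so serializer-specific
        // settings such as "schema.registry.url" can be read.
        Object url = configs.get("schema.registry.url");
        registryUrl = (url == null) ? null : url.toString();
    }

    @Override
    public String deserialize(String topic, byte[] data) {
        // Unlike a Beam Coder, the topic name is available at this point.
        return (data == null) ? null : new String(data, StandardCharsets.UTF_8);
    }

    @Override
    public void close() {}
}
{code}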

I would like to be able to support a syntax like this:
{code}
KafkaIO
    .read()
    .withBootstrapServers(this.broker)
    .withTopics(ImmutableList.of(this.topic))
    .withCustomKafkaValueDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx))
    .withCustomKafkaKeyDeserializerAndCoder(new SomeCustomKafkaDeserializer(), AvroCoder.of(xxx));

KafkaIO
    .write()
    .withBootstrapServers(this.broker)
    .withTopic(this.topic)
    .withCustomKafkaValueSerializer(new SomeCustomKafkaSerializer())
    .withCustomKafkaKeySerializer(new SomeCustomKafkaSerializer());
{code}

In both cases, Kafka would then use the custom serializer/deserializer directly.

Now, why is this hard to express currently? KafkaIO is implemented differently for
reads and writes, so let us consider the two cases. I have a working patch for the syntax above;
it is straightforward for writes, but requires a fair number of changes for reads.

For writes, the {{Coder}} is wrapped into an actual {{org.apache.kafka.common.serialization.Serializer}}
through {{CoderBasedKafkaSerializer}}. I can write a {{CustomCoder}}, but I still have to pass
it the topic name manually. Also, we end up with a Kafka serializer wrapped in a {{Coder}},
itself wrapped in a Kafka serializer.
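
To illustrate the duplication, here is a rough, hypothetical sketch of such a wrapper (not an actual Beam class, and the {{Coder}} method signatures have varied across Beam versions):
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.coders.CoderException;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical wrapper. Note how the topic has to be captured at
// construction time, duplicating the topic already given to KafkaIO.write().
public class KafkaSerializerCoder<T> extends CustomCoder<T> {
    private final Serializer<T> serializer; // must itself be serializable
    private final String topic;             // duplicated from the sink config

    public KafkaSerializerCoder(Serializer<T> serializer, String topic) {
        this.serializer = serializer;
        this.topic = topic;
    }

    @Override
    public void encode(T value, OutputStream out) throws IOException {
        // The Coder API provides no topic, so we use the captured one.
        out.write(serializer.serialize(topic, value));
    }

    @Override
    public T decode(InputStream in) throws IOException {
        // Decoding would need the matching Deserializer; omitted here.
        throw new CoderException("decode is not supported in this sketch");
    }
}
{code}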

Reads are implemented differently (I am not sure why). Instead of wrapping the coders into
a Kafka deserializer, everything is hard-wired to use a {{byte[]}} Kafka consumer; KafkaIO then
calls the coder on the data returned by the consumer. Here too, one can write a {{CustomCoder}},
but this does not work when a list of topics is used as input to KafkaIO, and it still requires
passing in the topic name manually even when there is only one. This is why, in the example
snippet above, I also include a second argument that is a coder, to be used with {{setCoder}}
for setting up the rest of the pipeline.
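
Schematically, the current read path looks something like this (a simplification of mine, not the actual KafkaIO source):
{code}
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.coders.Coder;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Simplified view of the read path: the consumer is fixed to byte[]
// keys/values and the Beam coders decode afterwards, so a real Kafka
// Deserializer never sees the topic name through this path.
public class ReadPathSketch {
    static <K, V> void pollOnce(KafkaConsumer<byte[], byte[]> consumer,
                                Coder<K> keyCoder, Coder<V> valueCoder,
                                String topic) throws IOException {
        consumer.subscribe(Collections.singletonList(topic));
        for (ConsumerRecord<byte[], byte[]> record : consumer.poll(1000)) {
            K key = keyCoder.decode(new ByteArrayInputStream(record.key()));
            V value = valueCoder.decode(new ByteArrayInputStream(record.value()));
            // ... hand (key, value) off to the pipeline ...
        }
    }
}
{code}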

In both cases, wrapping the Kafka serializer in a {{Coder}} is also not straightforward, because
Kafka serializers have a {{configure}} method that gives them access to the consumer/producer
config, and this would have to be emulated in the coder wrapper.
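
Concretely, the wrapper would have to do something like the following by hand ({{SomeCustomSerializer}} is a placeholder; the registry URL is an example value):
{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.Serializer;

public class ConfigureByHand {
    // The coder wrapper has to call configure() itself, rebuilding the
    // config map that the Kafka producer would otherwise supply.
    static <T> Serializer<T> configured(Serializer<T> serializer) {
        Map<String, Object> producerConfig = new HashMap<>();
        producerConfig.put("schema.registry.url", "http://registry.example:8081");
        serializer.configure(producerConfig, false /* isKey */);
        return serializer;
    }
}
{code}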

What do you think? 



> KafkaIO does not allow using Kafka serializers and deserializers
> ----------------------------------------------------------------
>
>                 Key: BEAM-1573
>                 URL: https://issues.apache.org/jira/browse/BEAM-1573
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>    Affects Versions: 0.4.0, 0.5.0
>            Reporter: peay
>            Assignee: Raghu Angadi
>            Priority: Minor
>
> KafkaIO does not allow overriding the serializer and deserializer settings of the Kafka
> consumers and producers it uses internally. Instead, it allows setting a `Coder`, and has a
> simple Kafka serializer/deserializer wrapper class that calls the coder.
> I appreciate that supporting Beam coders is good and consistent with the rest of
> the system. However, is there a reason to completely disallow using custom Kafka serializers
> instead?
> This is a limitation when working with an Avro schema registry, for instance, which requires
> custom serializers. One can write a `Coder` that wraps a custom Kafka serializer, but that
> means two levels of unnecessary wrapping.
> In addition, the `Coder` abstraction is not equivalent to Kafka's `Serializer`, which
> gets the topic name as input. Using a `Coder` wrapper would require duplicating the output
> topic setting in the argument to `KafkaIO` and when building the wrapper, which is inelegant
> and error-prone.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
