spark-issues mailing list archives

From "Jungtaek Lim (JIRA)" <>
Subject [jira] [Commented] (SPARK-25937) Support user-defined schema in Kafka Source & Sink
Date Mon, 05 Nov 2018 05:12:00 GMT


Jungtaek Lim commented on SPARK-25937:

Another thought from my side: maybe we can classify the various formats into two groups:
those applied to a whole file, and those applied to each line/record. Once we classify
them, the formats that can be applied to Kafka fall into the latter group, and we could
expose them as functions, like the JSON functions (from_json / to_json).

Once they are added as functions, they can be used widely and don't require the data source
to be aware of the data format. (If we want to apply pushdown to the data source, we may
still want to let the data source be aware of the data format.)
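
To make the per-record idea concrete, here is a minimal plain-Python sketch (not Spark code, and not a proposed API). The function names mirror from_json / to_json only for illustration; from_csv, SCHEMA, and the kafka_like_rows data are hypothetical. It only shows that a record-level format can be an ordinary function over the value bytes, so the source itself never needs to know the format.

```python
import csv
import io
import json

# Hypothetical user-defined schema: field name -> type used to cast each value.
SCHEMA = {"id": int, "name": str}


def from_json(value: bytes, schema: dict) -> dict:
    """Parse one JSON-encoded record, keeping only the fields in the schema."""
    raw = json.loads(value.decode("utf-8"))
    return {field: cast(raw[field]) for field, cast in schema.items()}


def from_csv(value: bytes, schema: dict) -> dict:
    """Parse one CSV-encoded line, mapping columns to schema fields in order."""
    row = next(csv.reader(io.StringIO(value.decode("utf-8"))))
    return {field: cast(cell) for (field, cast), cell in zip(schema.items(), row)}


def to_json(record: dict) -> bytes:
    """Serialize one record back to JSON bytes (keys sorted for stable output)."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Stand-in for rows from a Kafka-like source with a fixed key/value layout.
# The source only hands over bytes; the format is applied afterwards.
kafka_like_rows = [
    {"key": b"k1", "value": b'{"id": 1, "name": "a"}'},
    {"key": b"k2", "value": b'{"id": 2, "name": "b"}'},
]

parsed = [from_json(row["value"], SCHEMA) for row in kafka_like_rows]
csv_parsed = from_csv(b"3,c", SCHEMA)
```

The same functions would work on bytes coming from any other source, which is the point of keeping the format out of the source. A whole-file format could not be handled this way, because it needs more than one record to decode.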

> Support user-defined schema in Kafka Source & Sink
> --------------------------------------------------
>                 Key: SPARK-25937
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jackey Lee
>            Priority: Major
>     Kafka Source & Sink is widely used in Spark and is the most frequently used source
> and sink in streaming production environments. But at present, both Kafka Source and Sink
> use a fixed schema, which forces users to do data conversion when reading from and writing
> to Kafka. So why don't we use FileFormat to do this, just like Hive?
>     Flink has implemented extended Kafka Source & Sink for Json/Csv/Avro; we can
> also support this in Spark.
> *Main Goals:*
> 1. Provide a Source and Sink that support a user-defined schema. Users can read and write
> Kafka directly in the program without additional data conversion.
> 2. Provide a read-write mechanism based on FileFormat. Users' data conversion is similar
> to FileFormat's read and write process, so we can provide a mechanism similar to FileFormat
> that provides common read-write format conversion. It also allows users to customize format

This message was sent by Atlassian JIRA
