spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <>
Subject Re: [discuss] Data Source V2 write path
Date Wed, 20 Sep 2017 16:52:47 GMT
On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <> wrote:

> Hi all,
> I want to have some discussion about Data Source V2 write path before
> starting a voting.
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing upper-level API like DataFrame to Data Source API is not good
> for maintenance.
> 2. Data sources may need to preprocess the input data before writing, like
> cluster/sort the input by some columns. It's better to do the preprocessing
> in Spark instead of in the data source.
> 3. Data sources need to take care of transaction themselves, which is
> hard. And different data sources may come up with a very similar approach
> for the transaction, which leads to many duplicated codes.
> To solve these pain points, I'm proposing a data source writing framework
> which is very similar to the reading framework, i.e., WriteSupport ->
> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
> prototype to see what it looks like:
> apache/spark/pull/19269
> There are some other details need further discussion:
> 1. *partitioning/bucketing*
> Currently only the built-in file-based data sources support them, but
> there is nothing stopping us from exposing them to all data sources. One
> question is, shall we make them as mix-in interfaces for data source v2
> reader/writer, or just encode them into data source options(a
> string-to-string map)? Ideally it's more like options, Spark just transfers
> these user-given informations to data sources, and doesn't do anything for
> it.

I'd just pass them as options, until there are clear (and strong) use cases
to do them otherwise.

+1 on the rest.

> 2. *input data requirement*
> Data sources should be able to ask Spark to preprocess the input data, and
> this can be a mix-in interface for DataSourceV2Writer. I think we need to
> add clustering request and sorting within partitions request, any more?
> 3. *transaction*
> I think we can just follow `FileCommitProtocol`, which is the internal
> framework Spark uses to guarantee transaction for built-in file-based data
> sources. Generally speaking, we need task level and job level commit/abort.
> Again you can see more details in my prototype about it:
> 4. *data source table*
> This is the trickiest one. In Spark you can create a table which points to
> a data source, so you can read/write this data source easily by referencing
> the table name. Ideally data source table is just a pointer which points to
> a data source with a list of predefined options, to save users from typing
> these options again and again for each query.
> If that's all, then everything is good, we don't need to add more
> interfaces to Data Source V2. However, data source tables provide special
> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which requires data
> sources to have some extra ability.
> Currently these special operators only work for built-in file-based data
> sources, and I don't think we will extend it in the near future, I propose
> to mark them as out of the scope.
> Any comments are welcome!
> Thanks,
> Wenchen

View raw message