From: Wenchen Fan
Date: Mon, 25 Sep 2017 10:17:10 +0800
Subject: Re: [discuss] Data Source V2 write path
To: Ryan Blue
Cc: Spark dev list

I agree it would be a clean approach if a data source were only responsible for writing into an already-configured table. However, without catalog federation, Spark doesn't have an API to ask an external system (like Cassandra) to create a table. Currently it's all done by the data source write API: data source implementations are responsible for creating a table or inserting into it according to the save mode.

As a workaround, I think it's acceptable to pass partitioning/bucketing information via data source options. Data sources should then decide whether to take this information and create the table, or throw an exception if it doesn't match the already-configured table.
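A minimal sketch of that workaround, assuming a hypothetical "partitionColumns" option key and placeholder helpers for the external system (none of these names are part of any proposed API):

  // Hypothetical sketch of the workaround: read user-supplied partitioning
  // from the options map and validate it against the table the external
  // system already knows about. The option key and helpers are assumptions.
  class ExampleWriteSupport {

    def prepareWrite(options: Map[String, String]): Unit = {
      // e.g. options("partitionColumns") = "year,month"
      val requested: Seq[String] = options.get("partitionColumns")
        .map(_.split(",").map(_.trim).toSeq)
        .getOrElse(Seq.empty)

      fetchExistingPartitioning(options) match {
        case Some(existing) if requested.nonEmpty && existing != requested =>
          // table already configured with different partitioning: fail fast
          throw new IllegalArgumentException(
            s"Requested partitioning $requested does not match existing partitioning $existing")
        case None =>
          // no table yet: create it with the requested partitioning
          createTable(options, requested)
        case _ =>
          // partitioning matches (or none was requested): proceed with the write
      }
    }

    // placeholders standing in for calls to the external system (e.g. Cassandra)
    private def fetchExistingPartitioning(options: Map[String, String]): Option[Seq[String]] = None
    private def createTable(options: Map[String, String], partitionCols: Seq[String]): Unit = ()
  }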
On Fri, Sep 22, 2017 at 9:35 AM, Ryan Blue <rblue@netflix.com> wrote:

> > input data requirement
>
> Clustering and sorting within partitions are a good start. We can always
> add more later when they are needed.
>
> The primary use case I'm thinking of for this is partitioning and
> bucketing. If I'm implementing a partitioned table format, I need to tell
> Spark to cluster by my partition columns. Should there also be a way to
> pass those columns separately, since they may not be stored in the same
> way that partitions are in the current format?
>
> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>> Hi all,
>>
>> I want to have some discussion about the Data Source V2 write path
>> before starting a vote.
>>
>> The Data Source V1 write path asks implementations to write a DataFrame
>> directly, which is painful:
>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>> not good for maintenance.
>> 2. Data sources may need to preprocess the input data before writing,
>> e.g. cluster/sort the input by some columns. It's better to do the
>> preprocessing in Spark instead of in the data source.
>> 3. Data sources need to handle transactions themselves, which is hard.
>> And different data sources may come up with very similar approaches to
>> transactions, which leads to a lot of duplicated code.
>>
>> To solve these pain points, I'm proposing a data source writing
>> framework which is very similar to the reading framework, i.e.,
>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can
>> take a look at my prototype to see what it looks like:
>> https://github.com/apache/spark/pull/19269
>>
>> There are some other details that need further discussion:
>>
>> 1. *partitioning/bucketing*
>> Currently only the built-in file-based data sources support them, but
>> there is nothing stopping us from exposing them to all data sources. One
>> question is, shall we make them mix-in interfaces for the data source v2
>> reader/writer, or just encode them into the data source options (a
>> string-to-string map)? Ideally this works more like options: Spark just
>> passes this user-given information to the data sources and doesn't do
>> anything with it.
>>
>> 2. *input data requirement*
>> Data sources should be able to ask Spark to preprocess the input data,
>> and this can be a mix-in interface for DataSourceV2Writer. I think we
>> need to add a clustering request and a sorting-within-partitions
>> request; do we need any more?
>>
>> 3. *transaction*
>> I think we can just follow `FileCommitProtocol`, which is the internal
>> framework Spark uses to guarantee transactional writes for the built-in
>> file-based data sources. Generally speaking, we need task-level and
>> job-level commit/abort. Again, you can see more details about it in my
>> prototype: https://github.com/apache/spark/pull/19269
>>
>> 4. *data source table*
>> This is the trickiest one. In Spark you can create a table which points
>> to a data source, so you can read/write that data source easily by
>> referencing the table name. Ideally a data source table is just a
>> pointer to a data source with a list of predefined options, to save
>> users from typing these options again and again for each query.
>> If that's all, then everything is good and we don't need to add more
>> interfaces to Data Source V2. However, data source tables provide
>> special operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which
>> require data sources to have some extra abilities.
>> Currently these special operators only work for the built-in file-based
>> data sources, and I don't think we will extend them in the near future,
>> so I propose to mark them as out of scope.
>>
>> Any comments are welcome!
>>
>> Thanks,
>> Wenchen
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
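For reference, a minimal sketch of what the WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter chain described above could look like, including the task-level and job-level commit/abort hooks from point 3. The method signatures are assumptions for illustration; the actual interfaces are defined in the prototype PR (https://github.com/apache/spark/pull/19269):

  // Assumed shapes, loosely following the proposal; not the final API.
  trait WriteSupport {
    // called on the driver; None means the source refuses the write job
    def createWriter(jobId: String, options: Map[String, String]): Option[DataSourceV2Writer]
  }

  trait DataSourceV2Writer {
    // produces a serializable factory that executors use to write partitions
    def createWriteTask(): WriteTask
    // job-level transaction hooks, driven by Spark on the driver
    def commit(taskCommitMessages: Seq[String]): Unit
    def abort(): Unit
  }

  trait WriteTask extends Serializable {
    // runs on executors; one DataWriter per partition attempt
    def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
  }

  trait DataWriter {
    def write(record: Seq[Any]): Unit
    // task-level transaction hooks: commit returns a message handed back to
    // DataSourceV2Writer.commit on the driver; abort rolls back this task
    def commit(): String
    def abort(): Unit
  }

The idea, following the FileCommitProtocol analogy above, is that Spark would call DataWriter.commit for each successful task and DataSourceV2Writer.commit once on the driver after all tasks have committed, or the corresponding abort on failure.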
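Point 2 (input data requirement) could similarly be a small mix-in on the writer side. A sketch with assumed trait and method names, covering the clustering and sorting-within-partitions requests mentioned above:

  // Hypothetical mix-in for DataSourceV2Writer: the data source declares how
  // Spark should preprocess the input before handing it to DataWriters.
  // Trait and method names are assumptions, not the proposed API.
  trait SupportsWriteRequirements {
    // columns Spark should cluster the input rows by,
    // e.g. the table's partition or bucket columns
    def requiredClustering: Seq[String]

    // columns Spark should sort by within each partition before writing
    def requiredOrdering: Seq[String]
  }

A partitioned table format, as in Ryan's use case, would return its partition columns from requiredClustering so that all rows for a given partition arrive at the same DataWriter.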