From: Wenchen Fan
Date: Thu, 7 Sep 2017 10:32:44 +0800
Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2
To: Ryan Blue
Cc: Reynold Xin, James Baker, Spark dev list

Hi Ryan,

Yea, I agree with you that we should discuss some substantial details during the vote, and I addressed your comments about the schema inference API in my new PR; please take a look.

I've also called a new vote for the read path; please vote there, thanks!

On Thu, Sep 7, 2017 at 7:55 AM, Ryan Blue <rblue@netflix.com> wrote:
I'm all for keeping this moving and not getting too far into the details (like naming), but I think the substantial details should be clarified first since they are in the proposal that's being voted on.

I would prefer moving the write side to a separate SPIP, too, since there isn't much detail in the proposal and I think we should be more deliberate with things like schema evolution.

<= div class=3D"gmail_extra">
On Thu, Aug 31, 2017 at 10:33 AM, Wenchen Fan <cloud0fan@gmail.com> wrote:
Hi Ryan,

I think for a SPIP, we should not worry too much about details, as we can discuss them during PR review after the vote passes.

I think we should focus more on the overall design, like James did. The interface mix-in vs. plan push-down discussion was great; I hope we can reach a consensus on this topic soon. The current proposal is that we keep the interface mix-in framework and add an unstable plan push-down trait.
For details like interface names, sort push-down vs. sort propagate, etc., I think they should not block the vote, as they can be updated/improved within the current interface mix-in framework.

About separating the read/write proposals, we should definitely send individual PRs for read/write when developing data source v2. I'm also OK with voting on the read side first. The write side is way simpler than the read side; I think it's more important to get agreement on the read side first.

BTW, I do appreciate your feedback/comments on the prototype; let's keep the discussion there. In the meantime, let's have more discussion on the overall framework and drive this project together.

Wenchen


On Thu, Aug 31, 2017 at 6:22 AM, Ryan Blue <rblue@netflix.com> wrote:
Maybe I'm missing something, but the high-level proposal consists of: Goals, Non-Goals, and Proposed API. What is there to discuss other than the details of the API that's being proposed? I think the goals make sense, but goals alone aren't enough to approve a SPIP.

On Wed, Aug 30, 2017 at 2:46 PM, Reynold Xin <rxin@databricks.com> wrote:
So we seem to be getting into a cycle of discussing more about the details of APIs than the high-level proposal. The details of APIs are important to debate, but those belong more in code reviews.

One other important thing is that we should avoid API design by committee. While it is extremely useful to get feedback and understand the use cases, we cannot do API design by incorporating verbatim the union of everybody's feedback. API design is largely a tradeoff game. The most expressive API would also be harder to use, or sacrifice backward/forward compatibility. It is as important to decide what to exclude as what to include.

Unlike the v1 API, the way Wenchen's high-level V2 framework is proposed makes it very easy to add new features (e.g. clustering properties) in the future without breaking any APIs. I'd rather we ship something useful that might not be the most comprehensive set than debate every single feature we should add and then create something super complicated that has unclear value.



On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue <rblue@netflix.com> wrote:
<= div class=3D"m_-1187179626336977084m_4021611465736319531m_81307449686579095= 05m_6320936559034470976m_8249949939977008092markdown-here-wrapper">

-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and comment, so thanks for starting this one... but there's still discussion happening on the prototype API, and it hasn't been updated yet. I'd like to see the proposal shaped by the ongoing discussion so that we have a better, more concrete plan. I think that's going to produce a better SPIP.

The second reason for -1 is that I think the read- and write-side proposals should be separated. The PR currently has "write path" listed as a TODO item and most of the discussion I've seen is on the read side. I think it would be better to separate the read and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the proposal says this:

Ideally partitioning/bucketing concept should not be exposed in the Data Source API V2, because they are just techniques for data skipping and pre-partitioning. However, these 2 concepts are already widely used in Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION. To be consistent, we need to add partitioning/bucketing to Data Source V2 . . .

Essentially, some of the APIs mix DDL and DML operations. I'd like to consider ways to fix that problem instead of carrying the problem forward to Data Source V2. We can solve this by adding a high-level API for DDL and a better write/insert API that works well with it. Clearly, that discussion is independent of the read path, which is why I think separating the two proposals would be a win.
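
As a purely illustrative sketch of that DDL/DML split (all names here are hypothetical, not a concrete proposal):

// Illustrative only: keep table-level (DDL) concerns separate from the write path (DML).
interface TableCatalog {
    // DDL: partitioning/bucketing are declared when the table is created, not on every write.
    void createTable(String name, Schema schema, java.util.List<String> partitionColumns);
}

interface WriteSupport {
    // DML: the write/insert path only appends or overwrites data in an existing table.
    void insert(String table, DataBatch data, boolean overwrite);
}

// Stand-ins so the sketch is self-contained; not real Spark types.
interface Schema {}
interface DataBatch {}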

rb


On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <rxin@databricks.com> wrote:
That might be good to do, but that seems orthogonal to this effort itself. It would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
OK, I agree with it. How about we add a new interface to push down the query plan, based on the current framework? We can mark the query-plan-push-down interface as unstable, to save the effort of designing a stable representation of the query plan and maintaining forward compatibility.
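
As a rough sketch of what that could look like (the names are hypothetical, and "LogicalPlan" stands in for Catalyst's internal, non-stable plan class):

// Hypothetical, explicitly unstable mix-in.
interface LogicalPlan {}  // stand-in for Catalyst's internal plan type, no compatibility guarantee

interface SupportsPlanPushDown {
    // Spark hands over the plan fragment it is willing to push down; the data source
    // returns whatever part it cannot handle, and Spark evaluates that remainder itself.
    LogicalPlan pushDown(LogicalPlan fragment);
}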

On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.baker@outlook.com> wrote:
I'll just focus on the one-by-one thing for now - it's the thing that blocks me the most.

I think the place where we're most confused here is on the cost of determining whether I can push down a filter. For me, in order to work out whether I can push down a filter or satisfy a sort, I might have to read plenty of data. That said, it's worth me doing this because I can use this information to avoid reading that much data.

If you give me all the orderings, I will have to read that data many times (we stream it to avoid keeping it in memory).

There's also a thing where our typical use cases have many filters (20+ is common). So, it's likely not going to work to pass us all the combinations. That said, if I can tell you a cost and I know what optimal looks like, why can't I just pick that myself?

The current design is friendly to simple datasources, but does not have the potential to support this.

So the main problem we have with datasources v1 is that it's essentially impossible to leverage a bunch of Spark features - I don't get to use bucketing or row batches or all the nice things that I really want to use to get decent performance. Provided I can leverage these in a moderately supported way which won't break in any given commit, I'll be pretty happy with anything that lets me opt out of the restrictions.

My suggestion here is that if you make a mode which works well for complicated use cases, you end up being able to write the simple mode in terms of it very easily. So we could actually provide two APIs: one that lets people who have more interesting datasources leverage the cool Spark features, and one that lets people who just want to implement basic features do that - I'd try to include some kind of layering here. I could probably sketch out something here if that'd be useful?
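
Roughly the kind of layering I mean, as a sketch only with made-up names:

// Sketch: the "simple" API is just a default implementation of the "advanced" one,
// so basic datasources implement almost nothing while advanced ones get full control.
interface Pushdown {}  // stand-in for anything Spark can push down (filter, sort, limit, ...)

interface AdvancedScan {
    // Given everything Spark would like to push down, return the subset NOT handled.
    java.util.List<Pushdown> pushDown(java.util.List<Pushdown> requested);
}

interface SimpleScan extends AdvancedScan {
    // Simple sources handle no pushdowns at all; Spark does the filtering/sorting.
    @Override
    default java.util.List<Pushdown> pushDown(java.util.List<Pushdown> requested) {
        return requested;
    }
}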

James

On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0fan@gmail.com> wrote:
Hi James,

Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here.

> Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments

The problem with this approach is: 1) if we want to add more arguments in the future, it's really hard to do without changing the existing interface; 2) if a user wants to implement a very simple data source, he has to look at all the arguments and understand them, which may be a burden for him.
I don't have a solution to these 2 problems; comments are welcome.


> There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

This is true; the current framework applies push-downs one by one, incrementally. If a data source wants to go back and accept a sort push-down after it accepts a filter push-down, that's impossible with the current data source V2.
Fortunately, we have a solution for this problem. On the Spark side, we actually do have a fully specified set of arguments waiting to be pushed down, but Spark doesn't know which is the best order to push them into the data source. Spark can try every combination and ask the data source to report a cost, then Spark can pick the combination with the lowest cost. This can also be implemented as a cost-report interface, so that an advanced data source can implement it for optimal performance, and a simple data source doesn't need to care about it and can stay simple.
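
A rough sketch of what such a cost-report hook could look like (the names and shape here are hypothetical):

// Hypothetical mix-in: Spark proposes a combination of pushdowns (in the order it
// would apply them) and the data source replies with an estimated cost, so Spark
// can probe several orderings and keep the cheapest. Simple sources skip it entirely.
interface Pushdown {}  // stand-in for a filter, sort, limit, etc.

interface SupportsCostReport {
    // Estimated cost of scanning with the given pushdowns applied; lower is better.
    double estimateCost(java.util.List<Pushdown> proposedPushdowns);
}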


The current design is very friendly to simple data sources, and has the potential to support complex data sources; I prefer the current design over the plan push-down one. What do you think?


On Wed, Aug 30, 2017 at 5:53 AM, James Baker <j.baker@outlook.com> wrote:
Yeah, for sure.

With the stable representation - agree that in the general case this is pretty intractable; it restricts the modifications that you can do in the future too much. That said, it shouldn't be as hard if you restrict yourself to the parts of the plan which are supported by the datasources V2 API (which, after all, need to be translatable properly into the future to support the mixins proposed). This should have a pretty small scope in comparison. As long as the user can bail out of nodes they don't understand, they should be ok, right?

That said, what would also be fine for us is a place to plug into an unstable query plan.

Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments (of which I can choose to ignore some), and return the information as to which of them I'm ignoring. Taking a query plan of sorts is a way of doing this which IMO is intuitive to the user. It also provides a convenient location to plug in things like stats. Not at all married to the idea of using a query plan here; it just seemed convenient.

Regarding the users who just want to be able to pump data into Spark, my understanding is that replacing isolated nodes in a query plan is easy. That said, our goal here is to be able to push down as much as possible into the underlying datastore.

To your second question:

The issue is that if you build up pushdowns incrementally and not all at once, you end up having to reject pushdowns and filters that you actually can do, which unnecessarily increases overheads.

For example, the dataset

a b c
1 2 3
1 3 3
1 3 4
2 1 1
2 0 1

can efficiently push down sort(b, c) if I have already applied the filter a = 1, but otherwise will force a sort in Spark. On the PR I detail a case I see where I can push down two equality filters iff I am given them at the same time, whilst not being able to push them down one at a time.

There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

The pattern of not considering things that can be done in bulk bites us in other ways. The retrieval methods end up being trickier to implement than is necessary because frequently a single operation provides the result of many of the getters, but the state is mutable, so you end up with odd caches.

For example, the work I need to do to answer unhandledFilters in V1 is roughly the same as the work I need to do to buildScan, so I want to cache it. This means that I end up with code that looks like:

public final class CachingFoo implements Foo {
    private final Foo delegate;

    // Cache key: the last filter list Spark asked about.
    private List<Filter> currentFilters = emptyList();
    // Lazily computed (and memoized) result for that filter list.
    private Supplier<Bar> barSupplier = newSupplier(currentFilters);

    public CachingFoo(Foo delegate) {
        this.delegate = delegate;
    }

    private Supplier<Bar> newSupplier(List<Filter> filters) {
        return Suppliers.memoize(() -> delegate.computeBar(filters));
    }

    @Override
    public Bar computeBar(List<Filter> filters) {
        // Invalidate the cached supplier whenever the filters change.
        if (!filters.equals(currentFilters)) {
            currentFilters = filters;
            barSupplier = newSupplier(filters);
        }

        return barSupplier.get();
    }
}

which caches the result required in unhandledFilters on the expectation that Spark will call buildScan afterwards and get to use the result.

This kind of cache becomes more prominent, but harder to deal with, in the new APIs. As one example here, the state I will need in order to compute accurate column stats internally will likely be a subset of the work required in order to get the read tasks, tell you if I can handle filters, etc., so I'll want to cache them for reuse. However, the cached information needs to be appropriately invalidated when I add a new filter or sort order or limit, and this makes implementing the APIs harder and more error-prone.

One thing that'd be great is a defined contract of the order in which Spark calls the methods on your datasource (ideally this contract could be implied by the way the Java class structure works, but otherwise I can just throw).

James

On Tue, 29 Aug 2017 at 02:56 Reynold Xin <rxin@databricks.com> wrote:
James,

Thanks for the comment. I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility and evolvability. For the max expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be pushed down.

The downsides to that (full query plan push down) are:

1. It is extremely difficult to design a stable representation for logical / physical plans. It is doable, but we'd be the first to do it. I'm not aware of any mainstream database being able to do that in the past. The design of that API itself, to make sure we have a good story for backward and forward compatibility, would probably take months if not years. It might still be good to do, or to offer an experimental trait without compatibility guarantees that uses the current Catalyst internal logical plan.

2. Most data source developers simply want a way to offer some data, without any pushdown. Having to understand query plans is a burden rather than a gift.


Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support for broader pushdown and as a result it is harder to implement. That's how it is supposed to be. If a data source simply implements one of the traits, it'd be logically identical to v1. I don't see why it would be worse or better, other than that v2 provides much stronger forward compatibility guarantees than v1.


On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.baker@outlook.com> wrote:
Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):

Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API which are made worse in V2.

The general conclusion I've come to here is that this is very hard to actually implement (in a similar but more aggressive way than DataSource V1, because of the extra methods and dimensions we get in V2).

In DataSources V1 PrunedFilteredScan, the issue is that you are passed in the filters with the buildScan method, and then passed in again with the unhandledFilters method.
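
(For reference, the rough shape of those two V1 hooks, paraphrased in Java - the real API is a Scala trait, so these are not the exact signatures:)

// Paraphrased shape only, not the actual Spark signatures. The point is that the
// same filter list arrives twice, so the work to decide what is "unhandled" tends
// to be repeated in buildScan unless you cache it yourself.
interface Filter {}  // stand-in for org.apache.spark.sql.sources.Filter
interface Scan {}    // stand-in for the RDD of rows that buildScan produces

interface PrunedFilteredScanShape {
    // Return the filters the data source will NOT evaluate itself.
    Filter[] unhandledFilters(Filter[] filters);

    // Build the scan, given the required columns and the same filters again.
    Scan buildScan(String[] requiredColumns, Filter[] filters);
}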

However, the filters that you can't handle might be data dependent, which the current API does not handle well. Suppose I can handle filter A some of the time, and filter B some of the time. If I'm passed in both, then either A and B are unhandled, or A, or B, or neither. The work I have to do to work this out is essentially the same as I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. However, here we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually - and these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change my results because they might change later.

Really what would be good here is to pass all of the filters and sorts etc. all at once, and then I return the parts I can't handle.

I'd prefer in general that this be implemented by passing some kind of query plan to the datasource which enables this kind of replacement. Explicitly, I don't want to give the whole query plan - that sounds painful - I'd prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing - I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it.
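
Something like the following is the shape I mean (a sketch only, names invented):

// Sketch of a "union type" bundling everything Spark supports pushing down. Spark
// hands one of these over in a single call; the datasource transforms it and returns
// the parts it cannot handle, which Spark then evaluates itself.
final class PushdownSet {
    java.util.List<Filter> filters = new java.util.ArrayList<>();
    java.util.List<SortOrder> sortOrders = new java.util.ArrayList<>();
    java.util.OptionalLong limit = java.util.OptionalLong.empty();
}

interface Filter {}     // stand-ins so the sketch is self-contained
interface SortOrder {}

interface BatchPushdownScan {
    // Everything arrives at once; the return value is the remainder Spark must handle.
    PushdownSet pushDown(PushdownSet requested);
}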

I think this ends up being a more elegant API for consumers, and also far more intuitive.

James


On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1987@gmail.com> wrote:
+1 (Non-binding)

Xiao Li <gatorsmile@gmail.com> wrote on Mon, Aug 28, 2017 at 5:38 PM:
+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger <cody@koeninger.org>:
Just wanted to point out that because the jira isn't labeled SPIP, it won't have shown up linked from

http://spark.apache.org/improvement-proposals= .html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0fan@gmail.com> wrote:
> Hi all,
>
> It has been almost 2 weeks since I proposed the data source V2 for
> discussion, and we already got some feedback on the JIRA ticket and the
> prototype PR, so I'd like to call for a vote.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> Note that, this vote should focus on high-level design/framework, not
> specified APIs, as we can always change/improve specified APIs during
> development.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org








--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix
