From: Wenchen Fan
Date: Thu, 7 Sep 2017 10:32:44 +0800
Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2
To: Ryan Blue
Cc: Reynold Xin, James Baker, Spark dev list

Hi Ryan,

Yea, I agree with you that we should discuss some substantial details during the vote, and I addressed your comments about the schema inference API in my new PR; please take a look.

I've also called a new vote for the read path; please vote there, thanks!

On Thu, Sep 7, 2017 at 7:55 AM, Ryan Blue <rblue@netflix.com> wrote:
I'm all for keeping this moving and not getting too far into the details (like naming), but I think the substantial details should be clarified first since they are in the proposal that's being voted on.

I would prefer moving the write side to a separate SPIP, too, since there isn't much detail in the proposal and I think we should be more deliberate with things like schema evolution.

<= div class=3D"gmail_extra">
On Thu, Aug 31, 2017 at 10:33 AM, Wenchen Fan <cloud0fan@gmail.com> wrote:
Hi Ryan,

I think for a SPIP, we should not worry too much about details, as we can discuss them during PR review after the vote passes.

I think we should focus more on the overall design, like James did. The interface mix-in vs. plan push-down discussion was great; I hope we can reach a consensus on this topic soon. The current proposal is that we keep the interface mix-in framework and add an unstable plan push-down trait.
For details like interface names, sort push-down vs. sort propagate, etc., I think they should not block the vote, as they can be updated/improved within the current interface mix-in framework.

About separating the read/write proposals, we should definitely send individual PRs for read/write when developing data source v2. I'm also OK with voting on the read side first. The write side is way simpler than the read side; I think it's more important to get agreement on the read side first.

BTW, I do appreciate your feedback/comments on the prototype; let's keep the discussion there. In the meantime, let's have more discussion on the overall framework and drive this project together.

Wenchen


On Thu, Aug 31, 2017 at 6:22 AM, Ryan Blue <rblue@netflix.com> wrote:
Maybe I'm missing something, but the high-level proposal consists of: Goals, Non-Goals, and Proposed API. What is there to discuss other than the details of the API that's being proposed? I think the goals make sense, but goals alone aren't enough to approve a SPIP.

On Wed, Aug 30, 2017 at 2:46 PM, Reynold Xin <rxin@databricks.com> wrote:
So we seem to be getting into a cycle of discussing more about the details of APIs than the high-level proposal. The details of APIs are important to debate, but those belong more in code reviews.

One other important thing is that we should avoid API design by committee. While it is extremely useful to get feedback and understand the use cases, we cannot do API design by incorporating verbatim the union of everybody's feedback. API design is largely a tradeoff game. The most expressive API would also be harder to use, or sacrifice backward/forward compatibility. It is as important to decide what to exclude as what to include.

Unlike the v1 API, the way Wenchen's high-level V2 framework is proposed makes it very easy to add new features (e.g. clustering properties) in the future without breaking any APIs. I'd rather we ship something useful that might not be the most comprehensive set than debate every single feature we should add and then create something super complicated that has unclear value.



On Wed, Aug 30, 2017 at 6:37 PM, Ryan Blue <rblue@netflix.com> wrote:
<= div class=3D"m_-1187179626336977084m_4021611465736319531m_81307449686579095= 05m_6320936559034470976m_8249949939977008092markdown-here-wrapper">

-1 (non-binding)

Sometimes it takes a VOTE thread to get people to actually read and comment, so thanks for starting this one... but there's still discussion happening on the prototype API, and it hasn't been updated yet. I'd like to see the proposal shaped by the ongoing discussion so that we have a better, more concrete plan. I think that's going to produce a better SPIP.

The second reason for -1 is that I think the read- and write-side proposals should be separated. The PR currently has "write path" listed as a TODO item and most of the discussion I've seen is on the read side. I think it would be better to separate the read and write APIs so we can focus on them individually.

An example of why we should focus on the write path separately is that the proposal says this:

Ideally partitioning/bucketing concept should not be exposed in the Data Source API V2, because they are just techniques for data skipping and pre-partitioning. However, these 2 concepts are already widely used in Spark, e.g. DataFrameWriter.partitionBy and DDL syntax like ADD PARTITION. To be consistent, we need to add partitioning/bucketing to Data Source V2 . . .

Essentially, some of the APIs mix DDL and DML operations. I'd like to consider ways to fix that problem instead of carrying the problem forward to Data Source V2. We can solve this by adding a high-level API for DDL and a better write/insert API that works well with it. Clearly, that discussion is independent of the read path, which is why I think separating the two proposals would be a win.
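
As a purely illustrative sketch of that DDL/DML split (all names here are hypothetical, not a concrete proposal):

// Illustrative only: keep table-level (DDL) concerns separate from the write path (DML).
interface TableCatalog {
    // DDL: partitioning/bucketing are declared when the table is created, not on every write.
    void createTable(String name, Schema schema, java.util.List<String> partitionColumns);
}

interface WriteSupport {
    // DML: the write/insert path only appends or overwrites data in an existing table.
    void insert(String table, DataBatch data, boolean overwrite);
}

// Stand-ins so the sketch is self-contained; not real Spark types.
interface Schema {}
interface DataBatch {}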

rb


On Wed, Aug 30, 2017 at 4:28 AM, Reynold Xin <rxin@databricks.com> wrote:
That might be good to do, but that seems orthogonal to this effort itself. It would be a completely different interface.

On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan <cloud0fan@gmail.com> wrote:
OK, I agree with it. How about we add a new interface to push down the query plan, based on the current framework? We can mark the query-plan-push-down interface as unstable, to save the effort of designing a stable representation of the query plan and maintaining forward compatibility.
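
As a rough sketch of what that could look like (the names are hypothetical, and "LogicalPlan" stands in for Catalyst's internal, non-stable plan class):

// Hypothetical, explicitly unstable mix-in.
interface LogicalPlan {}  // stand-in for Catalyst's internal plan type, no compatibility guarantee

interface SupportsPlanPushDown {
    // Spark hands over the plan fragment it is willing to push down; the data source
    // returns whatever part it cannot handle, and Spark evaluates that remainder itself.
    LogicalPlan pushDown(LogicalPlan fragment);
}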

On Wed, Aug 30, 2017 at 10:53 AM, James Baker <j.baker@outlook.com> wrote:
I'll just focus on the one-by-one thing for now - it's the thing that blocks me the most.

I think the place where we're most confused here is on the cost of determining whether I can push down a filter. For me, in order to work out whether I can push down a filter or satisfy a sort, I might have to read plenty of data. That said, it's worth me doing this because I can use this information to avoid reading that much data.

If you give me all the orderings, I will have to read that data many times (we stream it to avoid keeping it in memory).

There's also a thing where our typical use cases have many filters (20+ is common). So, it's likely not going to work to pass us all the combinations. That said, if I can tell you a cost and I know what optimal looks like, why can't I just pick that myself?

The current design is friendly to simple datasources, but does not have the potential to support this.

So the main problem we have with datasources v1 is that it's essentially impossible to leverage a bunch of Spark features - I don't get to use bucketing or row batches or all the nice things that I really want to use to get decent performance. Provided I can leverage these in a moderately supported way which won't break in any given commit, I'll be pretty happy with anything that lets me opt out of the restrictions.

My suggestion here is that if you make a mode which works well for complicated use cases, you end up being able to write the simple mode in terms of it very easily. So we could actually provide two APIs: one that lets people who have more interesting datasources leverage the cool Spark features, and one that lets people who just want to implement basic features do that - I'd try to include some kind of layering here. I could probably sketch out something here if that'd be useful?
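
Roughly the kind of layering I mean, as a sketch only with made-up names:

// Sketch: the "simple" API is just a default implementation of the "advanced" one,
// so basic datasources implement almost nothing while advanced ones get full control.
interface Pushdown {}  // stand-in for anything Spark can push down (filter, sort, limit, ...)

interface AdvancedScan {
    // Given everything Spark would like to push down, return the subset NOT handled.
    java.util.List<Pushdown> pushDown(java.util.List<Pushdown> requested);
}

interface SimpleScan extends AdvancedScan {
    // Simple sources handle no pushdowns at all; Spark does the filtering/sorting.
    @Override
    default java.util.List<Pushdown> pushDown(java.util.List<Pushdown> requested) {
        return requested;
    }
}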

James

On Tue, 29 Aug 2017 at 18:59 Wenchen Fan <cloud0fan@gmail.com> wrote:
Hi James,

Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here.

> Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments

The problem with this approach is: 1) if we want to add more arguments in the future, it's really hard to do without changing the existing interface; 2) if a user wants to implement a very simple data source, he has to look at all the arguments and understand them, which may be a burden for him.
I don't have a solution to these 2 problems; comments are welcome.


> There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

This is true; the current framework applies push-downs one by one, incrementally. If a data source wants to go back and accept a sort push-down after it accepts a filter push-down, that's impossible with the current data source V2.
Fortunately, we have a solution for this problem. On the Spark side, we actually do have a fully specified set of arguments waiting to be pushed down, but Spark doesn't know which is the best order to push them into the data source. Spark can try every combination and ask the data source to report a cost, then Spark can pick the combination with the lowest cost. This can also be implemented as a cost-report interface, so that an advanced data source can implement it for optimal performance, and a simple data source doesn't need to care about it and can stay simple.
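
A rough sketch of what such a cost-report hook could look like (the names and shape here are hypothetical):

// Hypothetical mix-in: Spark proposes a combination of pushdowns (in the order it
// would apply them) and the data source replies with an estimated cost, so Spark
// can probe several orderings and keep the cheapest. Simple sources skip it entirely.
interface Pushdown {}  // stand-in for a filter, sort, limit, etc.

interface SupportsCostReport {
    // Estimated cost of scanning with the given pushdowns applied; lower is better.
    double estimateCost(java.util.List<Pushdown> proposedPushdowns);
}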


The current design is very friendly to simple data sources, and has the potential to support complex data sources; I prefer the current design over the plan push-down one. What do you think?


On Wed, Aug 30, 2017 at 5:53 AM, James Baker <j.baker@outlook.com> wrote:
Yeah, for sure.

With the stable representation - agree that in the general case this is pretty intractable; it restricts the modifications that you can do in the future too much. That said, it shouldn't be as hard if you restrict yourself to the parts of the plan which are supported by the datasources V2 API (which, after all, need to be translatable properly into the future to support the mixins proposed). This should have a pretty small scope in comparison. As long as the user can bail out of nodes they don't understand, they should be ok, right?

That said, what would also be fine for us is a place to plug into an unstable query plan.

Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments (of which I can choose to ignore some), and return the information as to which of them I'm ignoring. Taking a query plan of sorts is a way of doing this which IMO is intuitive to the user. It also provides a convenient location to plug in things like stats. Not at all married to the idea of using a query plan here; it just seemed convenient.

Regarding the users who just want to be able to pump data into Spark, my understanding is that replacing isolated nodes in a query plan is easy. That said, our goal here is to be able to push down as much as possible into the underlying datastore.

To your second question:

The issue is that if you build up pushdowns incrementally and not all at once, you end up having to reject pushdowns and filters that you actually can do, which unnecessarily increases overheads.

For example, the dataset

a b c
1 2 3
1 3 3
1 3 4
2 1 1
2 0 1

can efficiently push down sort(b, c) if I have already applied the filter a = 1, but otherwise will force a sort in Spark. On the PR I detail a case I see where I can push down two equality filters iff I am given them at the same time, whilst not being able to push them down one at a time.

There are loads of cases like this - you can imagine someone being able to push down a sort before a filter is applied, but not afterwards. However, maybe the filter is so selective that it's better to push down the filter and not handle the sort. I don't get to make this decision, Spark does (but doesn't have good enough information to do it properly, whilst I do). I want to be able to choose the parts I push down given knowledge of my datasource - as defined the APIs don't let me do that, they're strictly more restrictive than the V1 APIs in this way.

The pattern of not considering things that can be done in bulk bites us in other ways. The retrieval methods end up being trickier to implement than is necessary because frequently a single operation provides the result of many of the getters, but the state is mutable, so you end up with odd caches.

For example, the work I need to do to answer unhandledFilters in V1 is roughly the same as the work I need to do to buildScan, so I want to cache it. This means that I end up with code that looks like:

public final class CachingFoo implements Foo {
    private final Foo delegate;

    // Cache key: the last filter list Spark asked about.
    private List<Filter> currentFilters = emptyList();
    // Lazily computed (and memoized) result for that filter list.
    private Supplier<Bar> barSupplier = newSupplier(currentFilters);

    public CachingFoo(Foo delegate) {
        this.delegate = delegate;
    }

    private Supplier<Bar> newSupplier(List<Filter> filters) {
        return Suppliers.memoize(() -> delegate.computeBar(filters));
    }

    @Override
    public Bar computeBar(List<Filter> filters) {
        // Invalidate the cached supplier whenever the filters change.
        if (!filters.equals(currentFilters)) {
            currentFilters = filters;
            barSupplier = newSupplier(filters);
        }

        return barSupplier.get();
    }
}

which caches the result required in unhandledFilters on the expectation that Spark will call buildScan afterwards and get to use the result.

This kind of cache becomes more prominent, but harder to deal with, in the new APIs. As one example here, the state I will need in order to compute accurate column stats internally will likely be a subset of the work required in order to get the read tasks, tell you if I can handle filters, etc., so I'll want to cache them for reuse. However, the cached information needs to be appropriately invalidated when I add a new filter or sort order or limit, and this makes implementing the APIs harder and more error-prone.

One thing that'd be great is a defined contract of the order in which Spark calls the methods on your datasource (ideally this contract could be implied by the way the Java class structure works, but otherwise I can just throw).

James

On Tue, 29 Aug 2017 at 02:56 Reynold Xin <rxin@databricks.com> wrote:
James,

Thanks for the comment. I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility and evolvability. For the max expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be pushed down.

The downsides to that (full query plan push down) are:

1. It is extremely difficult to design a stable representation for logical / physical plans. It is doable, but we'd be the first to do it. I'm not aware of any mainstream database being able to do that in the past. The design of that API itself, to make sure we have a good story for backward and forward compatibility, would probably take months if not years. It might still be good to do, or to offer an experimental trait without compatibility guarantees that uses the current Catalyst internal logical plan.

2. Most data source developers simply want a way to offer some data, without any pushdown. Having to understand query plans is a burden rather than a gift.


Re: your point about the proposed v2 being worse than v1 for your use case.

Can you say more? You used the argument that in v2 there is more support for broader pushdown and as a result it is harder to implement. That's how it is supposed to be. If a data source simply implements one of the traits, it'd be logically identical to v1. I don't see why it would be worse or better, other than that v2 provides much stronger forward compatibility guarantees than v1.


On Tue, Aug 29, 2017 at 4:54 AM, James Baker <j.baker@outlook.com> wrote:
Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745):

Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API which are made worse in V2.

The general conclusion I've come to here is that this is very hard to actually implement (in a similar but more aggressive way than DataSource V1, because of the extra methods and dimensions we get in V2).

In DataSources V1 PrunedFilteredScan, the issue is that you are passed in the filters with the buildScan method, and then passed in again with the unhandledFilters method.
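
(For reference, the rough shape of those two V1 hooks, paraphrased in Java - the real API is a Scala trait, so these are not the exact signatures:)

// Paraphrased shape only, not the actual Spark signatures. The point is that the
// same filter list arrives twice, so the work to decide what is "unhandled" tends
// to be repeated in buildScan unless you cache it yourself.
interface Filter {}  // stand-in for org.apache.spark.sql.sources.Filter
interface Scan {}    // stand-in for the RDD of rows that buildScan produces

interface PrunedFilteredScanShape {
    // Return the filters the data source will NOT evaluate itself.
    Filter[] unhandledFilters(Filter[] filters);

    // Build the scan, given the required columns and the same filters again.
    Scan buildScan(String[] requiredColumns, Filter[] filters);
}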

However, the filters that you can't handle might be data dependent, which the current API does not handle well. Suppose I can handle filter A some of the time, and filter B some of the time. If I'm passed in both, then either A and B are unhandled, or A, or B, or neither. The work I have to do to work this out is essentially the same as I have to do while actually generating my RDD (essentially I have to generate my partitions), so I end up doing some weird caching work.

This V2 API proposal has the same issues, but perhaps more so. In PrunedFilteredScan, there is essentially one degree of freedom for pruning (filters), so you just have to implement caching between unhandledFilters and buildScan. However, here we have many degrees of freedom: sorts, individual filters, clustering, sampling, maybe aggregations eventually - and these operations are not all commutative, and computing my support one-by-one can easily end up being more expensive than computing it all in one go.

For some trivial examples:

- After filtering, I might be sorted, whilst before filtering I might not be.

- Filtering with certain filters might affect my ability to push down others.

- Filtering with aggregations (as mooted) might not be possible to push down.

And with the API as currently mooted, I need to be able to go back and change my results because they might change later.

Really what would be good here is to pass all of the filters and sorts etc. all at once, and then I return the parts I can't handle.

I'd prefer in general that this be implemented by passing some kind of query plan to the datasource which enables this kind of replacement. Explicitly, I don't want to give the whole query plan - that sounds painful - I'd prefer we push down only the parts of the query plan we deem to be stable. With the mix-in approach, I don't think we can guarantee the properties we want without a two-phase thing - I'd really love to be able to just define a straightforward union type which is our supported pushdown stuff, and then the user can transform and return it.
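
Something like the following is the shape I mean (a sketch only, names invented):

// Sketch of a "union type" bundling everything Spark supports pushing down. Spark
// hands one of these over in a single call; the datasource transforms it and returns
// the parts it cannot handle, which Spark then evaluates itself.
final class PushdownSet {
    java.util.List<Filter> filters = new java.util.ArrayList<>();
    java.util.List<SortOrder> sortOrders = new java.util.ArrayList<>();
    java.util.OptionalLong limit = java.util.OptionalLong.empty();
}

interface Filter {}     // stand-ins so the sketch is self-contained
interface SortOrder {}

interface BatchPushdownScan {
    // Everything arrives at once; the return value is the remainder Spark must handle.
    PushdownSet pushDown(PushdownSet requested);
}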

I think this ends up being a more elegant API for consumers, and also far more intuitive.

James


On Mon, 28 Aug 2017 at 18:00 蒋星博 <jiangxb1987@gmail.com> wrote:
+1 (Non-binding)

Xiao Li <gatorsmile@gmail.com> wrote on Mon, Aug 28, 2017 at 5:38 PM:
+1

2017-08-28 12:45 GMT-07:00 Cody Koeninger <cody@koeninger.org>:
Just wanted to point out that because the jira isn't labeled SPIP, it won't have shown up linked from

http://spark.apache.org/improvement-proposals= .html

On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan <cloud0fan@gmail.com> wrote:
> Hi all,
>
> It has been almost 2 weeks since I proposed the data source V2 for
> discussion, and we already got some feedback on the JIRA ticket and the
> prototype PR, so I'd like to call for a vote.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> Note that, this vote should focus on high-level design/framework, not
> specified APIs, as we can always change/improve specified APIs during
> development.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org








--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix
