spark-dev mailing list archives

From vaquar khan <vaquar.k...@gmail.com>
Subject Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
Date Sun, 10 Sep 2017 15:15:27 GMT
+1

Regards,
Vaquar khan

On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbplmp@live.com> wrote:

> +1
> ------------------------------
> *From:* wangzhenhua (G) <wangzhenhua@huawei.com>
> *Sent:* Friday, September 8, 2017 2:20:07 AM
> *To:* Dongjoon Hyun; 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
>
> +1 (non-binding)  Great to see data source API is going to be improved!
>
>
>
> best regards,
>
> -Zhenhua(Xander)
>
>
>
> *From:* Dongjoon Hyun [mailto:dongjoon.hyun@gmail.com]
> *Sent:* September 8, 2017 4:07
> *To:* 蒋星博
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>
>
>
> +1 (non-binding).
>
>
>
> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1987@gmail.com> wrote:
>
> +1
>
>
>
>
>
> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <rxin@databricks.com> wrote:
>
> +1 as well
>
>
>
> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <michael@databricks.com>
> wrote:
>
> +1
>
>
>
> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rblue@netflix.com.invalid>
> wrote:
>
> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published though.
>
> > Right now it feels like this SPIP is focused more on getting the basics
> > right for what many datasources are already doing in API V1 combined with
> > other private APIs, vs pushing forward state of the art for performance.
>
> I think that’s the right approach for this SPIP. We can add the support
> you’re talking about later with a more specific plan that doesn’t block
> fixing the problems that this addresses.
>
>
>
>
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
> hvanhovell@databricks.com> wrote:
>
> +1 (binding)
>
>
>
> I personally believe that there is quite a big difference between having a
> generic data source interface with a low surface area and pushing down a
> significant part of query processing into a datasource. The latter has a
> much wider surface area and will require us to stabilize most of the
> internal Catalyst APIs, which will be a significant burden on the community
> to maintain and has the potential to slow development velocity
> significantly. If you want to write such integrations then you should be
> prepared to work with Catalyst internals and own up to the fact that things
> might change across minor versions (and in some cases even maintenance
> releases). If you are willing to go down that road, then your best bet is
> to use the already existing Spark session extensions, which will allow you
> to write such integrations and can be used as an `escape hatch`.
>
>
>
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <andrew@andrewash.com> wrote:
>
> +0 (non-binding)
>
>
>
> I think there are benefits to unifying all the Spark-internal datasources
> into a common public API for sure.  It will serve as a forcing function to
> ensure that those internal datasources aren't advantaged vs datasources
> developed externally as plugins to Spark, and that all Spark features are
> available to all datasources.
>
>
>
> But I also think this read-path proposal avoids the more difficult
> questions around how to continue pushing datasource performance forwards.
> James Baker (my colleague) had a number of questions about advanced
> pushdowns (combined sorting and filtering), and Reynold also noted that
> pushdown of aggregates and joins is desirable on longer timeframes as
> well.  The Spark community saw similar requests, for aggregate pushdown in
> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
> in SPARK-12449.  Clearly a number of people are interested in this kind of
> performance work for datasources.
>
>
>
> To leave enough space for datasource developers to continue experimenting
> with advanced interactions between Spark and their datasources, I'd propose
> we leave some sort of escape valve that enables these datasources to keep
> pushing the boundaries without forking Spark.  Possibly that looks like an
> additional unsupported/unstable interface that pushes down an entire
> (unstable API) logical plan, which is expected to break API on every
> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
> ignores it and continues on with the rest of the V2 API for
> compatibility).  Or maybe it looks like something else that we don't know
> of yet.  Possibly this falls outside of the desired goals for the V2 API
> and instead should be a separate SPIP.
>
>
>
> If we had a plan for this kind of escape valve for advanced datasource
> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
> focused more on getting the basics right for what many datasources are
> already doing in API V1 combined with other private APIs, vs pushing
> forward state of the art for performance.
>
>
>
> Andrew
>
>
>
> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
> suresh.thalamati@gmail.com> wrote:
>
> +1 (non-binding)
>
>
>
>
>
> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0fan@gmail.com> wrote:
>
>
>
> Hi all,
>
>
>
> In the previous discussion, we decided to split the read and write path of
> data source v2 into 2 SPIPs, and I'm sending this email to call a vote for
> Data Source V2 read path only.
>
>
>
> The full document of the Data Source API V2 is:
>
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
>
>
> The ready-for-review PR that implements the basic infrastructure for the
> read path is:
>
> https://github.com/apache/spark/pull/19136
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
>
>
> +1: Yeah, let's go forward and implement the SPIP.
>
> +0: Don't really care.
>
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
>
>
> Thanks!
>
>
>
>
>
>
>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhovell@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
>
>
>
