arrow-dev mailing list archives

From Bobby Evans <reva...@gmail.com>
Subject Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
Date Mon, 22 Apr 2019 18:47:22 GMT
Agreed.

Tom, could you cancel the vote?



On Mon, Apr 22, 2019 at 1:07 PM Reynold Xin <rxin@databricks.com> wrote:

> "if others think it would be helpful, we can cancel this vote, update the
> SPIP to clarify exactly what I am proposing, and then restart the vote
> after we have gotten more agreement on what APIs should be exposed"
>
> That'd be very useful. At least I was confused by what the SPIP was about.
> No point voting on something when there is still a lot of confusion about
> what it is.
>
>
> On Mon, Apr 22, 2019 at 10:58 AM, Bobby Evans <revans2@gmail.com> wrote:
>
>> Xiangrui Meng,
>>
>> I provided some examples in the original discussion thread.
>>
>>
>> https://lists.apache.org/thread.html/f7cdc2cbfb1dafa001422031ff6a3a6dc7b51efc175327b0bbfe620e@%3Cdev.spark.apache.org%3E
>>
>> But the concrete use case that we have is GPU accelerated ETL on Spark.
>> Primarily as data preparation and feature engineering for ML tools like
>> XGBoost, which by the way exposes a Spark-specific Scala API, not just a
>> Python one. We built a proof of concept and saw decent performance gains.
>> Enough gains to more than pay for the added cost of a GPU, with the
>> potential for even better performance in the future. With that proof of
>> concept, we were able to make all of the processing columnar end-to-end for
>> many queries so there really weren't any data conversion costs to overcome,
>> but we did want the design flexible enough to include a cost-based
>> optimizer.
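
The trade-off described above (columnar speedup versus data conversion cost) can be sketched as a simple cost comparison. This is only an illustration, not the proof of concept's optimizer: the `StageCost` type, its fields, and all of the timings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageCost:
    """Hypothetical per-stage cost estimates, in seconds."""
    row_time: float       # time to run the stage row by row
    columnar_time: float  # time to run the stage on columnar batches
    to_columnar: float    # cost of converting input rows to columns (0 if already columnar)
    to_rows: float        # cost of converting output back to rows (0 if the next stage is columnar)


def prefer_columnar(cost: StageCost) -> bool:
    """Choose columnar only when it still wins after paying for both conversions."""
    return cost.columnar_time + cost.to_columnar + cost.to_rows < cost.row_time


# End-to-end columnar query: no conversions, so the speedup is kept in full.
end_to_end = StageCost(row_time=10.0, columnar_time=4.0, to_columnar=0.0, to_rows=0.0)
# Isolated cheap operation: the conversions outweigh the speedup.
isolated = StageCost(row_time=2.0, columnar_time=1.0, to_columnar=0.8, to_rows=0.7)

print(prefer_columnar(end_to_end))
print(prefer_columnar(isolated))
```

With these made-up numbers the end-to-end case favors columnar execution and the isolated case does not, which is the situation a cost-based optimizer would need to distinguish.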
>>
>> It looks like there is some confusion around this SPIP especially in how
>> it relates to features in other SPIPs around data exchange between
>> different systems. I didn't want to update the text of this SPIP while it
>> was under an active vote, but if others think it would be helpful, we can
>> cancel this vote, update the SPIP to clarify exactly what I am proposing,
>> and then restart the vote after we have gotten more agreement on what APIs
>> should be exposed.
>>
>> Thanks,
>>
>> Bobby
>>
>> On Mon, Apr 22, 2019 at 10:49 AM Xiangrui Meng <mengxr@gmail.com> wrote:
>>
>> Per Robert's comment on the JIRA, ETL is the main use case for the SPIP.
>> I think the SPIP should list a concrete ETL use case (from POC?) that can
>> benefit from this *public Java/Scala API*, does *vectorization*, and
>> significantly *boosts the performance* even with data conversion overhead.
>>
>> The current mid-term success (Pandas UDF) doesn't match the purpose of
>> SPIP and it can be done without exposing any public APIs.
>>
>> Depending on how much benefit it brings, we might agree that a public
>> Java/Scala API is needed. Then we might want to step slightly into how. I
>> saw four options mentioned in the JIRA and discussion threads:
>>
>> 1. Expose `Array[Byte]` in Arrow format. Let users decode it using an
>> Arrow library.
>> 2. Expose `ArrowRecordBatch`. It makes Spark expose third-party APIs.
>> 3. Expose `ColumnarBatch` and make it Arrow-compatible, which is also
>> used by Spark internals. It makes it hard for us to change Spark internals
>> in the future.
>> 4. Expose something like `SparkRecordBatch` that is Arrow-compatible and
>> maintain conversion between internal `ColumnarBatch` and
>> `SparkRecordBatch`. It might cause conversion overhead in the future if
>> our internal format diverges from Arrow.
>>
>> Note that both 3 and 4 will make many APIs public to be Arrow compatible.
>> So we should really give concrete ETL cases to prove that it is important
>> for us to do so.
>>
>> On Mon, Apr 22, 2019 at 8:27 AM Tom Graves <tgraves_cs@yahoo.com> wrote:
>>
>> Since there is still discussion and Spark Summit is this week, I'm
>> going to extend the vote until Friday the 26th.
>>
>> Tom
>> On Monday, April 22, 2019, 8:44:00 AM CDT, Bobby Evans <revans2@gmail.com>
>> wrote:
>>
>> Yes, it is technically possible for the layout to change. No, it is not
>> going to happen. It is already baked into several different official
>> libraries which are widely used, not just for holding and processing the
>> data, but also for transfer of the data between the various
>> implementations. There would have to be a really serious reason to force an
>> incompatible change at this point. So in the worst case, we can version the
>> layout and bake that into the API that exposes the internal layout of the
>> data. That way, code that wants to program against a Java API can do so
>> using the API that Spark provides. Those who want to interface with
>> something that expects the data in Arrow format will already have to know
>> what version of the format it was programmed against, and in the worst
>> case, if the layout does change, we can support the new layout if needed.
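
As a rough illustration of "version the layout and bake that into the API", the sketch below tags each exported batch with a layout version and has the consumer reject versions it was not written for. Every name here (`ExportedBatch`, `SUPPORTED_LAYOUT_VERSIONS`, `consume`) is hypothetical and is not part of Spark or Arrow.

```python
from dataclasses import dataclass

# Hypothetical layout versions this consumer was programmed against.
SUPPORTED_LAYOUT_VERSIONS = {1, 2}


@dataclass(frozen=True)
class ExportedBatch:
    """Hypothetical wrapper: raw column bytes plus the layout version that produced them."""
    layout_version: int
    payload: bytes


def consume(batch: ExportedBatch) -> int:
    """Reject layouts this consumer does not understand instead of misreading them."""
    if batch.layout_version not in SUPPORTED_LAYOUT_VERSIONS:
        raise ValueError(f"unsupported layout version {batch.layout_version}")
    # Stand-in for real processing: report how many bytes were accepted.
    return len(batch.payload)


print(consume(ExportedBatch(layout_version=1, payload=b"abcd")))
```

If the layout ever did change, the producer would bump `layout_version`, and consumers could add the new version to their supported set when they are ready.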
>>
>> On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler <cutlerb@gmail.com> wrote:
>>
>> The Arrow data format is not yet stable, meaning there are no guarantees
>> on backwards/forwards compatibility. Once version 1.0 is released, it will
>> have those guarantees but it's hard to say when that will be. The remaining
>> work to get there can be seen at
>>
>> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone.
>> So yes, it is a risk that exposing Spark data as Arrow could cause an issue
>> if handled by a different version that is not compatible. That being said,
>> changes to format are not taken lightly and are backwards compatible when
>> possible. I think it would be fair to mark the APIs exposing Arrow data as
>> experimental for the time being, and clearly state the version that must be
>> used to be compatible in the docs. Also, adding features like this and
>> SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
>> release. Adding the Arrow dev list to CC.
>>
>> Bryan
>>
>> On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia <matei.zaharia@gmail.com>
>> wrote:
>>
>> Okay, that makes sense, but is the Arrow data format stable? If not, we
>> risk breakage when Arrow changes in the future and some libraries using
>> this feature begin to use the new Arrow code.
>>
>> Matei
>>
>> On Apr 20, 2019, at 1:39 PM, Bobby Evans <revans2@gmail.com> wrote:
>>
>> I want to be clear that this SPIP is not proposing exposing Arrow
>> APIs/Classes through any Spark APIs. SPARK-24579 is doing that, and
>> because of the overlap between the two SPIPs I scaled this one back to
>> concentrate just on the columnar processing aspects. Sorry for the
>> confusion as I didn't update the JIRA description clearly enough when we
>> adjusted it during the discussion on the JIRA. As part of the columnar
>> processing, we plan on providing Arrow-formatted data, but that will be
>> exposed through a Spark-owned API.
>>
>> On Sat, Apr 20, 2019 at 1:03 PM Matei Zaharia <matei.zaharia@gmail.com>
>> wrote:
>>
>> FYI, I’d also be concerned about exposing the Arrow API or format as a
>> public API if it’s not yet stable. Is stabilization of the API and format
>> coming soon on the roadmap there? Maybe someone can work with the Arrow
>> community to make that happen.
>>
>> We’ve been bitten lots of times by API changes forced by external
>> libraries even when those were widely popular. For example, we used
>> Guava’s Optional for a while, which changed at some point, and we also had
>> issues with Protobuf and Scala itself (especially how Scala’s APIs appear
>> in Java). API breakage might not be as serious in dynamic languages like
>> Python, where you can often keep compatibility with old behaviors, but it
>> really hurts in Java and Scala.
>>
>> The problem is especially bad for us because of two aspects of how
>> Spark is used:
>>
>> 1) Spark is used for production data transformation jobs that people
>> need to keep running for a long time. Nobody wants to make changes to a
>> job that’s been working fine and computing something correctly for years
>> just to get a bug fix from the latest Spark release or whatever. It’s much
>> better if they can upgrade Spark without editing every job.
>>
>> 2) Spark is often used as “glue” to combine data processing code in
>> other libraries, and these might start to require different versions of
>> our dependencies. For example, the Guava class exposed in Spark became a
>> problem when third-party libraries started requiring a new version of
>> Guava: those new libraries just couldn’t work with Spark. Protobuf was
>> especially bad because some users wanted to read data stored as Protobufs
>> (or in a format that uses Protobuf inside), so they needed a different
>> version of the library in their main data processing code.
>>
>> If there was some guarantee that this stuff would remain
>> backward-compatible, we’d be in a much better place. It’s not that hard
>> to keep a storage format backward-compatible: just document the format and
>> extend it only in ways that don’t break the meaning of old data (for
>> example, add new version numbers or field types that are read in a
>> different way). It’s a bit harder for a Java API, but maybe Spark could
>> just expose byte arrays directly and work on those if the API is not
>> guaranteed to stay stable (that is, we’d still use our own classes to
>> manipulate the data internally, and end users could use the Arrow library
>> if they want it).
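
The point about keeping a storage format backward-compatible (document the format and extend it only through new version numbers or field types) can be illustrated with a toy binary format. This is a minimal sketch, not Arrow's or Spark's actual format: the `DEMO` marker, the header layout, and both versions are invented for the example.

```python
import struct

MAGIC = b"DEMO"    # hypothetical 4-byte file marker
HEADER = "<HI"     # little-endian: 16-bit format version, 32-bit value count


def write_v1(values):
    """Version 1: header, then 32-bit signed integers."""
    return MAGIC + struct.pack(HEADER, 1, len(values)) + struct.pack(f"<{len(values)}i", *values)


def write_v2(values):
    """Version 2: same header, but values are widened to 64-bit signed integers."""
    return MAGIC + struct.pack(HEADER, 2, len(values)) + struct.pack(f"<{len(values)}q", *values)


def read(data):
    """Read any known version; data written as v1 keeps its original meaning."""
    if data[:4] != MAGIC:
        raise ValueError("not a DEMO file")
    version, count = struct.unpack_from(HEADER, data, 4)
    body_offset = 4 + struct.calcsize(HEADER)
    if version == 1:
        return list(struct.unpack_from(f"<{count}i", data, body_offset))
    if version == 2:
        return list(struct.unpack_from(f"<{count}q", data, body_offset))
    raise ValueError(f"unknown format version {version}")


print(read(write_v1([1, -2, 3])))
print(read(write_v2([2**40])))
```

Because the version number sits in a fixed position in the header, a newer reader can keep decoding old files unchanged, while an older reader fails loudly on a version it does not know rather than misinterpreting the bytes.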
>>
>> Matei
>>
>> On Apr 20, 2019, at 8:38 AM, Bobby Evans <revans2@gmail.com> wrote:
>>
>> I think you misunderstood the point of this SPIP. I responded to your
>> comments in the SPIP JIRA.
>>
>> On Sat, Apr 20, 2019 at 12:52 AM Xiangrui Meng <mengxr@gmail.com>
>> wrote:
>>
>> I posted my comment in the JIRA. Main concerns here:
>>
>> 1. Exposing third-party Java APIs in Spark is risky. Arrow might have
>> a 1.0 release someday.
>>
>> 2. ML/DL systems that can benefit from a columnar format are mostly in
>> Python.
>>
>> 3. Simple operations, though they benefit from vectorization, might not be
>> worth the data exchange overhead.
>>
>> So would an improved Pandas UDF API be good enough? For
>> example, SPARK-26412 (UDF that takes an iterator of Arrow batches).
>>
>> Sorry, I should have joined the discussion earlier! Hope it is not too
>> late :)
>>
>> On Fri, Apr 19, 2019 at 1:20 PM <tcondie@gmail.com> wrote:
>> +1 (non-binding) for better columnar data processing support.
>>
>> From: Jules Damji <dmatrix@comcast.net>
>> Sent: Friday, April 19, 2019 12:21 PM
>> To: Bryan Cutler <cutlerb@gmail.com>
>> Cc: Dev <dev@spark.apache.org>
>> Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
>> Columnar Processing Support
>>
>> + (non-binding)
>>
>> Sent from my iPhone
>>
>> Pardon the dumb thumb typos :)
>>
>> On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutlerb@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>> On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jlowe@apache.org> wrote:
>>
>> +1 (non-binding). Looking forward to seeing better support for
>> processing columnar data.
>>
>> Jason
>>
>> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>> <tgraves_cs@yahoo.com.invalid> wrote:
>>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support. The proposal is to extend the
>> support to allow for more columnar processing.
>>
>> You can find the full proposal in the JIRA at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can. I will leave the vote open until
>> next Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>>
>> [ ] +0
>>
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>>
>> Tom Graves
>>
>>
>>
>
