spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] rdblue commented on pull request #29066: [SPARK-23889][SQL] DataSourceV2: required sorting and clustering for writes
Date Mon, 30 Nov 2020 17:54:38 GMT

rdblue commented on pull request #29066:
URL: https://github.com/apache/spark/pull/29066#issuecomment-735942977


   > I am interested in what other devs think and whether we are OK breaking the existing API.
   
   Since the other API is targeted at the read path, I would have no problem adding this one
   in parallel under a `write` package. I think that we should deprecate the read-side
   distribution because it doesn't really help with bucketed joins.
   
   I'm also fine changing the existing API, but I'd rather just deprecate it and remove it
   when we have a replacement for bucketed joins and other read-side optimizations.
   
   > Probably worth to raise a discussion in dev@ mailing list?
   
   Yes. But if we want to get this into 3.1.0, we should start moving on everything in parallel.
   We should start getting the addition of `Write` done because it needs to carry the
   `RequiresDistributionAndSort` interface no matter what we decide about `Distribution`.
   And we can at least get a WIP PR up to add the new distribution interfaces.
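
As a rough illustration of why `Write` can land before the `Distribution` question is settled, here is a minimal, self-contained sketch. Only the names `Write`, `RequiresDistributionAndSort`, and `Distribution` come from the comment above; every method name, the `WriteSketch` class, and the `plan` helper are hypothetical stand-ins and not Spark's actual API.

```java
import java.util.List;

public class WriteSketch {
    // Hypothetical placeholder for whatever distribution type is eventually chosen.
    interface Distribution { List<String> clusteringColumns(); }

    // Hypothetical stand-in for the logical write; the real interface may differ.
    interface Write { String description(); }

    // Mix-in named after the interface in the comment; method names are assumed.
    interface RequiresDistributionAndSort extends Write {
        Distribution requiredDistribution();
        List<String> requiredSortColumns();
    }

    // Planner-like helper: only writes that opt in report a required layout.
    static String plan(Write write) {
        if (write instanceof RequiresDistributionAndSort) {
            RequiresDistributionAndSort r = (RequiresDistributionAndSort) write;
            return "cluster by " + r.requiredDistribution().clusteringColumns()
                + ", sort by " + r.requiredSortColumns();
        }
        return "no required distribution or sort";
    }

    // A write with no layout requirements.
    static Write plainWrite() { return () -> "plain"; }

    // A write that asks for clustering and sorting before rows reach it.
    static Write clusteredWrite() {
        return new RequiresDistributionAndSort() {
            public String description() { return "clustered"; }
            public Distribution requiredDistribution() { return () -> List.of("date"); }
            public List<String> requiredSortColumns() { return List.of("date", "id"); }
        };
    }

    public static void main(String[] args) {
        System.out.println(plan(plainWrite()));
        System.out.println(plan(clusteredWrite()));
    }
}
```

In this sketch the planner depends only on the mix-in, so the shape of `Distribution` could change without touching `Write` itself.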


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


