spark-reviews mailing list archives

From cloud-fan <...@git.apache.org>
Subject [GitHub] spark issue #20387: [SPARK-23203][SPARK-23204][SQL]: DataSourceV2: Use immut...
Date Mon, 29 Jan 2018 19:15:35 GMT
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/20387
  
    I dug into the commit history and recalled why I made these decisions:
    
    * having a mutable `DataSourceV2Relation`. This is mostly to avoid having to keep adding
constructor parameters to `DataSourceV2Relation`, which keeps the code easier to maintain. I'm
OK with making it immutable if there is a significant benefit.
    * not using `PhysicalOperation`. This is because we will add more push-down optimizations
(e.g. limit, aggregate, join), and we have a specific push-down order for them. It's hard to
extend `PhysicalOperation` to support more operators and a specific push-down order, so I created
a new rule (see the sketch after this list). Eventually all data sources will be implemented as
data source v2, so `PhysicalOperation` will go away.
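
    To make the ordering point concrete, here is a rough, hypothetical sketch of such a dedicated
push-down rule. The `Reader` trait and its method names are illustrative assumptions for this
sketch (not the real DataSourceV2 mix-in interfaces), and the specific order shown (filters, then
column pruning, then limit) is just one plausible ordering.

```scala
// Hypothetical sketch of a push-down rule with a fixed operator order.
// The Reader trait here is illustrative, not the actual DataSourceV2 API.
object PushDownOrderSketch {

  trait Reader {
    /** Push filters to the source; returns the filters the source could NOT handle. */
    def pushFilters(filters: Seq[String]): Seq[String]
    /** Tell the source which columns the query actually needs. */
    def pruneColumns(columns: Seq[String]): Unit
    /** Ask the source to apply a limit; returns true if the source applied it. */
    def pushLimit(limit: Int): Boolean
  }

  /** Applies push-down in a fixed order: filters, then column pruning, then limit. */
  def pushDown(
      reader: Reader,
      filters: Seq[String],
      requiredColumns: Seq[String],
      limit: Option[Int]): Seq[String] = {
    // 1. Filters first, so later steps know which predicates the source evaluates itself.
    val postScanFilters = reader.pushFilters(filters)
    // 2. Then column pruning, based on what the plan above the scan still needs.
    reader.pruneColumns(requiredColumns)
    // 3. Finally operators such as limit (and later aggregate/join).
    limit.foreach(reader.pushLimit)
    // Filters the source rejected stay in the query plan above the scan.
    postScanFilters
  }
}
```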
    
    
    > The output of DataSourceV2Relation should be what is returned by the reader, in case
the reader can only partially satisfy the requested schema projection
    
    Good catch! Since `DataSourceV2Reader` is mutable, the output can't be a fixed value, as it
may change when we apply data source optimizations. Using `lazy val output ...` can fix this.
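
    A minimal sketch of that fix, assuming a relation that wraps the mutable reader (the names
below are simplified for illustration; the real relation's `output` is a sequence of Catalyst
attributes rather than plain strings):

```scala
import org.apache.spark.sql.types.StructType

// Stand-in for the mutable reader; readSchema() reflects any pruning pushed into it.
trait MutableReaderSketch {
  def readSchema(): StructType
}

case class DataSourceV2RelationSketch(reader: MutableReaderSketch) {
  // Evaluated only on first access, i.e. after the push-down rules have mutated the
  // reader, so it matches the schema the reader will actually return.
  lazy val output: Seq[String] = reader.readSchema().fieldNames.toSeq
}
```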
    
    
    > The requested projection passed to the DataSourceV2Reader should include filter columns
    
    I did this intentionally. If a column is only referred to by pushed filters, Spark doesn't
need this column. Even if we requested this column from the data source, we would just read it
out and wait for it to be pruned by the next operator.
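
    A small illustration of that argument, with hypothetical column names:

```scala
// Columns needed by the query result vs. columns referenced only by filters.
val outputColumns = Seq("name", "age")       // needed by the plan above the scan
val filterColumns = Seq("age", "country")    // referenced by the pushed filters
val filtersFullyPushed = true                // the source accepted all the filters

// If the filters are fully pushed, "country" never needs to be read at all;
// otherwise Spark must read it just to evaluate the filter and then drop it.
val requestedProjection =
  if (filtersFullyPushed) outputColumns
  else (outputColumns ++ filterColumns).distinct
```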
    
    
    > The push-down rule may be run more than once if filters are not pushed through projections
    
    This looks weird; do you have a query that reproduces this issue?
    
    
    > This updates DataFrameReader to parse locations that do not look like paths as table
names and pass the result as "database" and "table" keys in v2 options.
    
    Personally I'd suggest using `spark.read.format("iceberg").option("table", "db.table").load()`,
since `load` is defined as `def load(paths: String*)`, but I think your usage looks better. The
communication protocol between Spark and a data source is the options map, so I'd suggest that we
just propagate the `paths` parameter into the options, and data source implementations are free to
interpret the path option however they want, e.g. as table and database names.
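
    For illustration, here is a sketch of how a data source implementation might interpret the
forwarded option as a table identifier. The `path`/`table` option keys and the parsing below are
assumptions for this sketch, not a defined contract between Spark and data sources.

```scala
// Resolve a (database, table) pair from the options map forwarded by Spark.
def resolveTable(options: Map[String, String]): (Option[String], String) = {
  val identifier = options.get("path")          // e.g. the argument of load("db.table")
    .orElse(options.get("table"))               // or an explicit .option("table", ...)
    .getOrElse(sys.error("no table or path specified"))
  identifier.split('.') match {
    case Array(db, table) => (Some(db), table)  // "db.table"
    case Array(table)     => (None, table)      // bare table name
    case _                => sys.error(s"cannot parse table identifier: $identifier")
  }
}

// resolveTable(Map("path" -> "db.table"))  // => (Some("db"), "table")
```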