spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] dbtsai commented on a change in pull request #26751: [SPARK-30107][SQL] Expose nested schema pruning to all V2 sources
Date Wed, 11 Dec 2019 01:16:39 GMT
dbtsai commented on a change in pull request #26751: [SPARK-30107][SQL] Expose nested schema pruning to all V2 sources
URL: https://github.com/apache/spark/pull/26751#discussion_r356361601
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileScanBuilder.scala
 ##########
 @@ -27,15 +27,20 @@ abstract class FileScanBuilder(
     dataSchema: StructType) extends ScanBuilder with SupportsPushDownRequiredColumns {
   private val partitionSchema = fileIndex.partitionSchema
   private val isCaseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
+  protected val supportsNestedSchemaPruning: Boolean = false
   protected var requiredSchema = StructType(dataSchema.fields ++ partitionSchema.fields)
 
   override def pruneColumns(requiredSchema: StructType): Unit = {
+    // [SPARK-30107] While the passed `requiredSchema` always have pruned nested columns, the actual
+    // data schema of this scan is determined in `readDataSchema`. File formats that don't support
+    // nested schema pruning, use `requiredSchema` as a reference and perform the pruning partially.
     this.requiredSchema = requiredSchema
 
 Review comment:
   Okay, I see. For data sources that don't support top-level pruning, we only return the
   required top-level columns in `readDataSchema`. I guess that in this case the reader
   implementations still read the full data and handle it internally, but pass less data to
   Spark. I'm wondering why we can't do something similar in the readers for nested data
   structures.
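
   To make the distinction concrete, here is a minimal sketch of how `readDataSchema` might
   choose between full nested pruning and partial (top-level only) pruning. It is an
   illustration under assumptions, not the code in this PR: `DataType`, `Struct`, `Field` and
   `PruningSketch` are hypothetical stand-ins for Spark's schema types, and partition columns
   and case sensitivity are ignored.

```scala
// Hypothetical, simplified stand-ins for Spark's StructType / StructField.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class Field(name: String, dataType: DataType)
final case class Struct(fields: List[Field]) extends DataType

object PruningSketch {
  // Decide which data schema the reader should produce.
  def readDataSchema(
      dataSchema: Struct,
      requiredSchema: Struct,
      supportsNestedSchemaPruning: Boolean): Struct = {
    if (supportsNestedSchemaPruning) {
      // The format can skip nested fields, so the pruned schema is used as-is.
      requiredSchema
    } else {
      // Partial pruning: use requiredSchema only as a reference for which
      // top-level columns to keep; each kept column retains its full type.
      val requiredNames = requiredSchema.fields.map(_.name).toSet
      Struct(dataSchema.fields.filter(f => requiredNames.contains(f.name)))
    }
  }

  // Example data: a query that only needs `name.first`.
  val fullName: Struct =
    Struct(List(Field("first", StringType), Field("last", StringType)))
  val dataSchema: Struct = Struct(List(
    Field("id", IntType),
    Field("name", fullName),
    Field("address", StringType)))
  val requiredSchema: Struct =
    Struct(List(Field("name", Struct(List(Field("first", StringType))))))
}
```

   With these example schemas, a format that supports nested pruning reads only `name.first`,
   while a format that does not reads the whole `name` struct but still skips `id` and
   `address`.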
