spark-user mailing list archives

From Chris Luby <cl...@adobe.com.INVALID>
Subject Using Spark 2.2.0 SparkSession extensions to optimize file filtering
Date Wed, 25 Oct 2017 06:38:58 GMT
I have an external catalog that has additional information on my Parquet files that I want
to match up with the parsed filters from the plan to prune the list of files included in the
scan.  I’m looking at doing this using the Spark 2.2.0 SparkSession extensions similar to
the built in partition pruning:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

and this other project, which is along the lines of what I want but hasn't been updated for 2.2.0:

https://github.com/lightcopy/parquet-index/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/IndexSourceStrategy.scala

I'm struggling to understand which type of extension I would use to do something like the above:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSessionExtensions

and whether this is the appropriate strategy for the problem at all.

Are there any examples out there for using the new extension hooks to alter the files included
in the plan?
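For what it's worth, since the built-in PruneFileSourcePartitions is an optimizer rule, the closest-matching hook appears to be injectOptimizerRule. A minimal sketch of wiring a custom rule in through SparkSessionExtensions might look like the following — note that PruneFilesByExternalCatalog and its body are hypothetical placeholders, not real Spark APIs; only SparkSessionExtensions, injectOptimizerRule, and Builder.withExtensions come from Spark 2.2.0 itself:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: would match file-source scans plus their pushed-down
// filters and replace the file listing with one pruned against an
// external catalog (analogous to what PruneFileSourcePartitions does
// for partition columns).
case class PruneFilesByExternalCatalog(spark: SparkSession)
    extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = {
    // TODO: transform LogicalRelation nodes over HadoopFsRelation here,
    // consulting the external catalog with the plan's filters.
    plan
  }
}

// injectOptimizerRule takes a SparkSession => Rule[LogicalPlan] builder;
// a case class companion's apply satisfies that shape directly.
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    extensions.injectOptimizerRule(PruneFilesByExternalCatalog)
  }
  .getOrCreate()
```

If the goal is instead to change how the physical scan is planned (as IndexSourceStrategy does), injectPlannerStrategy would presumably be the analogous hook, but I haven't verified which of the two the file-listing change fits best.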

Thanks.