spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
Date Wed, 25 Jan 2017 08:35:27 GMT

     [ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-11087.
----------------------------------
    Resolution: Cannot Reproduce


{code}
hive> create table people(name string, address string, phone string) partitioned by(age int) stored as orc;
OK
Time taken: 4.609 seconds
{code}

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = spark.sqlContext

sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Build a small partitioned ORC table equivalent to the reporter's setup.
val records = (1 to 10).map(i => Row(s"name_$i", s"address_$i", s"phone_$i", i))
val schemaString = "name address phone age"
val schema = StructType(schemaString.split(" ").map(fieldName =>
  if (fieldName.equals("age")) StructField(fieldName, IntegerType, true)
  else StructField(fieldName, StringType, true)))
val x = sc.parallelize(records)
val rDF = sqlContext.createDataFrame(x, schema)
rDF.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("age").saveAsTable("people")

// Enable ORC filter pushdown, then check the plan for PushedFilters.
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
val people = sqlContext.read.format("orc").load("spark-warehouse/people")
people.registerTempTable("people")
sqlContext.sql("SELECT * FROM people WHERE age = 3 and name = 'name_3'").explain(true)
{code}

{code}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter (('age = 3) && ('name = name_3))
   +- 'UnresolvedRelation `people`

== Analyzed Logical Plan ==
name: string, address: string, phone: string, age: int
Project [name#68, address#69, phone#70, age#71]
+- Filter ((age#71 = 3) && (name#68 = name_3))
   +- SubqueryAlias people
      +- Relation[name#68,address#69,phone#70,age#71] orc

== Optimized Logical Plan ==
Filter (((isnotnull(age#71) && isnotnull(name#68)) && (age#71 = 3)) && (name#68 = name_3))
+- Relation[name#68,address#69,phone#70,age#71] orc

== Physical Plan ==
*Project [name#68, address#69, phone#70, age#71]
+- *Filter (isnotnull(name#68) && (name#68 = name_3))
   +- *FileScan orc [name#68,address#69,phone#70,age#71] Batched: false, Format: ORC, Location: InMemoryFileIndex[..., PartitionCount: 1, PartitionFilters: [isnotnull(age#71), (age#71 = 3)], PushedFilters: [IsNotNull(name), EqualTo(name,name_3)], ReadSchema: struct<name:string,address:string,phone:string>
{code}

I can see the predicate is pushed down in `PushedFilters` (and the partition predicate in `PartitionFilters`), so I am resolving this.
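For anyone checking this on a current build, a minimal sketch of the same verification (assuming a Spark 2.x `spark-shell` with a `SparkSession` named `spark`, and the `people` table written as above):

```scala
import spark.implicits._

// Enable ORC predicate pushdown (session-level config).
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Read the ORC table back; path is the same warehouse location used above.
val people = spark.read.format("orc").load("spark-warehouse/people")

// If pushdown is working, the physical plan's FileScan node lists the
// non-partition predicates under PushedFilters and the partition predicate
// under PartitionFilters.
people.filter($"age" === 3 && $"name" === "name_3").explain(true)
```

This is only a sketch of the check, not part of the original report; the authoritative evidence is the `explain(true)` output quoted below it.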

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> ---------------------------------------------------------------------
>
>                 Key: SPARK-11087
>                 URL: https://issues.apache.org/jira/browse/SPARK-11087
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>         Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>            Reporter: patcharee
>            Priority: Minor
>
> I have an external hive table stored as a partitioned orc file (see the table schema below).
> I tried to query the table with a where clause:
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 117")
> But from the log file, with debug logging level on, the ORC pushdown predicate was not generated.
> Unfortunately my table was not sorted when I inserted the data, but I expected the ORC pushdown predicate to be generated anyway (because of the where clause).
> Table schema
> ================================
> hive> describe formatted 4D;
> OK
> # col_name            	data_type           	comment             
> 	 	 
> date                	int                 	                    
> hh                  	int                 	                    
> x                   	int                 	                    
> y                   	int                 	                    
> height              	float               	                    
> u                   	float               	                    
> v                   	float               	                    
> w                   	float               	                    
> ph                  	float               	                    
> phb                 	float               	                    
> t                   	float               	                    
> p                   	float               	                    
> pb                  	float               	                    
> qvapor              	float               	                    
> qgraup              	float               	                    
> qnice               	float               	                    
> qnrain              	float               	                    
> tke_pbl             	float               	                    
> el_pbl              	float               	                    
> qcloud              	float               	                    
> 	 	 
> # Partition Information	 	 
> # col_name            	data_type           	comment             
> 	 	 
> zone                	int                 	                    
> z                   	int                 	                    
> year                	int                 	                    
> month               	int                 	                    
> 	 	 
> # Detailed Table Information	 	 
> Database:           	default             	 
> Owner:              	patcharee           	 
> CreateTime:         	Thu Jul 09 16:46:54 CEST 2015	 
> LastAccessTime:     	UNKNOWN             	 
> Protect Mode:       	None                	 
> Retention:          	0                   	 
> Location:           	hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D	 
> Table Type:         	EXTERNAL_TABLE      	 
> Table Parameters:	 	 
> 	EXTERNAL            	TRUE                
> 	comment             	this table is imported from rwf_data/*/wrf/*
> 	last_modified_by    	patcharee           
> 	last_modified_time  	1439806692          
> 	orc.compress        	ZLIB                
> 	transient_lastDdlTime	1439806692          
> 	 	 
> # Storage Information	 	 
> SerDe Library:      	org.apache.hadoop.hive.ql.io.orc.OrcSerde	 
> InputFormat:        	org.apache.hadoop.hive.ql.io.orc.OrcInputFormat	 
> OutputFormat:       	org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat	 
> Compressed:         	No                  	 
> Num Buckets:        	-1                  	 
> Bucket Columns:     	[]                  	 
> Sort Columns:       	[]                  	 
> Storage Desc Params:	 	 
> 	serialization.format	1                   
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> ================================
> Data was inserted into this table by another Spark job:
> df.write.format("org.apache.spark.sql.hive.orc.DefaultSource").mode(org.apache.spark.sql.SaveMode.Append).partitionBy("zone","z","year","month").saveAsTable("4D")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

