spark-user mailing list archives

From patcharee <Patcharee.Thong...@uni.no>
Subject Re: sql query orc slow
Date Fri, 09 Oct 2015 16:58:12 GMT
Hi Zhan Zhang

Actually my query has a WHERE clause: "select date, month, year, hh, 
(u*0.9122461 - v*-0.40964267), (v*0.9122461 + u*-0.40964267), z from 4D 
where x = 320 and y = 117 and zone == 2 and year=2009 and z >= 2 and z 
<= 8". The columns "x" and "y" are not partition columns; the others are 
partition columns. I expected the system to use predicate pushdown, but 
when I turned on debug logging I found that no pushdown predicate was 
generated ("DEBUG OrcInputFormat: No ORC pushdown predicate").
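
As far as I understand, ORC filter pushdown is disabled by default in 
Spark 1.5 and has to be switched on explicitly; a minimal sketch, 
assuming the spark.sql.orc.filterPushdown setting applies here:

     // sketch: enable ORC predicate pushdown (off by default in Spark 1.5)
     hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
     // or at spark-submit time: --conf spark.sql.orc.filterPushdown=true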

Then I tried to set the search argument explicitly on the column "x", 
which is not a partition column:

     import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory

     // build a search argument for x = 320 and hand it to the ORC reader
     val xs = SearchArgumentFactory.newBuilder().startAnd().equals("x", 320).end().build()
     hiveContext.setConf("hive.io.file.readcolumn.names", "x")
     hiveContext.setConf("sarg.pushdown", xs.toKryo())

This time the pushdown predicate was generated in the log, but the 
results were wrong (no results at all):

15/10/09 18:36:06 INFO OrcInputFormat: ORC pushdown predicate: leaf-0 = (EQUALS x 320)
expr = leaf-0
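
For comparison, a minimal sketch of letting Spark build the search 
argument itself instead of setting sarg.pushdown by hand (assuming 
spark.sql.orc.filterPushdown is true; the warehouse path below is only 
an example):

     // sketch: express the non-partition filter through the DataFrame API so
     // Spark can translate it into an ORC SearchArgument on its own
     val d = hiveContext.read.format("orc").load("/apps/hive/warehouse/4D")
     d.filter("x = 320 and y = 117").show()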

Any ideas what is wrong with this? Why is the ORC pushdown predicate 
not applied by the system?

BR,
Patcharee

On 09. okt. 2015 18:31, Zhan Zhang wrote:
> Hi Patcharee,
>
> From the query, it looks like only column pruning will be applied. Partition
> pruning and predicate pushdown have no effect. Do you see a big IO difference
> between the two methods?
>
> The potential reason for the speed difference that I can think of may be the
> different versions of OrcInputFormat. The hive path may use NewOrcInputFormat,
> but the spark path uses OrcInputFormat.
>
> Thanks.
>
> Zhan Zhang
>
> On Oct 8, 2015, at 11:55 PM, patcharee <Patcharee.Thongtra@uni.no> wrote:
>
>> Yes, the predicate pushdown is enabled, but it still takes longer than the
>> first method.
>>
>> BR,
>> Patcharee
>>
>> On 08. okt. 2015 18:43, Zhan Zhang wrote:
>>> Hi Patcharee,
>>>
>>> Did you enable the predicate pushdown in the second method?
>>>
>>> Thanks.
>>>
>>> Zhan Zhang
>>>
>>> On Oct 8, 2015, at 1:43 AM, patcharee <Patcharee.Thongtra@uni.no> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am using Spark SQL 1.5 to query a hive table stored as partitioned ORC
>>>> files. We have about 6000 files in total and each file is about 245MB.
>>>>
>>>> What is the difference between these two query methods below:
>>>>
>>>> 1. Using query on hive table directly
>>>>
>>>> hiveContext.sql("select col1, col2 from table1")
>>>>
>>>> 2. Reading from orc file, register temp table and query from the temp table
>>>>
>>>> val c = hiveContext.read.format("orc").load("/apps/hive/warehouse/table1")
>>>> c.registerTempTable("regTable")
>>>> hiveContext.sql("select col1, col2 from regTable")
>>>>
>>>> When the number of files is large (querying all of the 6000 files), the
>>>> second case is much slower than the first one. Any ideas why?
>>>>
>>>> BR,
>>>>
>>>>
>>>>
>>>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

