hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Namit Jain <nj...@facebook.com>
Subject RE: hive query doesn't seem to limit itself to partitions based on the WHERE clause
Date Mon, 04 Oct 2010 05:00:07 GMT
What is your table hourly_fact partitioned on ?

________________________________________
From: Marc Limotte [mslimotte@gmail.com]
Sent: Friday, October 01, 2010 2:10 PM
To: hive-user@hadoop.apache.org
Subject: hive query doesn't seem to limit itself to partitions based on the WHERE clause

Hi,

>From looking at the hive log output, it seems that my job is accessing many more partitions
than it needs to?  For example, my query is something like:

INSERT OVERWRITE TABLE daily_fact
PARTITION (dt='2010-09-29')
SELECT
  20100929 as stamp,
  tagtype,
  country,
  sum(num_requests) AS num_requests
FROM
  hourly_fact HF
WHERE
  (HF.dt = '2010-09-29' AND HF.hr > '07' )
  OR (HF.dt = '2010-09-30' AND HF.hr <= '07' )
GROUP BY
  20100929, tagtype, country

Based on the WHERE clause, I would expect it to look only at partitions in the date range
2010-09-29 08:00:00 through 2010-09-30 07:00:00.  But, the log contains entries like:

10/10/01 19:13:09 INFO exec.ExecDriver: Adding input file hdfs://ny-prod-hc01:9000/home/hadoop/ala/out/hourly/dt=2010-08-15/hr=10

And many other hours outside my WHERE constraint.  I assume this means that it's processing
those directories.  The answer still comes out right, but I'm concerned about the performance.

Would appreciate some help understanding what this means and how to fix it.

Thanks,
Marc



Mime
View raw message