hive-user mailing list archives

From rk vishu <>
Subject Re: better partitioning strategy in hive
Date Sat, 18 Feb 2012 09:39:48 GMT
> Hello All,
> We have a Hive table (330 columns) partitioned by date and hour. We have 5
> years' worth of data in the table. Each hourly partition holds around 800 MB,
> so there are 43,800 partitions in total, with one file per partition.
> When we run select count(*) from table, Hive takes forever to submit
> the job. I waited 20 minutes and then killed it. If I run the query over one
> month of data, it takes a little time to submit the job, but at least Hive is
> able to get the work done.
> Questions:
> 1) First of all, why is Hive not able to even submit the job? Is it taking
> forever to fetch the list of partitions from the metastore? Getting 43K
> records should not be a big deal at all.
> 2) What are my options for improving the situation? I can think of
> changing the partitioning strategy to daily partitions instead of hourly
> ones. What would be the ideal partitioning strategy?
> 3) If we have one partition per day with 24 files under it (i.e. fewer
> partitions but the same number of files), will that improve anything, or
> will I have the same issue?
> 4) Are there any special input formats or tricks to handle this?
> 5) When I tried to insert into a different table by selecting a whole
> day's data, Hive generated a map-only job with 164 mappers, which created
> many output files. How can I force Hive to create one output file instead of
> many? Setting mapred.reduce.tasks=1 does not generate any reduce tasks. What
> can I do to achieve this?
> -RK
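
The figures quoted in the message can be checked with a few lines of arithmetic. This is only a rough sketch: the 365-day year and the binary MB-to-TB conversion are assumptions, not something stated in the message. It shows that the 43,800 figure follows from 5 years of hourly partitions, and that the daily layout from question 3 would cut the partition count by a factor of 24 while leaving the file count unchanged.

```python
# Inputs taken from the message; DAYS_PER_YEAR = 365 is an assumption
# (leap days are ignored).
YEARS = 5
DAYS_PER_YEAR = 365
HOURS_PER_DAY = 24
MB_PER_HOURLY_PARTITION = 800  # "around 800MB" per hourly partition

# Current layout: one partition (and one file) per hour.
hourly_partitions = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY

# Layout from question 3: one partition per day, 24 files under each.
daily_partitions = YEARS * DAYS_PER_YEAR
files_under_daily_layout = daily_partitions * HOURS_PER_DAY

# Approximate total data volume, converting MB -> GB -> TB with 1024.
total_tb = hourly_partitions * MB_PER_HOURLY_PARTITION / 1024 / 1024

print("hourly partitions:", hourly_partitions)
print("daily partitions:", daily_partitions)
print("files under daily layout:", files_under_daily_layout)
print("approx. total size (TB):", round(total_tb, 1))
```

Under these assumptions the daily layout has 1,825 partitions, so any cost that scales with the number of partitions shrinks, while any cost that scales with the number of files stays the same.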
