hadoop-common-user mailing list archives

From Mark Grover <mgro...@oanda.com>
Subject Re: better partitioning strategy in hive
Date Fri, 02 Mar 2012 16:20:51 GMT
Sorry about the delayed response, RK.

Here is what I think:
1) First of all, why is Hive not able to even submit the job? Is it taking forever to query
the list of partitions from the metastore? Getting 43K records should not be a big deal at all.


--> Hive is probably taking a long time to figure out which partitions it needs to query.
I ran into the same problem when I had a lot of partitions (with relatively small files).
I went back to having fewer partitions with larger files, and that fixed the problem.
Finding the balance between how many partitions you want and how big you want each partition
to be is tricky, but, in general, it's better to have fewer partitions. You also want to be
aware of the small files problem, which has been discussed in many places. Some links:
http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html
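To get a feel for the scale involved, here is a quick back-of-the-envelope count (a sketch assuming 5 years of data, as described in the original question below):

```python
# Rough partition counts: hourly vs. daily partitioning over 5 years.
# (Ignores leap days; this is just a ballpark estimate.)
YEARS = 5
DAYS = YEARS * 365

hourly_partitions = DAYS * 24  # one partition per hour
daily_partitions = DAYS        # one partition per day

print(hourly_partitions)  # 43800 -- matches the ~43K partitions reported
print(daily_partitions)   # 1825 -- 24x fewer entries for the metastore to scan
```

Every one of those partition entries has to be fetched from the metastore and turned into an input path before the job can even be submitted, which is why the hourly scheme hurts so much at query-planning time.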

2) So, in order to improve my situation, what are my options? I can think of changing the
partitioning strategy to daily partitions instead of hourly. What would be the ideal partitioning strategy?


--> I would say that's a good step forward.
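As a sketch, the move to daily partitions might look something like this in HiveQL (the table and column names here are made up for illustration; your 330 columns and storage format would go in place of the placeholders):

```sql
-- Hypothetical daily-partitioned table (names are illustrative only).
CREATE TABLE events_daily (
  user_id STRING,
  amount  DOUBLE
  -- ... remaining data columns ...
)
PARTITIONED BY (dt STRING)  -- e.g. dt='2012-02-18': one partition per day
STORED AS SEQUENCEFILE;

-- Queries still prune by day, just not by hour:
SELECT COUNT(*) FROM events_daily WHERE dt = '2012-02-18';
```

The 24 hourly files can simply land under the single dt directory for that day, so your ingest pipeline doesn't have to change much.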

3) If we have one partition per day and 24 files under it (i.e. fewer partitions but the same
number of files), will it improve anything, or will I have the same issue?

--> You probably wouldn't have the same issue; and if you still did, it wouldn't be as bad.
Since the number of partitions has been reduced by a factor of 24, Hive doesn't have to go
through as many partitions. However, queries that look for data in a particular hour on a
given day will be slower now that hour is no longer a partition column.

4) Are there any special input formats or tricks to handle this?

--> This is a separate question. The format, SerDe, and compression you use for your data
are part of the design but aren't necessarily linked to the problem in question.

5) When I tried to insert into a different table by selecting a whole day's data, Hive generated
164 mappers in a map-only job, creating many output files. How can I force Hive to create
one output file instead of many? Setting mapred.reduce.tasks=1 doesn't even generate reduce
tasks. What can I do to achieve this?

--> mapred.reduce.tasks wouldn't help because the job is map-only and has no reduce tasks.
You should look into the hive.merge.* properties; setting them in your hive-site.xml would do
the trick. You can refer to this template (https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template)
to see what properties exist.
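For reference, the merge-related settings would go into hive-site.xml along these lines (the property names are from the Hive template linked above; the values shown are illustrative, not recommendations):

```xml
<!-- Merge small output files at the end of map-only jobs. -->
<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value>
</property>
<!-- Also merge the outputs of map-reduce jobs. -->
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
</property>
<!-- Target size of the merged files, in bytes (~256MB here). -->
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value>
</property>
<!-- Merge only kicks in when the average output file is below this size. -->
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value>
</property>
```

With these set, Hive launches a conditional merge stage after the map-only job, so your 164 small outputs get consolidated into a handful of files of roughly the target size.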

Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation 

www: oanda.com www: fxtrade.com 
e: mgrover@oanda.com 

"Best Trading Platform" - World Finance's Forex Awards 2009. 
"The One to Watch" - Treasury Today's Adam Smith Awards 2009. 


----- Original Message -----
From: "rk vishu" <talk2hadoop@gmail.com>
To: cdh-user@cloudera.org, common-user@hadoop.apache.org, user@hive.apache.org
Sent: Saturday, February 18, 2012 4:39:48 AM
Subject: Re: better partitioning strategy in hive





Hello All, 

We have a Hive table partitioned by date and hour (330 columns). We have 5 years' worth of data
for the table. Each hourly partition is around 800MB,
so there are 43,800 partitions in total, with one file per partition.

When we run select count(*) on the table, Hive takes forever to submit the job. I waited
for 20 minutes and killed it. If I run it over a single month, it takes a little time to submit the job,
but at least Hive is able to get the work done.

Questions: 
1) First of all, why is Hive not able to even submit the job? Is it taking forever to query
the list of partitions from the metastore? Getting 43K records should not be a big deal at all.

2) So, in order to improve my situation, what are my options? I can think of changing the
partitioning strategy to daily partitions instead of hourly. What would be the ideal partitioning strategy?

3) If we have one partition per day and 24 files under it (i.e. fewer partitions but the same
number of files), will it improve anything, or will I have the same issue?
4) Are there any special input formats or tricks to handle this?
5) When I tried to insert into a different table by selecting a whole day's data, Hive generated
164 mappers in a map-only job, creating many output files. How can I force Hive to create
one output file instead of many? Setting mapred.reduce.tasks=1 doesn't even generate reduce
tasks. What can I do to achieve this?


-RK 






