Date: Fri, 02 Mar 2012 11:20:51 -0500 (EST)
From: Mark Grover
To: user@hive.apache.org
Cc: cdh-user@cloudera.org, common-user@hadoop.apache.org
Subject: Re: better partitioning strategy in hive

Sorry about the delayed response, RK. Here is what I think:

1) First of all, why is Hive not able to even submit the job? Is it taking forever to query the list of partitions from the metastore? Getting 43K records should not be a big deal at all??

--> Hive is probably spending that time figuring out which partitions the query needs. I ran into the same problem when I had a large number of partitions (with relatively small files). I went back to fewer partitions with larger files, and that fixed the problem. Finding the balance between how many partitions you want and how big each partition should be is tricky, but in general it's better to have fewer partitions. You also want to be aware of the small files problem. It has been discussed in many places; some links:

http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html

2) So in order to improve my situation, what are my options? I can think of changing the partitioning strategy to daily partitions instead of hourly. What should the ideal partitioning strategy be?

--> I would say that's a good step forward.

3) If we have one partition per day and 24 files under it (i.e. fewer partitions but the same number of files), will it improve anything, or will I have the same issue?

--> You probably wouldn't have the same issue, and if you still do, it wouldn't be as bad. Since the number of partitions is reduced by a factor of 24, Hive doesn't have to go through as many partitions. However, your queries that look for data in a particular hour of a given day will be slower now that hour is no longer a partition key.

4) Are there any special input formats or tricks to handle this?

--> This is a separate question. The format, SerDe and compression you use for your data are part of the design but aren't necessarily linked to the problem in question.

5) When I tried to insert into a different table by selecting a whole day's data, Hive generated 164 mappers in a map-only job, hence creating many output files. How can I force Hive to create one output file instead of many? Setting mapred.reduce.tasks=1 doesn't even generate reduce tasks. What can I do to achieve this?

--> mapred.reduce.tasks wouldn't help because the job is map-only and has no reduce tasks. You should look into the hive.merge.* properties; setting them in your hive-site.xml should do the trick. You can refer to this template (https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to see what properties exist.
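To make that concrete, here is a rough sketch tying (2) and (5) together. The table and column names (events_hourly, events_daily, event_hour, user_id, payload) are made up for illustration, and the merge values are starting points you'd tune to your data, not settings I've verified against your cluster:

-- Merge settings for (5): merge the small files a map-only INSERT produces.
-- hive.merge.smallfiles.avgsize defaults to ~16MB; with 164 mappers over
-- ~19GB your average output file is ~117MB, so you'd need to raise it
-- above that for the merge pass to actually trigger.
SET hive.merge.mapfiles=true;                -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;             -- merge outputs of map-reduce jobs too
SET hive.merge.size.per.task=256000000;      -- target size of each merged file (bytes)
SET hive.merge.smallfiles.avgsize=128000000; -- merge when avg output file is below this

-- Hypothetical stand-in for your current hourly-partitioned table:
CREATE TABLE events_hourly (
  user_id STRING,
  payload STRING
)
PARTITIONED BY (dt STRING, hr INT);

-- Daily-partitioned layout for (2): hour becomes a regular column
-- instead of a partition key.
CREATE TABLE events_daily (
  event_hour INT,
  user_id STRING,
  payload STRING
)
PARTITIONED BY (dt STRING);

-- Load one day; with the merge settings above this should leave a few
-- large files in the partition rather than one file per mapper.
INSERT OVERWRITE TABLE events_daily PARTITION (dt='2012-02-17')
SELECT hr, user_id, payload
FROM events_hourly
WHERE dt='2012-02-17';

-- Hour-level queries still work, but now scan the whole day's partition:
SELECT COUNT(*) FROM events_daily
WHERE dt='2012-02-17' AND event_hour=13;

At ~800MB per hour, a day is roughly 19GB, so 256MB merged files would leave you around 75 files per day instead of one per mapper. If you truly need a single file per day, you could in principle raise hive.merge.size.per.task past the partition size, at the cost of a much bigger merge task.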
Good luck!
Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgrover@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "rk vishu"
To: cdh-user@cloudera.org, common-user@hadoop.apache.org, user@hive.apache.org
Sent: Saturday, February 18, 2012 4:39:48 AM
Subject: Re: better partitioning strategy in hive

Hello All,

We have a Hive table partitioned by date and hour (330 columns). We have 5 years' worth of data for the table. Each hourly partition has around 800MB, so 43,800 partitions in total, with one file per partition. When we run select count(*) from the table, Hive takes forever just to submit the job; I waited for 20 min and killed it. If I run it for one month, it takes a little while to submit the job, but at least Hive is able to get the work done.

Questions:

1) First of all, why is Hive not able to even submit the job? Is it taking forever to query the list of partitions from the metastore? Getting 43K records should not be a big deal at all??

2) So in order to improve my situation, what are my options? I can think of changing the partitioning strategy to daily partitions instead of hourly. What should the ideal partitioning strategy be?

3) If we have one partition per day and 24 files under it (i.e. fewer partitions but the same number of files), will it improve anything, or will I have the same issue?

4) Are there any special input formats or tricks to handle this?

5) When I tried to insert into a different table by selecting a whole day's data, Hive generated 164 mappers in a map-only job, hence creating many output files. How can I force Hive to create one output file instead of many? Setting mapred.reduce.tasks=1 doesn't even generate reduce tasks. What can I do to achieve this?

-RK