hive-user mailing list archives

From Bejoy Ks <>
Subject Hive merge map reduce files - need help in understanding the parameters and full flow
Date Thu, 03 Nov 2011 11:19:57 GMT
Hi Experts
       I'm stuck on a problem with merging the smaller output files produced as part
of Hive jobs. To test merging I set the following parameters:

set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=8000000;
set hive.merge.smallfiles.avgsize=2000000;

My understanding is that

	* every task would give me an output file of at least 8 MB, and
	* if the average size of the final output files is less than 'hive.merge.smallfiles.avgsize' (here 2 MB),
then a merge job (a map-only job) would be run. I am assuming the average file size
is calculated as the sum of all the file sizes divided by the number of files.

But the file sizes in the output directory don't match this understanding. The
files have the following sizes:
Found 6 items
17946 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000000_0
15951584 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000001_0
131776 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000002_0
7194653 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000003_0
6434 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000005_0
12697784 2011-11-03 05:28 /u/bejoy/external_tables/test_table/dt=2011-10-02/timezone=EDT/000007_0
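
To make the average-size assumption concrete, here is a minimal Python sketch that applies it to the six file sizes in the listing above. The rule in `would_merge` is only my assumed reading of 'hive.merge.smallfiles.avgsize' (sum of sizes divided by file count, compared against the threshold), not a confirmed description of what Hive does, and the function names are hypothetical.

```python
# File sizes in bytes, copied from the directory listing above.
sizes = [17946, 15951584, 131776, 7194653, 6434, 12697784]

# Value set for hive.merge.smallfiles.avgsize in this test.
AVG_THRESHOLD = 2_000_000


def average_size(file_sizes):
    """Sum of all file sizes divided by the number of files."""
    return sum(file_sizes) / len(file_sizes)


def would_merge(file_sizes, threshold):
    """ASSUMED rule: a merge is triggered only when the average is below the threshold."""
    return average_size(file_sizes) < threshold


avg = average_size(sizes)
print(f"total bytes  : {sum(sizes)}")
print(f"average bytes: {avg:.1f}")
print(f"merge expected under the assumed rule: {would_merge(sizes, AVG_THRESHOLD)}")
```

If the check really works this way and sees these six files, their average is about 6 MB, which is above the 2 MB threshold, so no merge job would be expected even though several individual files are only a few KB.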

The file sizes vary from about 6 KB to about 15 MB. Could someone help me understand
the merge logic and why I'm getting such varying file sizes?

What I was aiming for with this merge test is that every sub-directory of my output
table (the table has multiple levels of partitions) contains only files larger
than 8 MB.
(I'm testing with 8 MB now, but in production I need to change this value to 128 MB.)
It would also be better if the files were of nearly equal size.
Am I heading in the right direction to achieve this goal?

I tried setting a few other parameters along with the previous ones, such as
-hiveconf mapred.min.split.size.per.node=8000000 
-hiveconf mapred.min.split.size.per.rack=8000000 
-hiveconf mapred.max.split.size=8000000 
(From a recent JIRA for Hive 0.8, I believe we don't need to set these explicitly -
But it still returns the same result: varying file sizes like those above.

I'm on Hive 0.7 in a CDH3u0 environment (hive-hwi-0.7.0-cdh3u0.war).

It would be great if someone could help me understand how merging of smaller
files works in Hive map reduce tasks and guide me in accomplishing the desired results with the

Thank you
