hive-user mailing list archives

From Ning Zhang
Subject Re: Hive produces very small files despite hive.merge...=true settings
Date Thu, 18 Nov 2010 21:12:52 GMT
The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used at run time
to decide whether a merge should be triggered: if the average size of the files in the partition
is SMALLER than the parameter and there is more than one file, the merge is scheduled.
Can you check whether your resulting partition also contains any big files? If a very large
file is pulling the average up, you can set the parameter large enough to cover it.
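
As a rough illustration of the rule described above (this is not Hive's actual code; the function name should_merge and the file sizes are made up for the example), the check amounts to something like the following Python sketch. It shows how a single very large file can push the average above the threshold so that no merge is scheduled:

```python
def should_merge(file_sizes, avg_size_threshold):
    """Sketch of the rule: merge only if there is more than one file
    and the average file size is below the threshold (sizes in bytes)."""
    if len(file_sizes) <= 1:
        return False
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_size_threshold


# Hypothetical partition: 3000 files of about 1 KB each.
small_only = [1024] * 3000

# Same partition plus one hypothetical 60 GB file that skews the average.
skewed = small_only + [60_000_000_000]

print(should_merge(small_only, 16_000_000))  # average is 1024 bytes
print(should_merge(skewed, 16_000_000))      # average is roughly 20 MB
print(should_merge(skewed, 32_000_000))      # larger threshold
```

Under these assumptions the first and third calls return True and the second returns False, which is why raising hive.merge.size.smallfiles.avgsize can help when a big file is present.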

Another possibility is that your Hadoop installation does not support CombineHiveInputFormat,
which is used for the new merge job. Someone previously reported that the merge was not
successful because of this. If that's the case, you can turn off CombineHiveInputFormat and use
the old HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.

On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:

> I have jobs that sample (or generate) a small amount of data from a
> large table.  At the end, I get e.g. about 3000 or more files of 1kb
> or so.  This becomes a nuisance.  How can I make Hive do another pass
> to merge the output?  I have the following settings:
> hive.merge.mapfiles=true
> hive.merge.mapredfiles=true
> hive.merge.size.per.task=256000000
> hive.merge.size.smallfiles.avgsize=16000000
> After setting hive.merge* to true, Hive started indicating "Total
> MapReduce jobs = 2".  However, after generating the
> lots-of-small-files table, Hive says:
> Ended Job = job_201011021934_1344
> Ended Job = 781771542, job is filtered out (removed at runtime).
> Is there a way to force the merge, or am I missing something?
> --Leo
