hive-user mailing list archives

From Ning Zhang <nzh...@fb.com>
Subject Re: Hive produces very small files despite hive.merge...=true settings
Date Thu, 18 Nov 2010 23:44:38 GMT
I see. If you are using dynamic partitions, both HIVE-1307 and HIVE-1622 must be in your build for merging
to take place. HIVE-1307 was committed to trunk on 08/25 and HIVE-1622 was committed on 09/13.
The simplest way is to update your Hive trunk and rerun the query. If it still doesn't work,
you could post your query and the output of 'explain <query>' and we can take a
look.

Ning

On Nov 18, 2010, at 2:57 PM, Leo Alekseyev wrote:

> Hi Ning,
> For the dataset I'm experimenting with, the total size of the output
> is 2mb, and the files are at most a few kb in size.  My
> hive.input.format was set to default HiveInputFormat; however, when I
> set it to CombineHiveInputFormat, it only made the first stage of the
> job use fewer mappers.  The merge job was *still* filtered out at
> runtime.  I also tried set hive.mergejob.maponly=false; that didn't
> have any effect.
> 
> I am a bit at a loss what to do here.  Is there a way to see what's
> going on exactly using e.g. debug log levels?..  Btw, I'm also using
> dynamic partitions; could that somehow be interfering with the merge
> job?..
> 
> I'm running a relatively fresh Hive from trunk (built maybe a month ago).
> 
> --Leo
> 
> On Thu, Nov 18, 2010 at 1:12 PM, Ning Zhang <nzhang@fb.com> wrote:
>> The settings look good. The parameter hive.merge.size.smallfiles.avgsize is used
>> to determine at run time whether a merge should be triggered: if the average size of the files
>> in the partition is SMALLER than the parameter and there is more than 1 file, the merge should
>> be scheduled. Can you check whether you also have any big files in your resulting partition?
>> If the merge is skipped because of a very large file, you can set the parameter large enough.
>> 
>> Another possibility is that your Hadoop installation does not support CombineHiveInputFormat,
>> which is used for the new merge job. Someone previously reported that the merge was not successful
>> because of this. If that's the case, you can turn off CombineHiveInputFormat and use the old
>> HiveInputFormat (though slower) by setting hive.mergejob.maponly=false.
>> 
>> Ning
>> On Nov 17, 2010, at 6:00 PM, Leo Alekseyev wrote:
>> 
>>> I have jobs that sample (or generate) a small amount of data from a
>>> large table.  At the end, I get e.g. about 3000 or more files of 1kb
>>> or so.  This becomes a nuisance.  How can I make Hive do another pass
>>> to merge the output?  I have the following settings:
>>> 
>>> hive.merge.mapfiles=true
>>> hive.merge.mapredfiles=true
>>> hive.merge.size.per.task=256000000
>>> hive.merge.size.smallfiles.avgsize=16000000
>>> 
>>> After setting hive.merge* to true, Hive started indicating "Total
>>> MapReduce jobs = 2".  However, after generating the
>>> lots-of-small-files table, Hive says:
>>> Ended Job = job_201011021934_1344
>>> Ended Job = 781771542, job is filtered out (removed at runtime).
>>> 
>>> Is there a way to force the merge, or am I missing something?
>>> --Leo
>> 
>> 
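
For reference, the run-time rule Ning describes for hive.merge.size.smallfiles.avgsize can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the thread, not Hive's actual code; the function name `should_merge` and its arguments are hypothetical.

```python
def should_merge(file_sizes, avg_size_threshold):
    """Sketch of the rule from the thread (hypothetical helper, not Hive code).

    file_sizes: sizes in bytes of the files in one output partition.
    avg_size_threshold: value of hive.merge.size.smallfiles.avgsize.
    """
    # A single file (or none) leaves nothing to merge.
    if len(file_sizes) <= 1:
        return False
    # Merge only when the average file size is below the threshold.
    average = sum(file_sizes) / len(file_sizes)
    return average < avg_size_threshold


# About 3000 files of roughly 1 KB each, with the threshold from the thread.
many_small = [1000] * 3000
# The same small files plus one very large file that pulls the average up.
with_large_file = [1000] * 2999 + [60_000_000_000]

print(should_merge(many_small, 16_000_000))
print(should_merge(with_large_file, 16_000_000))
```

Under this rule alone, thousands of 1 KB files qualify for a merge with the 16000000 threshold, while a single very large file in the same partition can raise the average above the threshold and suppress it. That is why Ning suggests checking for big files and, if needed, raising the parameter.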

