hive-user mailing list archives

From Igor Tatarinov <>
Subject Re: single output file per partition?
Date Wed, 21 Aug 2013 18:12:02 GMT
Using a single bucket per partition seems to create a single reducer, which
is too slow.
I've tried forcing a merge of small files, but that didn't work. I still got
multiple output files.
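
For reference, the small-file merge settings in question are usually along
these lines. This is only a sketch: the thread does not say which settings
were actually used, and the size values below are illustrative assumptions,
not recommendations.

```sql
-- Ask Hive to run an extra merge step after map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Trigger the merge when the average output file is smaller than this
-- (bytes; illustrative value).
SET hive.merge.smallfiles.avgsize=128000000;

-- Target size of each merged file (bytes; illustrative value). Partitions
-- larger than this can still end up with more than one file.
SET hive.merge.size.per.task=1000000000;
```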

Creating a temp table and then "combining" the multiple files into one
using a simple select * is the only option that seems to work. It's odd
that I have to create the temp table but I don't see a workaround.
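
A minimal sketch of that two-step approach, for anyone finding this later.
The table names (staging_tbl, final_tbl, source_tbl), the columns a and b,
and the filter are hypothetical placeholders, and setting hive.input.format
to CombineHiveInputFormat is my reading of "combineinputformat=true" from
the original question. Whether each partition really ends up as one file
also depends on the split size limits, so treat this as an outline rather
than a guaranteed recipe.

```sql
-- Both inserts use dynamic partitioning on (x, y, z).
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Step 1: run the real query into a staging table that has the same
-- partitioning as the output table. This may write many files per partition.
INSERT OVERWRITE TABLE staging_tbl PARTITION (x, y, z)
SELECT a, b, x, y, z
FROM source_tbl
WHERE a IS NOT NULL;  -- placeholder for the real filter

-- Step 2: copy the staging table into the output table with a map-only
-- SELECT *, letting the combine input format group the small files.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

INSERT OVERWRITE TABLE final_tbl PARTITION (x, y, z)
SELECT * FROM staging_tbl;
```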

On Wed, Aug 21, 2013 at 8:51 AM, Stephen Sprague <> wrote:

> hi igor,
> lots of ideas there!  I can't speak for them all but let me confirm first
> that "cluster by X into 1 bucket" didn't work?  I would have thought that
> would have done it.
> On Tue, Aug 20, 2013 at 2:29 PM, Igor Tatarinov <> wrote:
>> What's the best way to enforce a single output file per partition?
>> PARTITION (x,y,z)
>> SELECT ...
>> FROM ...
>> WHERE ...
>> I tried adding CLUSTER BY x,y,z at the end, thinking that sorting would
>> force a single reducer per partition, but that didn't work. I still got
>> multiple files per partition.
>> Do I have to use a single reduce task? With a few TB of data that's
>> probably not a good idea.
>> My current idea is to create a temp table with the same partitioning
>> structure. Insert into that table first and then select * from that table
>> into the output table. With combineinputformat=true that should work, right?
>> Or should I make Hive merge output files instead? (using hive.merge.mapfiles)
>> Will that work with a partitioned table?
>> Thanks!
>> igor
