hive-user mailing list archives

From "Moore, Douglas" <>
Subject Re: Hive Insert overwrite creating a single file with large block size
Date Sat, 10 Jan 2015 01:31:10 GMT
There's nothing intrinsically wrong with a large output file that's in a splittable format
such as Avro. Are your downstream queries too slow?
Are you using some kind of compression?

Within an Avro file there are blocks of Avro objects. Each block can be compressed, and splits
can occur only on a block boundary.
I haven't found out how to set those block sizes from within Hive; we've never had to.
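If compression is what you're after, it can at least be switched on per query. A minimal sketch, assuming your Hive build's Avro output format honors the standard Avro codec properties:

  SET hive.exec.compress.output=true;   -- compress the files this query writes
  SET avro.output.codec=snappy;         -- Avro container codec; deflate is the usual default
  SET avro.mapred.deflate.level=5;      -- only consulted when the codec is deflate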

Generally speaking, you will get one file per reducer. To get more reducers, define bucketing
on your table and tune the number of buckets until the files come out at the size you want.
For your bucket column, pick a high-cardinality column that you are likely to join on.
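A sketch of what that could look like (table, columns, and bucket count are all made up and would need tuning for your data):

  -- Hypothetical schema; bucketing drives one output file per bucket
  CREATE TABLE events_bucketed (
    user_id    BIGINT,
    event_time TIMESTAMP,
    payload    STRING
  )
  CLUSTERED BY (user_id) INTO 32 BUCKETS   -- tune the bucket count to hit your target file size
  STORED AS AVRO;                          -- Hive 0.14+ shorthand for the Avro SerDe

  SET hive.enforce.bucketing=true;         -- pre-2.0 Hive needs this so the insert honors the buckets
  INSERT OVERWRITE TABLE events_bucketed
  SELECT user_id, event_time, payload
  FROM events_src;                         -- hypothetical source table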

Let us know how it turns out.

- Douglas

From: Slava Markeyev <>
Reply-To: <>
Date: Fri, 9 Jan 2015 17:04:08 -0800
To: <>
Subject: Re: Hive Insert overwrite creating a single file with large block size

You can control block size by setting dfs.block.size. However, I think you might be asking
how to control the size of and number of files generated on insert. Is that correct?
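For example, per session (the value below is 256 MB and purely illustrative; newer Hadoop releases spell the property dfs.blocksize):

  SET dfs.block.size=268435456;   -- HDFS block size for the files this query writes
  INSERT OVERWRITE TABLE avro_target
  SELECT * FROM avro_source;      -- hypothetical table names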

On Fri, Jan 9, 2015 at 4:41 PM, Buntu Dev <> wrote:
I have a bunch of small Avro files (<5 MB) and a table defined against those files. I created
a new table and did an 'INSERT OVERWRITE' selecting from the existing table, but did not find
any option to provide the file block size. It currently creates a single file per partition.

How do I specify the output block size during the 'INSERT OVERWRITE'?
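Roughly what I'm doing (table and partition names are made up):

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  CREATE TABLE avro_compacted LIKE avro_small_files;    -- same schema as the existing table
  INSERT OVERWRITE TABLE avro_compacted PARTITION (dt)  -- 'dt' stands in for the partition column
  SELECT * FROM avro_small_files;                       -- currently ends up as one file per partition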



Slava Markeyev | Engineering | Upsight

