hive-user mailing list archives

From saquib khan <>
Subject Re: Controlling Number of small files while inserting into Hive table
Date Mon, 26 Jun 2017 02:14:37 GMT
Please remove me from the user list.

On Sun, Jun 25, 2017 at 5:10 PM Db-Blog <> wrote:

> Hi Arpan,
> Include the partition column in the DISTRIBUTE BY clause of the DML; it will
> generate only one file per day. Hope this resolves the issue.
> "insert into 'target_table' select a,b,c from x where ... distribute by
> (date)"
> PS: Backdated processing will generate additional file(s): one file per
> load.
> Thanks,
> Saurabh
> Sent from my iPhone, please avoid typos.
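
A minimal sketch of the DISTRIBUTE BY suggestion above, written out as a full dynamic-partition insert. The column name load_date and the filter value are hypothetical placeholders (the original query elides its WHERE clause and uses "date", which can clash with the DATE type keyword in some Hive versions); target_table, x, and a, b, c are taken from the thread. It assumes target_table is partitioned by load_date.

```sql
-- Assumption: target_table is partitioned by a column named load_date
-- (hypothetical name; the thread calls it "date").
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column must come last in the SELECT list.
INSERT INTO TABLE target_table PARTITION (load_date)
SELECT a, b, c, load_date
FROM x
WHERE load_date = '2017-06-25'   -- placeholder filter, not from the thread
DISTRIBUTE BY load_date;         -- rows with the same load_date go to the same reducer
```

Because every row of a given load_date is routed to one reducer, each partition is written by a single task per insert. The trade-off is that a one-day load is funnelled through one reducer, which should be acceptable at the data volumes described in the question but may not suit much larger loads.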
> On 22-Jun-2017, at 11:30 AM, Arpan Rajani <>
> wrote:
> Hello everyone,
> I am sure many of you have faced a similar issue.
> We do "insert into 'target_table' select a,b,c from x where .." kind of
> queries for a nightly load. This insert goes in a new partition of the
> target_table.
> Now the concern is: *this insert loads hardly any data* (I would say
> less than 128 MB per day) *but the data is fragmented into 1200 files*. Each
> file is a few kilobytes. This is slowing down performance. How can we
> make sure this load does not generate a lot of small files?
> I have already set *hive.merge.mapfiles* and *hive.merge.mapredfiles* to
> true in custom/advanced hive-site.xml. But the load job still writes
> 1200 small files.
> I know why it is 1200: this is the value of the maximum number of
> reducers/containers available in one of the hive-sites. (I do not think it is
> a good idea to change this number with a cluster-wide setting, as this can
> affect other jobs that can use the cluster when it has free containers.)
> *What other approaches or settings would keep the Hive insert from taking
> 1200 slots and generating lots of small files?*
> I also have another question, which is partly contrary to the above (this is
> relatively less important):
> When I reload this table by creating a new table from a select on the target
> table, the newly created table does not contain too many small files. The
> number of files in the newly created table drops from 1200 to ±50. What
> could be the reason?
> PS: I did go through
> Regards,
> Arpan
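
As an illustration of the settings discussed in the question above, the same properties can be set for a single session rather than cluster-wide; SET commands issued in a Hive session apply only to that session. This is a sketch under assumptions, not a verified fix: the numeric values are illustrative, and whether the merge step actually runs depends on the execution engine, file format, and Hive version in use.

```sql
-- Session-scoped: does not modify hive-site.xml or affect other users' jobs.
SET hive.merge.mapfiles=true;                 -- merge small files from map-only jobs
SET hive.merge.mapredfiles=true;              -- merge small files from map-reduce jobs
SET hive.merge.tezfiles=true;                 -- only relevant when running on Tez
SET hive.merge.smallfiles.avgsize=134217728;  -- illustrative: merge if average output file < 128 MB
SET hive.merge.size.per.task=268435456;       -- illustrative: target size of merged files (256 MB)

-- Cap the number of reducers for this session's queries only
-- (illustrative value; tune to the actual data volume).
SET hive.exec.reducers.max=8;
```

Running these before the nightly insert limits the change to that one job, which addresses the concern about changing the reducer limit for the whole cluster.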
