hive-user mailing list archives

From Db-Blog <mpp.databa...@gmail.com>
Subject Re: Controlling Number of small files while inserting into Hive table
Date Sun, 25 Jun 2017 21:09:55 GMT
Hi Arpan,
Include the partition column in the DISTRIBUTE BY clause of the DML statement. All rows for
a given day then go to the same reducer, so each load writes only one file per day. Hope this
resolves the issue.

> "insert into 'target_table' select a,b,c from x where ... distribute by (date)"
> 
PS: Backdated processing will generate additional file(s) in the affected partition: one file per load.
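
A minimal sketch of the full statement, assuming the target table is partitioned by a column named load_date and is loaded with a dynamic-partition insert. The column name is hypothetical: the thread only calls it "date", which can clash with the DATE keyword in recent Hive versions. The filter value is purely illustrative and stands in for the elided "where ...".

```sql
-- Session-scoped settings; the cluster-wide hive-site.xml is not touched.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- load_date is a hypothetical partition column. For a dynamic-partition
-- insert it must be the last column in the SELECT list.
INSERT INTO TABLE target_table PARTITION (load_date)
SELECT a, b, c, load_date
FROM x
WHERE load_date = '2017-06-21'   -- illustrative filter, not from the thread
DISTRIBUTE BY load_date;         -- rows of one day all hash to the same reducer
```

Because every row for a given day lands on one reducer, that reducer typically writes a single file into the day's partition, whatever the number of reducers the job was given.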

Thanks,
Saurabh


> On 22-Jun-2017, at 11:30 AM, Arpan Rajani <arpan.rajani@whishworks.com> wrote:
> 
> Hello everyone,
> 
> 
> 
> I am sure many of you have faced a similar issue.
> 
> We run queries of the form "insert into 'target_table' select a,b,c from x where .." for a
> nightly load. Each insert goes into a new partition of target_table.
> 
> Now the concern is: this insert loads hardly any data (I would say less than 128 MB
> per day), but the data is fragmented into 1200 files, each only a few kilobytes. This is
> slowing down performance. How can we make sure this load does not generate a lot of small files?
> 
> I have already set hive.merge.mapfiles and hive.merge.mapredfiles to true in the custom/advanced
> hive-site.xml, but the load job still writes 1200 small files.
> 
> I know where 1200 comes from: it is the maximum number of reducers/containers configured
> in one of the hive-sites. (I do not think it is a good idea to change this number cluster-wide,
> as that could affect other jobs that use the cluster when it has free containers.)
> 
> What other approach or settings would keep the Hive insert from taking 1200 slots and
> generating lots of small files?
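
One possible angle, sketched under assumptions: properties issued with SET apply only to the current session, so the nightly job could be tuned without a cluster-wide change. The values below are illustrative rather than recommendations, and hive.merge.tezfiles only matters if the job runs on Tez, which the thread does not state.

```sql
-- Cap the number of reducers for this session only (illustrative value).
SET hive.exec.reducers.max=8;

-- Merge small output files at the end of the job.
SET hive.merge.mapfiles=true;      -- map-only jobs
SET hive.merge.mapredfiles=true;   -- map-reduce jobs
SET hive.merge.tezfiles=true;      -- Tez DAGs (assumption: Tez is the engine)

-- A merge pass starts when the average output file size is below this (bytes).
SET hive.merge.smallfiles.avgsize=134217728;
-- Target size of the merged files (bytes).
SET hive.merge.size.per.task=268435456;
```

Running these at the top of the nightly load script keeps the change local to that job, so other jobs on the cluster keep their usual limits.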
> 
> I also have another question, which is partly contrary to the above (this is relatively
> less important):
> 
> When I reload this table by creating a new table with a select on the target table, the
> newly created table does not contain nearly as many small files: its file count
> drops from 1200 to roughly 50. What could be the reason?
> 
> PS: I did go through http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
> 
> 
> 
> Regards,
> Arpan
> 
